Recent Advances in Data Engineering for Networking

This tutorial paper examines recent advances in data engineering, focusing on aspects of network management and orchestration. We provide a comprehensive analysis of standardization efforts as well as platform development activities related to data engineering driven network design. We then focus on the integration aspects of the data engineering ecosystem and telecommunication networks. The results of our tutorial investigation show that despite various efforts towards standardization and network management and orchestration platforms, there is still a significant gap in applying recent developments in the evolving data engineering world to the telecommunication domain. New advanced functionalities in data engineering as well as clear separations between the building blocks of data engineering pipelines within the proposed standardized architectures have been overlooked or not explored in detail by the standardization or platform development bodies in the telecommunication domain. Therefore, at the end of the paper, we discuss these gaps and research challenges in the context of future development processes for data engineering-driven network design and applications of data engineering concepts in telecommunication networks. We also propose several recommendations for early adoption of these technologies and frameworks in telecommunication infrastructures and platforms.


I. INTRODUCTION
Over the past few decades, telecommunication operators and service providers have experienced exponential growth in connectivity. At the same time, there has been an increased demand for massive connectivity, huge amounts of data, and in some cases ultra-low latency communications. As a result, network complexity has increased, placing a tremendous burden on telecommunication providers to manage and orchestrate the network. To address the highly complex issues that such larger and highly integrated networks pose in the design, analysis, deployment, and management phases, recent advances in data science and engineering technologies in both academia and industry have encouraged the adoption of various Artificial Intelligence (AI)/Machine Learning (ML) platforms and frameworks at different layers of the telecommunication network infrastructure. In addition, the advanced techniques used by large companies, such as Google, Facebook, Netflix, Apple, and Amazon, to leverage data have led to major breakthroughs in the business landscape in terms of improving products, services, and customer experiences. As a result of increasing computing power and advances in AI/ML algorithms from IT and cloud giants, new data processing capabilities are being introduced that have disrupted entire industries, including telecommunications. Therefore, improving network intelligence has been the focus of interest in the telecommunication world in recent years with many diverse and compelling use cases [1], [2], [3], [4], [5], [6], [7]. (Note that the detailed use cases and their classifications within each Standards Developing Organization (SDO) and alliance body are given in Section XI.) Advanced algorithms (e.g., neural network-based Deep Learning (DL) algorithms) and computational patterns used in AI/ML platforms can help discover valuable information hidden in vast amounts of data (images, videos, datasets, etc.).
This will enable better decision making in the development of telecommunication infrastructure products, services and applications, while opening up innovative business opportunities. The recently evolving AI/ML ecosystem, its open source community [8], partnerships between the public private industry and academia [9], and the results from the labs of large cloud and IT giants like Facebook AI Research (FAIR) * , Google AI † and Microsoft AI ‡ etc. will support and enhance data-driven intelligence by gaining real-time (from streamed data) and long-term (from stored data) insights to understand what is going on in their infrastructure and develop competitive advantages. This will help develop better and more personalized services for telecommunication providers.
However, a major challenge for all telecommunication providers in the world is to leverage recent advances and find current technologies to develop data science and engineering platforms. This is because of the various challenges in managing infrastructure and utilizing vast computing resources. Telecommunication companies serve millions of users who depend on their services for their daily needs. In order to maintain these critical services without interruption, telecommunication providers must be prepared to obtain the most relevant and accurate information so that they can make informed decisions and take the necessary actions. For this reason, the design, implementation and maintenance of systems that can process incoming telecommunication-related raw data sources and produce high-quality, reliable information to support data analytics and AI/ML systems is critical and falls within the scope of data engineering.
In [10], the authors bridge the gap between DL and mobile and wireless networking research by presenting recent surveys and showing a typical pipeline of an application-level mobile data processing system. The paper in [11] provides an overview of DL for wireless communication networks. The authors in [12] explore the applications of DL algorithms in wireless networks for different network layers, including the physical layer, the data link layer and the routing layer. The paper in [13] gives an overview on unsupervised ML approaches that are applied in the networking domain. In [14], a framework for data-driven networks for proactive optimisation and related technologies in online data analytics for 5G systems are explored. The paper in [15] provides an overview of evolving ML algorithms applied to self-organizing cellular networks. The authors in [16] focus on the possible solutions on how ML can help support the targeted 5G network requirements. The papers in [17], [18] provide an overview of ML algorithms and their applications in SDNs. The authors in [19] focus on the use of AI and ML for the design and operation of Beyond 5G networks. The paper in [20] gives an overview of ML in wireless communications and presents some unresolved issues. The paper in [21] gives an introduction to the use of data science and describes steps towards knowledge discovery in the context of wireless networks. In [22], various software architecture practices that exist in the context of ML-based software systems are described. The authors in [23] provide an overview of related research on AI-based green communication and how it can be used to accelerate applications in 6G.
* Facebook AI Research, https://research.fb.com/publications/
† Google AI Research, https://research.google/pubs/
‡ Microsoft AI Research, https://www.microsoft.com/en-us/research/
The position paper in [24] presents an agenda for addressing the challenges associated with analyzing network data through AI/ML so that it can be naturally adopted in the network domain. In [25], a comprehensive overview of ML solutions for 5G cellular networks is given. However, the focus of these contributions is on data science aspects rather than on data engineering frameworks and building End-to-End (E2E) data engineering pipelines. There is also a lack of detailed analysis of the applications of these technologies in larger areas/domains of E2E network management and orchestration.
From a network management perspective, the paper in [26] addresses ML solutions that can be used as a tool for implementing network management, automation and self-organization from a 5G perspective. Self-healing solutions for emerging and future mobile networks are explored in [27]. Networking issues in Big Data are addressed in [28]. The authors in [29] discuss the role of AI techniques in the emerging concept of Zero-touch network and Service Management (ZSM). Although these papers focus on network and service management issues, the emphasis is on applying data science concepts to network management rather than exploring data engineering aspects. The survey paper in [30] brings together both Big Data Analytics (BDA) and Network Traffic Monitoring and Analysis (NTMA) research, and focuses specifically on approaches and technologies that can manage the big NTMA data. However, this research focuses only on aspects of network traffic monitoring and analysis, concentrates on a limited number of available Big Data tools, and lacks analysis of network management and orchestration aspects. The survey paper in [39] focuses on Data Center Networking and divides it into infrastructure and operations. However, the focus of this paper is on the operational and infrastructure aspects of data center networking rather than merging it with data engineering concepts and developments.
The authors in [31] conducted a comprehensive survey of wired, wireless, and hybrid data centers to assess whether they can meet the requirements of a real-time analytic Internet of Things (IoT) network. However, the analysis focuses only on real-time analytics and does not examine the network management and orchestration aspects. The paper in [32] provides an overview of the role of BDA in designing a variety of data communication networks. The authors in [33] aim to design an industrial data platform that provides higher real-time performance and compression ratio for industrial data acquisition and processing. However, these works lack details on building an E2E data engineering pipeline and do not unify these data engineering components with network management and orchestration. The authors in [34] provide an overview of Big Data tools, but the focus is on manufacturing applications rather than telecommunication networks. The review paper in [8] provides a comprehensive overview of the latest AI software developments with comparisons and trends in the development and usage process. In [35], a detailed survey of distributed learning frameworks such as federated learning, federated distillation, distributed inference, and multi-agent reinforcement learning over real-world wireless communication networks is presented along with the rationale for their deployment over wireless networks. However, these papers mainly focus on data science developments rather than data engineering aspects by presenting only the latest developments in DL, distributed learning or machine learning frameworks and libraries. The paper in [36] provides an overview of popular Big Data frameworks for processing big data and compares them using batch and iterative workloads. However, the authors do not address how they relate to recent developments in network management and the orchestration aspects of telecommunications. 
In [37], the authors present an example BDA pipeline and architecture called LambdaTel for telecommunication enterprises. Similar to our six-stage categorization of the data engineering pipeline, the life cycle of BDA is divided into four sequential stages namely data acquisition, data preprocessing, data storage and data analytics in [38] for wireless networks. On the other hand, these pipelines do not focus on the application of these sequential phases to network management and orchestration problems in industry, academia, and standardization bodies. Moreover, they only cover a few components compared to the data engineering pipeline proposed here (e.g., details on the data management & orchestration component are missing).
In contrast to the survey and review papers, this paper provides a tutorial view of the recent data engineering solutions to drive state-of-the-art developments in telecommunications, including papers in this research area that fill in the gaps of the previous survey and tutorial papers on BDA and network management & orchestration. Despite the previous work on the potential applications of Big Data infrastructures in telecommunication networks, there is still a lack of a general E2E data engineering architecture for telecommunication and networking domains. In particular, it is unclear to practitioners, researchers and developers in this emerging field how to systematically collect, ingest, analyse, process, visualize, and manage telecommunication data for network management and orchestration. Therefore, in this tutorial article, we explain how data engineering can be used for data-driven analysis and autonomous optimization in future telecommunication networks. In this paper, we mainly focus on current data engineering concepts and technologies related to network management and orchestration, which are essential pillars for creating data-driven intelligent support for telecommunication operators. Finally, we point out gap analyses, major challenges, and research issues that need to be addressed in the future.
Data engineering concepts have emerged to develop scalable and resilient data warehouse systems or tools that can support complex analytics across AI/ML infrastructures, platforms, or products desired by data scientists, ML engineers, or product teams. Data engineering can be seen as a superset of Business Intelligence (BI), data warehousing, DataOps, data management, data architecture, orchestration, and software engineering. Depending on complex business and technical requirements at large scale, data engineers are expected to continuously evolve data pipelines and processing models. An integrated environment for data acquisition, storage, analysis, monitoring, and visualization, together with a scalable computing environment in which data-driven applications and analytics tools can be deployed, is of interest to the data engineering field. Data analytics enables organizations to gain key insights into the user experience using data pipelines. These pipelines are a key component for continuously leveraging the evolving data-driven ecosystem.
Historically, data engineering has its roots in the trending topic of Big Data, when the Hadoop project code was first released in 2006 [42]. Hadoop became popular by providing distributed Big Data storage and processing capabilities. The system consisted of four main modules: Hadoop core, Hadoop Distributed File System (HDFS), Map-Reduce [43], and Yet Another Resource Negotiator (YARN) [44]. After the introduction of Hadoop, many Apache projects emerged that built their core functionalities on top of Hadoop and from which the Hadoop ecosystem evolved. Hadoop is still relevant today and continues to be used as the core foundation for many data engineering projects. In parallel with this movement, many organizations today rely on similar design principles and advanced algorithms on distributed systems to run data storage, messaging, management, and compute functions on multiple servers in parallel.
VOLUME 4, 2016
The movement and processing of data can be achieved by creating streaming pipelines or data pipelines. In data pipelining, multiple data processing modules are chained together and the output of each module is used as input to the next module. Data engineering toolboxes enable organizations to process huge amounts of data reliably and quickly while gaining access to better, cheaper, and more accessible data analytics software and services. On the other hand, each newly introduced technology or component within a data engineering pipeline brings its own configuration, protocols, metrics, and tools, adding complexity to the overall platform.
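As a rough illustration of the chaining described above, the following minimal Python sketch passes the output of each module to the next one. The stage names (`parse`, `drop_invalid`, `flag_slow`), the record fields, and the 50 ms threshold are hypothetical and chosen only for illustration; they do not correspond to any specific framework discussed in this paper.

```python
from functools import reduce


def parse(lines):
    """Turn 'cell_id,latency_ms' text lines into records (hypothetical format)."""
    for line in lines:
        cell_id, latency = line.strip().split(",")
        yield {"cell_id": cell_id, "latency_ms": float(latency)}


def drop_invalid(records):
    """Filter out records with a negative latency."""
    for record in records:
        if record["latency_ms"] >= 0:
            yield record


def flag_slow(records, threshold_ms=50.0):
    """Enrich each record with a flag; the threshold is an assumption."""
    for record in records:
        yield {**record, "slow": record["latency_ms"] > threshold_ms}


def run_pipeline(source, stages):
    """Chain stages so that each stage consumes the previous stage's output."""
    return list(reduce(lambda data, stage: stage(data), stages, source))


raw = ["cell-1,12.5", "cell-2,-1", "cell-3,80.0"]
result = run_pipeline(raw, [parse, drop_invalid, flag_slow])
print(result)
```

Because every stage is a generator, records flow through the chain one at a time, which loosely mirrors how streaming pipelines avoid materializing intermediate datasets.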
Today's challenges and requirements for a data engineering solution include processing millions of tasks per second, sub-millisecond latency, stateful computation (via functions that store data across processing items or events), and ensuring fault tolerance (e.g., by checkpointing state to recover state and positions in the stream). Most toolboxes are designed for scalability and are deployed in distributed environments where aspects of data distribution, replication, and coordination become important differentiators. At the same time, much of the software is becoming commoditized through open source software and packages, while the supply of data engineering solutions from cloud providers such as Microsoft, Amazon, and Google is increasing. As the data toolbox matures and the data engineering ecosystem blossoms, innovative solutions such as Online Analytical Processing (OLAP) and scalable machine learning analytics will become more tangible to larger communities and enterprises. Note that the connections between each module are only loosely represented and there may be multiple interfaces between these modules depending on the use case. Therefore, there may be multiple pipelines based on Service Level Agreements (SLAs) or non-functional requirements. One pipeline may be suitable for real-time notifications, while another may be more suitable for more relaxed requirements. Furthermore, in some scenarios, such as IoT networks, raw data (e.g., temperature, humidity) collected via the Data Connection module can be integrated with the Data Visualization component either via mobile applications or web user interfaces for direct visualization in a dashboard. In other scenarios, further data analysis and processing using recent advances in AI/ML algorithms may be required (e.g., in the case of high quality prediction, statistical analysis or root cause analysis [45]), which takes place between the Data Connection and Data Visualization modules.

B. OVERVIEW AND TUTORIAL OBJECTIVES
In other scenarios that require high reliability of data (e.g., in Ultra Reliable Low Latency Communications (URLLC) services of 5G networks), the Data Ingestion and Data Analysis modules (for ultra-low latency real-time event processing) need to be embedded in the data engineering pipeline. More details on the modules and the corresponding distributed computing frameworks/landscapes available today to run each of these modules are provided later in the paper. In addition to the open source frameworks, we have also listed vendor-specific tools/frameworks for each module whose main advantage over in-house setup is the ease of set-up and maintenance support for the data engineering applications. On the other hand, cost can be considered as one of the main disadvantages when these applications are deployed on a large scale with vendor-specific tools.
The main contributions of this paper can be summarized as follows:
• Our goal is to provide a comprehensive and thorough overview of the recent advances and major technologies used in the context of data engineering. This allows for easy understanding and comparison of studies within each area of the data engineering landscape.
• Our goal is to link the capabilities of the data engineering ecosystem to future telecommunication systems. Unlike previous work on data engineering, this paper also explores the necessary link that needs to be established between recent advances in data engineering and traditional telecommunication ecosystems in the context of network management and orchestration.
• The paper provides a comprehensive discussion of challenges, gaps and future directions in the convergence of data engineering and network management and orchestration, and also identifies the research directions that arise.
• To drive research in data engineering solutions for future network management and orchestration solutions, we also discuss possible solution methods for each of the above challenges.
The remainder of the paper is arranged as follows. Section II introduces Data Connection frameworks and the possible data sources. Section III presents frameworks for Data Ingestion. Section IV discusses the latest frameworks for Data Processing and Analysis. Section V presents the latest frameworks for Data Storage. Section VI presents frameworks for Data Monitoring and Visualization. Section VII presents frameworks for Data Management and Orchestration. Section VIII discusses the relationship of data engineering projects with data science frameworks and AI/ML platforms used in the industry. Section IX gives an overview of network lifecycle management and orchestration. Section X provides an overview of standardization efforts in network management and orchestration and how they can be related to data engineering frameworks.
Section XI gives an overview of data engineering use cases in telecommunication networks. Section XII provides the gap analysis, challenges and future directions. Finally, Section XIII presents the conclusions of the paper.
Representational state transfer (REST)-APIs are mainly used to import data from many third-party APIs into the Big Data processing cluster. REST benefits from the ideas of stateless servers and structured access to resources. An API gateway is a software component that acts as an entry point into a system. It is responsible for allowing multiple APIs, backend systems or microservices to be accessed reliably and securely by end users. Kafka Connect, which is part of Apache Kafka [46], is used to enable data flows between Kafka and various types of systems such as message queues, Hadoop, Spark, Flink, TensorFlow, databases, object stores or flat files. It supports pluggable connectors (which are essentially jar files that can be downloaded from Confluent Hub for example). Apache Flume [47] is a service for efficiently collecting, aggregating and moving large amounts of data in a distributed and reliable manner. GraphQL * is a query language for APIs, designed as an alternative to REST-APIs, that allows a variety of different frameworks to connect from the client side during client-server communication.
The advantage of GraphQL is that it prevents over- and under-fetching of data compared to REST-APIs. Falcor † is another open source library used to retrieve data that may reside in a client's memory or over the network on the server. Apache NiFi ‡ is mainly used to automate data movement between different systems. It provides a web-based user interface for creating, monitoring or transforming data streams (e.g., converting Comma Separated Values (CSV) files into individual JavaScript Object Notation (JSON) records).
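The CSV-to-JSON transformation mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the transformation itself using the standard library; it does not use or represent the NiFi API, and the column names (`cell_id`, `rsrp_dbm`) are hypothetical.

```python
import csv
import io
import json


def csv_to_json_records(csv_text):
    """Convert CSV text with a header row into one JSON string per data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(dict(row)) for row in reader]


# Hypothetical sample: two measurement rows with made-up column names.
sample = "cell_id,rsrp_dbm\ncell-1,-95\ncell-2,-101\n"
for record in csv_to_json_records(sample):
    print(record)
```

Note that `csv.DictReader` yields all values as strings; a real pipeline would typically apply a schema to cast fields such as `rsrp_dbm` to numeric types.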
Vendor-specific tools: Amazon's API Gateway (for connecting to devices), Azure Event Hub and Data Gateway (a proxy that provides on-premise access to data), Google Cloud Platform (GCP)'s Cloud Dataflow, and other connector platforms such as Fivetran § , Stitch ¶ , and Matillion || are some example tools and frameworks for data connectivity.

B. DATA SOURCES TO CONNECT
In telecommunication operators, there are various sources of information from which data can be retrieved for further analysis by data engineering frameworks. Generally data is available in three different systems: Information Technology (IT), network and application systems. Some of the typical data sources in mobile networks are also described in [48].
• Application/service data is data from products and services (e.g., online mobile payments, online music and e-wallet applications or vehicle tracking, power grid information and health services, other value-added services, etc.) provided by telecommunication operators that contain user data (e.g., user access modes, addresses, timestamps, business preferences, consumption habits, customer care agent's data). Note that the underlying structures of this data are complex (either unstructured (text, images, videos), structured or semi-structured), so targeted data engineering pipelines for different data types are required depending on the use case.

III. DATA INGESTION FRAMEWORKS
Along with the advent of 5G, IoT and mobile sensor devices, powerful messaging platforms are needed that can ingest and cache all traffic for later processing. To reliably publish and subscribe to events, highly available, fault-tolerant ingestion pipelines are required that can serve as the backbone of the streaming data infrastructure. To realize this, data ingestion frameworks enable replication and partitioning of data across nodes in the cluster. The data ingestion module acts as an intermediary or multi-tenant data hub that connects incoming data from different data sources to diverse sinks and ensures that data is not lost during this movement. The source systems can be any vendor application, database or event. Data Ingestion is generally used to move data between external systems and Big Data clusters or data lakes (e.g., based on Hadoop) for batch or stream processing (e.g., for filtering and mapping operations). A stream is an unbounded and continuously updated data set. In general, it consists of sequences of key-value pairs that are ordered, replayable, and fault-tolerant. Streaming data can be ingested into clusters in real-time or near real-time. Once loaded, the data can be used for later processing (e.g., with Apache Spark) or storage (e.g., with HDFS). When communicating during data ingestion, two types of message delivery patterns can occur [49]. One is queuing and the other is streaming.
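To make the notions of partitioning and replication across cluster nodes more concrete, the following minimal Python sketch maps a record key to a partition and then chooses the nodes that hold copies of that partition. The CRC32 hash, the consecutive-node replica placement, and the function names are assumptions made for illustration; real ingestion frameworks use their own partitioners and placement strategies.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a record key to a partition with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def replicas_for(partition, num_nodes, replication_factor):
    """Place copies of a partition on consecutive nodes (illustrative placement only)."""
    if replication_factor > num_nodes:
        raise ValueError("replication factor cannot exceed the number of nodes")
    return [(partition + i) % num_nodes for i in range(replication_factor)]


# Hypothetical key-value events: all events of one device land in one partition.
events = [("device-17", 21.5), ("device-42", 19.0), ("device-17", 21.7)]
for key, value in events:
    p = partition_for(key, num_partitions=6)
    print(key, value, "-> partition", p, "on nodes", replicas_for(p, 4, 3))
```

Hashing on the key keeps all events of the same key in one partition, which is what allows per-key ordering to be preserved while the overall stream is spread across the cluster.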

A. QUEUING
In queuing, the order is not important because all events from the message queue are transferred from the device/user to a system. Some examples are payments and transaction processing, where they are mostly used for mission critical systems. Message queues provide asynchronous communication protocols where the sender and receiver do not need to interact with the message queues at the same time and are common features of data ingestion frameworks. Platforms and protocols such as RabbitMQ [50], JMS, AMQP and others enable asynchronous data integration between multiple systems by acting as a central hub. They are best suited for message queuing applications, such as real-time transaction services with zero tolerance to data loss. Therefore, consistency and durability are the most important features of these systems. On the other hand, these systems may suffer from scalability issues during peak loads (e.g., RabbitMQ is not designed as a distributed system). This is also true for web service/API based architectures due to their synchronous communication [51], [52].
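The asynchronous behaviour of a message queue, where sender and receiver do not have to be active at the same time, can be illustrated with Python's standard `queue` module. This is only an in-process sketch; the payment payloads are hypothetical, and a broker such as RabbitMQ would additionally persist messages and serve remote clients.

```python
import queue
import threading


def producer(q, payments):
    """Sender: enqueue messages and return without waiting for any receiver."""
    for payment in payments:
        q.put(payment)


def consumer(q):
    """Receiver: drain whatever is waiting in the queue at a later point in time."""
    processed = []
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            break
        processed.append(item)
        q.task_done()
    return processed


payments = [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 25.00}]
q = queue.Queue()

sender = threading.Thread(target=producer, args=(q, payments))
sender.start()
sender.join()  # the sender has finished before the receiver even starts

print(consumer(q))
```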
Apache Kafka [46], on the other hand, is a popular and widely used framework for real-time processing of streaming data and is best suited for managing data pipelines to move large amounts of data between different systems. As a common event bus that decouples producers and consumers, it can handle hundreds of thousands of events per second, is scalable and widely used in the industry (used by many companies such as Twitter, Netflix, etc.). Therefore, integration with other technologies and frameworks in creating data pipelines has also become easy. For example, data sharing is possible with many interfaces including file-based systems, real-time messaging, REST web service, Structured Query Language (SQL) or Not Only SQL (NoSQL) databases, data lakes or data warehouses. In addition to simple data sharing, messaging and integration, it can also be used for data storage and processing. However, Kafka is not suitable for storing and processing large files such as images and videos as a whole. To ensure the reliability of streaming requests, Table 1 provides the three different consistency guarantees available for data ingestion or stream processing. Note that Message Queuing Telemetry Transport (MQTT), a lightweight publish-subscribe messaging transport protocol, provides a real-time and reliable messaging service and has also defined similar QoS levels. QoS level 0, level 1, and level 2 correspond to at most once, at least once, and exactly once delivery, respectively. QoS level 0 acts as a best-effort delivery mechanism, QoS level 1 waits for the receiver's PUBACK packet and retransmits the message if the acknowledgement is not received, and QoS level 2 uses a sequence of four packets to guarantee exactly once reception.
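The difference between the three delivery guarantees can be sketched with a small simulation. The code below is a simplified model under stated assumptions: the first transmission of some messages and the first acknowledgement of others are lost, "exactly once" is approximated by retries plus receiver-side deduplication on a message identifier, and all names (`Receiver`, `send_all`, `lost_messages`, `lost_acks`) are hypothetical. It is not the MQTT handshake nor Kafka's implementation.

```python
class Receiver:
    """Receiver that can optionally suppress duplicates by message id."""

    def __init__(self, deduplicate):
        self.deduplicate = deduplicate
        self.seen_ids = set()
        self.delivered = []

    def receive(self, msg_id, payload):
        if self.deduplicate and msg_id in self.seen_ids:
            return  # duplicate caused by a retransmission; ignore it
        self.seen_ids.add(msg_id)
        self.delivered.append(payload)


def send_all(messages, receiver, retry, lost_messages, lost_acks):
    """Send messages over a lossy link; only the FIRST attempt can be lost."""
    for msg_id, payload in messages:
        attempt = 0
        while True:
            attempt += 1
            first = attempt == 1
            arrived = not (first and msg_id in lost_messages)
            if arrived:
                receiver.receive(msg_id, payload)
            acked = arrived and not (first and msg_id in lost_acks)
            if acked or not retry:
                break  # stop on acknowledgement, or immediately if no retries


messages = [(1, "a"), (2, "b"), (3, "c")]
scenarios = {
    "at most once": dict(retry=False, deduplicate=False),
    "at least once": dict(retry=True, deduplicate=False),
    "exactly once": dict(retry=True, deduplicate=True),
}
for name, cfg in scenarios.items():
    rx = Receiver(deduplicate=cfg["deduplicate"])
    send_all(messages, rx, cfg["retry"], lost_messages={2}, lost_acks={3})
    print(name, rx.delivered)
```

In this model, message 2 is lost without retries, message 3 is duplicated when the sender retries after a lost acknowledgement, and deduplication removes that duplicate.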

B. STREAMING
Stream processing engines/platforms (such as Spark Streaming, Apache Flink) enable strictly ordered and exclusive message passing while allowing computational logic to be applied to message streams [53]. In a streaming-based communication pattern, the ordering of events is important so that behaviour can be analyzed based on the sequence of ordered events. Streaming is most suitable for stateful applications and OLAP-oriented use cases (e.g., BI, dashboards, ML, etc.). Some examples are user behaviour analysis, anomaly detection and web traffic log analysis. In streaming, the loss of data can be tolerated in some parts as long as the correct ordering of events is maintained. This is because stream processing systems provide fault tolerance and retries by rewinding the stream and replaying each event from the point of failure or occurrence of an error [54]. Apache Storm [55] is a distributed stream processing framework designed for both batch and distributed streaming data processing. Libraries of processing frameworks such as Spark Streaming (an extension of Spark's API [56] for stream processing) and Apache Flink (using the DataStream API for bounded (finite in size) and unbounded (infinite in size) streams) also have data ingestion capabilities. Spark Streaming uses a micro-batch architecture for continuous data processing that can ingest data from Apache Flume, TCP sockets, or Kafka producers. Spark relies on exactly-once processing to ensure correctness.
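The rewind-and-replay recovery described above can be sketched as follows. This is a minimal single-process model, assuming a replayable list of events, a per-key counter as the operator state, and a checkpoint that stores the state together with the stream offset; the function name and parameters are hypothetical and do not reflect the internals of Spark, Flink, or Storm.

```python
def run_with_checkpoints(events, checkpoint_every, crash_at=None):
    """Count events per key, checkpointing (offset, state) and replaying after a crash."""
    counts = {}
    checkpoint = {"offset": 0, "counts": {}}
    offset = 0
    crashed = False
    while offset < len(events):
        if crash_at is not None and offset == crash_at and not crashed:
            crashed = True
            # Simulated failure: in-memory state is lost, so restore the last
            # checkpoint and rewind the stream to the checkpointed offset.
            counts = dict(checkpoint["counts"])
            offset = checkpoint["offset"]
            continue
        key = events[offset]
        counts[key] = counts.get(key, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = {"offset": offset, "counts": dict(counts)}
    return counts


events = ["a", "b", "a", "c", "a", "b", "c"]
print(run_with_checkpoints(events, checkpoint_every=3))
print(run_with_checkpoints(events, checkpoint_every=3, crash_at=5))
```

Because state and offset are saved together, the events between the last checkpoint and the failure are reprocessed exactly once against the restored state, so the result matches the failure-free run.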
Apache Pulsar * has recently emerged as a competing technology to Kafka. Pulsar has similar features to Kafka and acts as a distributed pub-sub messaging system with some differences in architecture, performance, and features. Pulsar can also integrate with full-fledged stream processing frameworks like Spark and Flink. On the other hand, Pulsar offers more flexibility as it uses a layered design compared to the monolithic design of Kafka. For example, Kafka uses ZooKeeper (with plans to remove it in future releases with a new controller metadata quorum) and the Kafka broker itself (a two-tier architecture that tightly couples storage and serving), while Pulsar requires three distributed systems: the Pulsar broker, ZooKeeper, and Apache BookKeeper (three-tier architecture), and also uses RocksDB for certain storage tasks. The computations are performed on a broker in one tier and the stateful storage is managed in another tier (BookKeeper). Therefore, Pulsar aims to separate serving and storage in different tiers and provide both queuing (messaging) and streaming capabilities in a single system. To achieve this, Pulsar offers four types of subscriptions, depending on the application's ordering and consumption scalability requirements:
1) Exclusive subscription: Only a single consumer may attach to the subscription.
2) Failover subscription: Multiple consumers may attach to the same subscription, but only one of them receives messages at a time; another consumer takes over if the active one fails.
3) Shared subscription: Messages are delivered to multiple consumers in a round-robin fashion. This allows the number of consumers to scale beyond the number of partitions.
4) Key_Shared subscription: Multiple consumers can join the same subscription, and messages with the same key are delivered to the same consumer. It allows higher scaling as well as order guarantees at the key level.
* https://pulsar.apache.org/, accessed December-2021
For messaging/streaming applications, exclusive and failover subscription modes are most suitable for scenarios where partition level order guarantees are needed, while queuing applications can use Shared and Key_Shared subscriptions [57].
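The dispatch behaviour of the four subscription types can be approximated with a short, self-contained simulation. The sketch below does not use the Pulsar client library; the `dispatch` function, the CRC32-based key hashing, and the choice of the first listed consumer as the active one are assumptions made purely for illustration.

```python
import zlib
from itertools import cycle


def dispatch(messages, consumers, mode):
    """Assign (key, value) messages to consumers according to a subscription mode."""
    assignment = {c: [] for c in consumers}
    if mode == "exclusive":
        if len(consumers) != 1:
            raise ValueError("an exclusive subscription allows a single consumer")
        for _, value in messages:
            assignment[consumers[0]].append(value)
    elif mode == "failover":
        # Only the active consumer (assumed: the first one) receives messages;
        # the remaining consumers are standbys.
        for _, value in messages:
            assignment[consumers[0]].append(value)
    elif mode == "shared":
        round_robin = cycle(consumers)
        for _, value in messages:
            assignment[next(round_robin)].append(value)
    elif mode == "key_shared":
        for key, value in messages:
            index = zlib.crc32(key.encode("utf-8")) % len(consumers)
            assignment[consumers[index]].append(value)
    else:
        raise ValueError(f"unknown subscription mode: {mode}")
    return assignment


msgs = [("user-1", "m0"), ("user-2", "m1"), ("user-1", "m2"), ("user-2", "m3")]
for mode in ("failover", "shared", "key_shared"):
    print(mode, dispatch(msgs, ["c1", "c2"], mode))
```

The simulation shows the trade-off discussed above: the shared mode spreads load evenly but interleaves messages of one key across consumers, whereas the key-shared mode keeps each key on one consumer and thus preserves per-key order.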

C. USE CASES AND REQUIREMENTS
During data ingestion, data enrichment, filtering, aggregation, transformation, etc. can also be performed for better use in sinks. Stream-based architectures have been shown to provide a better architectural foundation for many use cases in the industry, including fraud detection [58]. In the telecommunications domain, the authors in [37] have provided several use cases for the application of streaming. Within the traditional telecommunication landscape, there are several components within the Operations Support Systems (OSSs), Business Support Systems (BSSs), and OSS-BSS integration modules. The OSS/BSS landscape covers various functionalities across mediation, billing, CRM, e-business, data warehouse, service assurance, provisioning, etc. [59]. Therefore, the data ingestion frameworks described in this paper (such as Apache Kafka) can be used in various ways in these OSS, BSS and OSS-BSS integration development projects, e.g., in critical business applications such as payment or fraud detection applications.
Application requirements may vary depending on whether processing is real-time (less than 10 ms), near-real-time (less than a few minutes), or high-throughput (processing data on the order of PB/day in a single cluster). At the same time, data ingestion can be performed with different components: Batch processing jobs managed by data orchestration engines can be used for complex processing and deep analytics. On the other hand, streaming jobs (e.g., Spark Streaming, which consumes data from Apache Kafka streams) can be used for fast feedback and anomaly detection, sync-async, publish-subscribe, change data capture, or REST services that expect data for the data cluster. Various changes can be made in the data ingestion frameworks to meet these different application requirements. For example, for applications that require low latency, the write-once-read-many (WORM) model can be used, where a dataset generated or copied from the source can be used multiple times later for different analyses [60].
When ingesting data streams, it should be possible to query the data as soon as it enters the system to enable immediate actions and insights. For this reason, various preprocessing analytics such as cleansing, profiling, aggregation or enrichment of the dataset can improve query response time. In the WORM model, querying can be faster if an additional cost is paid at data entry, for example if each map-reduce job is run before the analytic queries. Some fast query systems such as Apache Druid [61] and HiveQL [62] are based on this principle of incurring additional cost during data ingestion. For this reason, before executing queries, each data segment must be ingested using some map-reduce ingestion jobs. This is also related to the mutation rate requirements of data. For data coming from transaction applications, extensive writes must be performed via Online Transaction Processing (OLTP) (which incurs additional costs) and for data coming from OLAP systems, extensive reads can be performed but only small writes.
In the telecommunication industry, there are many use cases where open source or proprietary/vendor-specific frameworks for data ingestion are already in use, e.g. in monitoring, middleware, event hub platforms, and business applications (billing, supply chain management, BSS, B2B products, etc.). Communication between these various entities in a large enterprise via message brokers has certain advantages over REST-based APIs. In fact, there can be data ingestion issues when scaling applications over REST APIs. As the application grows and more services are added, complex relationships between services arise, requiring API connections to be remapped. This also delays the workflow development and implementation process. Moreover, due to the synchronous communication structure of REST-based APIs, when there is an influx of data, i.e., sudden bursts of requests, some of the services may operate slowly or even be unavailable, affecting the reliability of the whole application [51]. For this reason, message queuing (especially for streaming applications) can be more robust compared to REST APIs and scale with the requirements of each application [52]. In the case of a microservices-based architecture, pub/sub systems may be an appropriate way to inform microservices of potential events (requests) that may require a reply from the corresponding receiver (response).
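The decoupling argument can be made concrete with a minimal sketch of a publish/subscribe broker. It is an in-memory stand-in written for illustration only; the class `InMemoryBroker`, the helper `drain`, and the topic name used in the usage note are hypothetical and do not correspond to the API of Kafka or any other product.

```python
import queue
from collections import defaultdict


class InMemoryBroker:
    """Toy message broker: each subscriber gets its own buffer per topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of queues

    def subscribe(self, topic):
        """Register a new subscriber and return its private queue."""
        subscriber_queue = queue.Queue()
        self._subscribers[topic].append(subscriber_queue)
        return subscriber_queue

    def publish(self, topic, event):
        """Enqueue the event for every subscriber and return immediately.

        The producer never waits for a consumer, so a burst is buffered
        instead of overloading a synchronous request/response endpoint.
        """
        for subscriber_queue in self._subscribers[topic]:
            subscriber_queue.put(event)


def drain(subscriber_queue):
    """Consume everything currently buffered, at the consumer's own pace."""
    items = []
    while not subscriber_queue.empty():
        items.append(subscriber_queue.get())
    return items
```

For example, a billing service and a fraud detection service could both subscribe to a hypothetical `"cdr"` topic; a burst of published events is then buffered for each of them independently, and adding a third consumer requires no change on the producer side.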

D. OTHER AVAILABLE TOOLS
In addition to the Apache Kafka and Apache Pulsar streaming/messaging tools described above, there are other data ingestion frameworks that can work at scale [63], [64], [65], [66], [67]. Logstash and Beats are the core components of Elastic Stack that are used for ingesting data from any data source and transferring it to Elasticsearch [63]. Apache Heron is a real-time stream processing engine that has proven itself at Twitter for big data [64]. Apache Gobblin * is a distributed data integration framework that simplifies data integration for both streaming and batch data ecosystems. It is used to extract, transform, and load large amounts of data from various sources, such as databases and REST APIs, onto Hadoop, and also simplifies data ingestion, organizational replication, and lifecycle management. Apache RocketMQ † is a distributed messaging and streaming platform. Redis PubSub [65] is an implementation of the messaging system provided by Redis. Apache Sqoop [66] is used to import/export data from/to MySQL databases to/from HDFS in specific file formats. Apache Camel [67] is an open source integration framework designed as message-oriented middleware that provides interfaces for the Enterprise Integration Patterns. A good comparison of some of the most recent message queuing systems (Kafka, RabbitMQ, RocketMQ, ActiveMQ, and Pulsar) can be found in [68]. An extensive list of existing streaming frameworks and applications can also be found in [69].
* https://gobblin.apache.org/, accessed December-2021
Vendor specific tools: Amazon's Kinesis Streams ‡ (used for buffering purposes and working similarly to Kafka), Facebook's Puma, Swift, and Stylus stream processing systems [49], Google's Cloud Pub/Sub § (data ingestion and messaging for event-driven systems as well as streaming analytics), and Azure's Event Hubs ¶, IoT Hub ||, and Stream Analytics ** are the respective data ingestion services offered by the IT and cloud giants. Splunk †† collects and analyzes machine-generated data on a large scale. As messaging systems, Oracle Enterprise Messaging Service and IBM Websphere MQ are other examples of event buses for processing asynchronous data streams.

IV. DATA ANALYSIS AND PROCESSING FRAMEWORKS
After data is ingested, it must be analyzed to gain insight so that resources, services, or assets can be managed more efficiently. The data processing and analysis phase ensures that this is achieved through various tasks such as data transformation, enrichment, filtering/deduplication, mapping, cleansing, updating state, joining, grouping, defining windows, aggregation, staging, integrity checking and combining streaming and batch data. Data analysis and processing frameworks generally fall between data ingestion and data storage frameworks. All data processing tools first consume data, then process it, and finally produce results. Batch processing involves processing large amounts of data, usually stored in data lakes or warehouses. In this case, the data does not change much or at all. Moreover, SLAs are loose in terms of processing time for a given dataset.
In batch processing, the results are obtained slowly but accurately. In stream processing on the other hand, the data is constantly changing (e.g., Twitter or Facebook feeds or weather sensor data in real-time) and is not stored in large data warehouses. At the same time, SLAs are more rigid and data processing should be done in real-time to serve other processes, i.e., no shifts in data processing are allowed as in batch processing. In streaming, the results can be fast depending on the requirements, but sometimes they need to be approximate because of the trade-off between low latency analysis and efficient use of computational resources over a distributed infrastructure. For this reason, the paradigm of approximate computing is used in the literature for streaming analytics, which allows trading output accuracy for computational power by analyzing only a subset of the input [70].

Consistency Guarantee: At Least Once
Description:
- The message is pulled once or multiple times and processed each time (duplicates are likely to be received).
- During transmission, the message cannot be dropped or lost, hence receipt is guaranteed.
- No data is missed.
- Duplicate messages can be allowed and ignored after further processing considering the timestamps of the records.
- Typical usage at scale.
Example Applications:
- Scaling service applications.
- Tracking a user's (or device's) location via cell phone (or vehicular) records.

Consistency Guarantee: At Most Once
Description:
- The message is pulled once (no duplicates are allowed), hence the message may or may not be received.
- Applicable for use cases where possible missing data is tolerable.
- Typical usage at scale.
Example Applications:
- Periodic sensor data (e.g. temperature, humidity) sent in high frequency intervals.
- Scaling service applications.

Consistency Guarantee: Exactly Once
Description:
- The message is pulled one or more times but, in contrast to at least once, processed only once (no duplicates are allowed).
- Like at least once, receipt is guaranteed and no missing data is allowed. Hence, there is only one complete successful process.
- Especially useful for scenarios where exactly once processing is crucial even when a broker or a function instance fails.
- System complexity increases with the exactly once mode of operation, which introduces latency.
- The benefits of exactly once processing should be weighed against the drawback of additional latency cost.
- Some data stream processing frameworks such as Flink DataStream and Spark Streaming can enable exactly once processing at the expense of increased latency when there are deep pipelines and high volumes of input.
- Distributed transactions at scale are difficult.
Example Applications:
- Other guaranteed lossless event processing applications (image or video upload and processing in the cloud).
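The difference between the delivery guarantees can be sketched with a small simulation of a consumer that crashes once and restarts. This is an illustrative model under simplifying assumptions, not the behaviour of any specific broker: the function names `deliver` and `deduplicate`, the single-crash model, and the use of message ids for deduplication are all hypothetical.

```python
def deliver(messages, crash_on, ack_before_processing):
    """Replay a log of (msg_id, payload) through a consumer that crashes once.

    ack_before_processing=True  models at-most-once (commit, then process).
    ack_before_processing=False models at-least-once (process, then commit).
    Returns the (msg_id, payload) pairs the application actually handled.
    """
    handled = []
    offset = 0        # committed position in the log
    crashed = False
    while offset < len(messages):
        msg_id, payload = messages[offset]
        if ack_before_processing:
            offset += 1                      # position committed first
            if msg_id == crash_on and not crashed:
                crashed = True               # crash: this message is lost
                continue
            handled.append((msg_id, payload))
        else:
            handled.append((msg_id, payload))  # processed first
            if msg_id == crash_on and not crashed:
                crashed = True               # crash before commit: redelivery
                continue
            offset += 1
    return handled


def deduplicate(handled):
    """Drop redelivered messages by id, keeping the first occurrence."""
    seen, unique = set(), []
    for msg_id, payload in handled:
        if msg_id not in seen:
            seen.add(msg_id)
            unique.append((msg_id, payload))
    return unique
```

In this sketch, at-most-once loses the message that was in flight during the crash, at-least-once handles it twice, and combining at-least-once delivery with deduplication yields an effectively-once result at the cost of keeping extra state.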

A. DATA ANALYTIC CATEGORIZATION
Data analysis can be categorized into four dimensions [73].
• Descriptive analytics is used to understand what has happened in the past, including the recent past [74]. In this type of analytics, the data is examined statistically. Example: An alarm monitoring system has received an unusually high number of alarms at a telecommunication provider's network operation center. Descriptive analytics can help understand what is going on in the infrastructure by using real data and corresponding statistics (alarm timestamp, location, severity, etc.).
• Diagnostic analytics is used to understand why something happened in the past [75]. It goes a step beyond descriptive analytics. For example, if many alarms have occurred in a telecommunication system, diagnostic analytics helps to understand the root cause of the alarms, such as a power failure or a connection error.
• Predictive analytics is used to predict what will happen or is most likely to happen in the future [76]. In an ML system, the model is trained and later used for inference in production systems. For example, an increase in alarms in network monitoring systems can be predicted in order to take preventive measures well before an outage occurs, e.g., in transport links.
• Prescriptive analytics is used to recommend actions to influence outcomes [76]. It goes a step beyond predictive analytics. For example, if we know that alarms will increase due to a transport network failure, network operators can enable redundant transport links to route user traffic to a different routing path.
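The four categories can be illustrated on the alarm example with a deliberately simple sketch. The alarm counts, cause labels, the naive linear extrapolation, and the threshold rule are all assumptions chosen for illustration; a production system would use proper statistical or ML models.

```python
from collections import Counter
from statistics import mean

# Hypothetical data: alarms per hour and the recorded cause of each incident.
hourly_alarms = [12, 15, 14, 30, 45, 60]
alarm_causes = ["power", "link", "power", "power", "link", "power"]


def descriptive(counts):
    """What happened: summary statistics of the observed alarms."""
    return {"mean": mean(counts), "max": max(counts)}


def diagnostic(causes):
    """Why it happened: the most frequent recorded cause."""
    return Counter(causes).most_common(1)[0][0]


def predictive(counts):
    """What will happen: naive linear extrapolation of the last step."""
    return counts[-1] + (counts[-1] - counts[-2])


def prescriptive(forecast, threshold):
    """What to do: a simple rule that maps the forecast to an action."""
    if forecast > threshold:
        return "enable redundant transport link"
    return "no action"
```

Each function consumes the output or data of the previous stage, which mirrors how the categories build on one another.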

B. MODEL DEPLOYMENT
The frameworks developed in this module also complete the full cycle of the ML process through tasks such as feature extraction, distributed Graphics

C. STREAM PROCESSING ENGINES
One of the earliest versions of data analysis and processing frameworks was based on Hadoop's Map-Reduce paradigm. As the ecosystem continued to evolve, new and more advanced frameworks emerged that extended Map-Reduce, which was based solely on batch processing, to streaming applications as well. In streaming applications, data is continuously generated from multiple sources and frequently updated. Moreover, in most cases, the data sources send their information simultaneously. Therefore, it is important to analyze streaming data sources and derive, track, and interpret important streaming data quality metrics. In streaming applications, each data item must be accurately processed to preserve its sequential order in time as well as its relationship with other data sources. A complex streaming job with multiple input data streams and stages usually has many internal states and performs more remote calls to other REST endpoints and databases, e.g., for joining operations.
Streaming can occur in various forms, e.g., socket-based streaming (connecting to a socket to get data), file-based streaming (connecting to a file stream to get data), or data ingestion tool-based streaming (e.g., connecting to Kafka or Kinesis to get queued messages reliably, i.e., with acknowledgements). Note that analyzing and processing data in real time enables organizations to make decisions proactively, rather than reactively. Some of the applications of streaming data include scenarios such as sending real-time notifications to users via email or push notifications when events change in their user account, sending sensor data from vehicles, industrial equipment or machinery for predictive maintenance purposes, tracking the geolocation of user phones or vehicles for transportation and supply-chain purposes, monitoring patients' vital bodily functions for health purposes, or collecting streaming data to create personalized products and services for users of an online retail company. For these various reasons, streaming data analytics is becoming increasingly important for most industries.
Recently, many frameworks have also emerged that are suitable for both batch (offline data) and streaming (real-time events) analytics (Apache Spark [56], Apache Flink [78], Apache Beam ‡ ‡ , etc.) while operating under a unified data processing layer to perform common computations and reduce data infrastructure complexity. These frameworks are known for their extremely scalable data processing capabilities that can analyze petabytes of data while extending to hundreds of instances. They can also provide different levels of abstraction. On the other hand, each of these frameworks has its own strengths and weaknesses.
Apache Spark [56] is a distributed computing platform that can run standard Extract-Transform-Load (ETL) processes on Big Data in batch and streaming mode using a set of APIs. It is capable of processing jobs 10 times faster than its Map-Reduce counterpart. The Spark core API is able to load data from different platforms (e.g. Kafka, Flink, Kinesis, HDFS, Amazon S3, Azure Blob, etc.) and write it back to other platforms (e.g. Hadoop, Elasticsearch, Cassandra, etc.) after processing the data. The Spark ecosystem is also extensive and includes a number of libraries that can run on top of the Spark core and can easily implement and support various workloads and Spark programs such as BigDL [79] (which supports inference, transfer learning, and distributed training), TensorFlowOnSpark § § (which supports inference, ML pipelines, and distributed training), Deeplearning4j ¶ ¶ (which supports inference, transfer learning and distributed training), and
‡ ‡ https://beam.apache.org/, accessed December-2021
§ § https://github.com/yahoo/TensorFlowOnSpark, accessed December-2021
¶ ¶ https://deeplearning4j.org/, accessed December-2021

Repartitioning
Under this operation, the number of partitions can be increased or decreased. This is a Spark feature (transactions.repartition(n), where n is the desired number of partitions) and is an expensive operation since it executes a full shuffle.

Rebalancing (Round-robin partitioning)
Partitions elements in a round-robin manner to create equal load per partition. This is a Flink feature (stream.rebalance()).

Coalesce
This operation reduces the number of partitions and avoids a full shuffle. This is a Spark feature (transactions.coalesce(n), where n is the desired number of partitions).

Rescaling
Round-robin partitioning of elements to a subset of downstream operations. This is a Flink feature (stream.rescale()).

Custom partitioning
This operation uses a user-defined partitioner to select the target task for each element in case different performance is required from the built-in APIs. This is a Flink feature (e.g. stream.partitionCustom(partitioner, 0)).

Broadcasting
It broadcasts all elements into every partition. This is a Flink feature (stream.broadcast()).

Random partitioning
Partitions elements randomly according to a uniform distribution. This is a Flink feature to avoid malfunctioning of the system (stream.shuffle()).

[...] supports bounded data sets for batch use cases. DataStream APIs allow for larger and more sophisticated operations compared to DataSet APIs. Flink is designed to process streaming data quickly and accurately with its stateful stream processing capabilities.

D. WINDOWING OPERATION AND PARTITIONING
There are two main window types that can be applied to the data stream (see Table 3). For streaming applications, time spans are critical. The window concepts (tumbling, sliding and session window) that are mentioned above are closely related to the time characteristics, which are another important concept for data engineering pipelines and frameworks. In any data processing, the following key concepts of timing are important:
• Event time: represents the time at which an event is generated at the source of the system.
• Ingestion time: represents the time at which the event enters the dataflow.
• Processing time: represents the time when an event was processed/observed in the system.
In an ideal scenario, the interval between event time and processing time should be zero (real-time applications). However, this may not be the case for various reasons (e.g. network congestion, software logic, etc.). Note that depending on the use case, time characteristics may be important (e.g. fraud detection, most billing applications).
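The effect of the chosen time characteristic on windowing can be shown with a minimal sketch. It is written in plain Python rather than against the Spark or Flink APIs; the function names, the dictionary field names `event_time` and `processing_time`, and the integer timestamps in seconds are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, size, time_field):
    """Count events per non-overlapping window of `size` seconds.

    `time_field` selects which timestamp drives the assignment,
    e.g. "event_time" or "processing_time".
    """
    counts = defaultdict(int)
    for event in events:
        window_start = (event[time_field] // size) * size
        counts[window_start] += 1
    return dict(counts)


def sliding_window_starts(timestamp, size, slide):
    """Return the start times of all sliding windows containing `timestamp`.

    A window starting at s covers the half-open interval [s, s + size).
    """
    starts = []
    start = (timestamp // slide) * slide
    while start > timestamp - size:
        starts.append(start)
        start -= slide
    return sorted(starts)
```

With a 10-second tumbling window, an event generated at second 9 but processed at second 11 falls into the window starting at 0 under event time and into the window starting at 10 under processing time, which is why delayed events can change aggregates such as billing or fraud counters.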
Unlike Apache Spark Streaming, which has time-based window criteria, Flink provides more powerful window semantics (both time- and data-driven window options) and allows working with data-based or custom window criteria. Apache Samza [81], developed by LinkedIn, is a stream processing framework for real-time analytics and is well integrated with Kafka. Apache Beam is used to build an execution pipeline that can implement both batch and stream processing under a unified programming model. For example, in the case of model serving, Beam pipelines can be executed on multiple runners such as Spark or Flink, which act as Beam's distributed processing backends. So, in this respect, it is more of a software development kit than a stream processing engine. Frameworks such as Apache Livy * enable a REST service for executing Spark jobs via third-party applications.
In order to process large amounts of data, the data must first be partitioned so that it can be processed in parallel. For this reason, partitioning is a key concept. It can be done either in memory or on storage disk to improve performance and reduce cost. Most data processing frameworks perform partitioning in memory (e.g. Spark or Flink). There are several ways to manage a partition in Spark or Flink. Some of them are listed in Table 4. Spark recommends 2-3 tasks per CPU in a given cluster to keep the level of parallelism high. For this reason, the recommended number of partitions in a given system is simply the number of CPUs × [2 or 3].
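The partition-count rule and a few of the partitioning strategies can be sketched in plain Python. The functions below only mimic the semantics on in-memory lists; they are hypothetical helpers and not the Spark or Flink APIs, and the way `coalesce` merges partitions is a simplification.

```python
def recommended_partitions(num_cpus, tasks_per_cpu=2):
    """Number of CPUs times 2 or 3 tasks per CPU."""
    return num_cpus * tasks_per_cpu


def round_robin(elements, num_partitions):
    """Rebalancing: spread elements evenly, one after another."""
    partitions = [[] for _ in range(num_partitions)]
    for index, element in enumerate(elements):
        partitions[index % num_partitions].append(element)
    return partitions


def broadcast(elements, num_partitions):
    """Broadcasting: every partition receives a copy of all elements."""
    return [list(elements) for _ in range(num_partitions)]


def coalesce(partitions, num_partitions):
    """Reduce the partition count by merging whole partitions.

    Existing partitions are appended to a target as a unit, so elements
    are not redistributed one by one as in a full shuffle.
    """
    merged = [[] for _ in range(num_partitions)]
    for index, partition in enumerate(partitions):
        merged[index % num_partitions].extend(partition)
    return merged
```

For a cluster with 8 CPUs, the rule gives 16 to 24 partitions; the other helpers show why rebalancing equalizes load while broadcasting multiplies the data volume by the number of partitions.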

E. QUERY-BASED ANALYSIS
Using SQL as a common interface for data has many advantages. Query-based engines allow building applications that interact programmatically with metastore tables. Many different systems provide an SQL interface for streaming and batch data on different platforms. Originally developed approaches such as Apache Hive [62], Apache Pig [82], Trino † and Apache Impala [83] are used to query large distributed batch data with SQL-like syntax from Big Data storage systems such as HDFS. On the other hand, newer data processing frameworks such as the Apache Spark SQL library (supports partial SQL functionalities), Apache Drill [84] (supports writing SQL queries similar to MySQL, Microsoft SQL Server, or Oracle), Apache Flink [78], Presto [85], Samza SQL [81], Apache Kylin [86], and Apache Pinot [87] are being used beyond these originally developed approaches to query data across many dimensions and metrics and across various data sources, including Hadoop clusters, NoSQL databases (Hbase, MongoDB, Cassandra, etc.), file systems/formats (CSV, Parquet, Avro, JSON, binary files, etc.), and cloud storage platforms. The graph query languages Cypher [88] and GQL ‡ (which are more declarative and easy to read compared to SQL) are used for graph database queries. Kafka Streams or ksqlDB [46] is a Java-based library designed to provide stream processing applications based on Kafka in a fault-tolerant and distributed manner. It can provide full-fledged stream processing capabilities that include consumption of Kafka topics, provision of state information, basic transformations, filtering, sliding windows, calls to ML jobs, exactly-once processing, etc. Similarly, ksqlDB allows processing of records/events using a SQL-like language. On the other hand, Kafka Streams is not well-suited for processing large amounts of data.
* https://livy.apache.org/, accessed December-2021
† https://trino.io/, accessed May-2021
Apache Kudu [89] is designed for fast analysis of high speed data (that is, rapidly changing data), and can perform queries over billions of rows and terabytes of data per second. In addition, there are several ways to access query-based Big Data frameworks, such as through a custom shell, a REST interface, a web interface, or through transport protocols such as Java Database Connectivity (JDBC)/Open Database Connectivity (ODBC) drivers, database protocols (MySQL, Postgres, Hive, etc.), and Remote Procedure Call (RPC) with Thrift, Protobuf, JSON, XML and so on. Spark's Thrift server, for example, enables this functionality by turning SQL queries into Spark jobs. Apache Arrow § is a language-independent big data layout that enables in-memory analytics for fast processing and movement of data. It essentially enables sharing and serialization of high performance data and serves as a communication interface. Apache Arrow provides bindings between many components, e.g. reading a file in a given format (e.g., in Parquet data format) and converting it to another format (e.g., Spark dataframe/dataset) for further processing without conversion issues. Some other frameworks, such as the Ray architecture [80], are based on Apache Arrow.
Vendor specific tools: Google provides Big Data tools like DataFlow ¶, DataProc, BigQuery (query engine for static datasets), Cloud AutoML, etc. as a service to its customers. Amazon offers SageMaker for ML, AWS Kinesis || for real-time streaming, Amazon Athena (based on Presto) and AWS Redshift Spectrum for interactive query services, and EMR. Microsoft offers Azure Event Hubs **. Databricks offers Databricks Unified Analytics Platform as a managed service. Dremio offers a cloud data lake engine for Big Data queries.
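The idea of SQL as a common, programmatic interface can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for the distributed query engines discussed above; the table name, column names, and the helper `critical_alarms_per_site` are hypothetical.

```python
import sqlite3


def critical_alarms_per_site(alarms):
    """Load (site, severity, timestamp) rows and aggregate them with SQL."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute(
            "CREATE TABLE alarms (site TEXT, severity TEXT, ts INTEGER)"
        )
        connection.executemany("INSERT INTO alarms VALUES (?, ?, ?)", alarms)
        cursor = connection.execute(
            "SELECT site, COUNT(*) AS n FROM alarms "
            "WHERE severity = 'critical' "
            "GROUP BY site ORDER BY n DESC, site"
        )
        return cursor.fetchall()
    finally:
        connection.close()
```

The same SELECT statement could in principle be submitted to a different engine through a JDBC/ODBC driver or a REST interface, which is the main appeal of a shared query language: the application logic stays the same while the execution backend changes.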

A. RELATIONAL DATABASES
Relational databases consist of constructs (e.g., tables and rows) and constraints (e.g., primary keys and referential integrity constraints) and provide both OLTP and OLAP. For structured data, traditional relational/transactional databases/systems such as MySQL and PostgreSQL are used for persistence. In these systems, data related to users (e.g. credentials data from the frontend), production and business systems can be stored. In these systems, reading data from the database is cheaper than writing data to it. On the other hand, (horizontal) scaling of these databases is a big problem: with increasing table size or many concurrent queries, important operators like grouping or joining become slower (with poor time complexity).

B. DATA LAKES
Data Lakes have essentially evolved to complement data warehouses, which in the 2010s were unable to support semi-structured and unstructured data. Data lakes can store more types and amounts of raw data than relational databases, which have several scaling issues. They are based on unlimited, cheap storage. Unstructured data such as text, documents, images, videos, etc., semi-structured data such as web server logs, streaming data from IoT, etc., and data with high variety, volume, and velocity are stored in Data Lakes (e.g. in distributed storage systems (Hadoop clusters HDFS, Ceph [90]), NoSQL databases (such as MongoDB, Apache Cassandra (inspired by Amazon DynamoDB [91]), CouchDB [92], Hbase, CosmosDB), NewSQL databases (such as MariaDB, MemSQL, VoltDB, InfluxDB, NuoDB)). Depending on the data models, NoSQL databases can also be divided into the following categories:
• Graph databases store data as nodes, and the connections between the data are called links or relationships. Graph data is used to represent data elements together with their mutual relationships. In networks, for example, network services can be represented as graphs to better focus on the connections between network components. Graphs are versatile and can model some systems better than representations in tabular format. Graph databases (e.g., Neo4j [93], Apache Giraph * ) are used by analytics engines to derive more insights, values and patterns from networked behaviour. They are particularly useful for mesh connections in a data lake. Apache Spark's API GraphX is also used for graphs and graph-parallel computations. Some use cases are social networks, knowledge graphs, etc.
• Key-value stores store data as a key-value pair containing an attribute name (or "key") and a value (e.g., for user profiling use cases). Some examples are Memcached, an in-memory key-value store used for read performance [94], and Redis server †, an in-memory key-value store that can work as a cache, message broker, or database.
• Column-oriented databases store data as a set of columns (for analytical use cases). Some examples are Cassandra [95] or Hbase [96].
• Document-oriented databases store data in formats such as JSON or XML documents. Some examples are MongoDB ‡ , Elasticsearch [63], Apache CouchDB.
Data Lakes are designed to be much cheaper, easy to write to, and able to store large amounts of data. However, it is difficult to access/read data from Data Lakes because the data may not be stored according to the analysis requirements and lacks consistency/isolation features. They also have complex system architectures due to multiple storage systems with different semantics. On the other hand, recent advances in ML libraries and data science ecosystem projects (e.g., TensorFlow) are starting to be integrated into data lakes after data preparation.
* https://giraph.apache.org/, accessed December-2021
† https://redis.io/, accessed December-2021
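The four NoSQL data models can be contrasted with a minimal sketch that uses plain Python structures as stand-ins for the respective stores. All identifiers, sample values, and the helper `reachable` are hypothetical and chosen only to show the shape of the data in each model.

```python
import json

# Key-value store: an opaque value looked up by its key.
kv_store = {"user:42:plan": "gold"}

# Document store: self-describing, possibly nested JSON documents.
doc_store = {
    "42": json.dumps({"name": "Ada", "plan": "gold", "devices": ["phone", "tablet"]})
}

# Column-oriented store: one sequence per column, convenient for aggregates.
column_store = {"user_id": [41, 42], "plan": ["silver", "gold"]}

# Graph: nodes with relationships, here network components and their links.
topology = {
    "router-a": ["switch-1", "switch-2"],
    "switch-1": ["cell-7"],
    "switch-2": [],
    "cell-7": [],
}


def reachable(graph, start):
    """Return every node that can be reached from `start` by following links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen
```

A question such as "which components are affected if router-a fails" is a simple traversal in the graph model, whereas it would require repeated joins in a tabular representation.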

C. DATA WAREHOUSES
Data Warehouses have been widely used since the 1980s to prepare data for business analysis and decision making. They are specifically designed for SQL analysis and BI that require a well-defined schema, indexes, etc., for storage with strong management features (e.g., Atomicity, Consistency, Isolation, Durability (ACID) transaction support). Data Warehouses are basically Massively Parallel Processing (MPP) databases that can handle large amounts of data (usually structured). In an ETL workflow, data is taken from an operational data system or source such as a data lake, transformed, and placed into a data warehouse so that a materialized view of the data can be created for reports and BI. For analytics purposes, data warehouses can be used to run queries (usually written in SQL) over repositories of current and historical data to gain insights. Data warehouses are useful for long-used data that does not change or as repositories for more refined forms of data (enriched, aggregated, etc.). Some examples of data warehouse tools are Presto and Apache Hive, as well as Google BigQuery, Amazon Redshift and Snowflake for cloud native data warehouses and IBM, Oracle, Teradata for on-premise data warehouses.

D. AVAILABLE TOOLS
The following are some examples of open source databases. Apache Pinot [87] can be used as a real-time distributed OLAP datastore and also provides fast OLAP queries on large datasets with low latency. Apache Hudi § is a storage abstraction framework that helps enterprises build and manage petabyte-scale data lakes. Hudi enables features such as upserts and incremental pulls to absorb data changes and apply them at scale to Hudi Data Lakes. ClickHouse ¶ is an open source OLAP column-oriented database, similar to Druid and Pinot, designed to aggregate as much information (on the order of several petabytes) as quickly as possible. TimescaleDB * is an open-source relational database for time series data, intended for query-oriented workloads and based on PostgreSQL.
Delta Lake † (from Databricks) is an open-source storage layer. It mainly provides ACID transactions for Data Lakes, Apache Spark and Big Data workloads/engines for interactive, batch, and streaming queries. Databricks has recently developed a new Lakehouse database paradigm that combines the benefits of Data Warehouses and Data Lakes into a single technology [97]. Iceberg ‡, recently released by Netflix as open source, is a new table format for storing large, slow-moving tabular and analytical datasets. Alluxio § provides a distributed storage system and distributed data orchestration across hybrid clouds. For object metadata management, Hive metastore takes care of mapping from SQL tables to files and directories in the storage component. The Hive metastore service (a binary API based on the Thrift protocol) is used to update metadata stored in a Relational Database Management System (RDBMS) such as MySQL, MariaDB or PostgreSQL.
Vendor specific tools: Amazon offers S3 for low-cost storage in Data Lakes, AWS Redshift for data warehousing and DynamoDB for database purposes, Amazon Neptune (as a graph database), AWS Redshift and Amazon Aurora (as a relational database), and AWS Glue for metadata management (similar to the Hive metastore service). GCP also offers several storage options, which are as follows: Cloud Storage (a service for storing objects, i.e., immutable data), Cloud Spanner (NewSQL database with unlimited scale, strong consistency and up to 99.999% availability), BigTable (a scalable NoSQL key-value database service for large analytic and operational workloads), BigQuery (serverless multi-cloud data warehouse), Cloud SQL (fully managed database service and can manage relational databases such as MySQL, PostgreSQL), and Cloud Datastore (

E. PRACTICAL ASPECTS
Depending on the technical requirements, there are different ways to select databases. In the enterprise database layer, a combination of NoSQL, in-memory, or relational databases may be present to take advantage of the strengths of each depending on the use case. For example, if the data is unstructured, Data Lakes may be chosen; if the data is structured and the workload is transactional, SQL databases (in the case of single-node systems), NewSQL databases (in the case of horizontal scalability), or NoSQL databases can be selected. On the other hand, if the data is structured but the workload is data analysis, then depending on the latency requirements, solutions such as MongoDB, i.e. databases offering NoSQL services (for latency requirements in milliseconds), or one of the data warehouse solutions described above (for latency requirements in seconds) can be chosen.
There are also various data source formats commonly used by Big Data processing systems and platforms to store or exchange data. These include columnar data formats such as Parquet and ORC as well as various other data formats such as CSV, TXT, JSON, JDBC, Avro, binary files, etc. Apache Parquet is an open file format for columnar data that can be used in HDFS to store data along with its schema information and enable various I/O optimizations (e.g., compression), fast columnar analysis and aggregation. It is the default data source for many Big Data processing frameworks and platforms, including Apache Spark for analytics workloads. The open file format of Avro data storage is also used in Apache Spark and Apache Kafka when (de)serializing messages. The Avro file format is row-oriented (as opposed to Parquet). It is a framework for serializing data and can provide direct mapping to JSON, which improves the speed and efficiency of data processing.
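The difference between a row-oriented and a columnar layout can be sketched in a few lines. The example does not read or write real Avro or Parquet files; JSON lines and a dictionary of lists are used as stand-ins, and the record fields and function names are hypothetical.

```python
import json

# Hypothetical measurement records.
records = [
    {"cell_id": "A1", "rsrp_dbm": -95, "throughput_mbps": 42.0},
    {"cell_id": "A2", "rsrp_dbm": -100, "throughput_mbps": 30.5},
    {"cell_id": "B1", "rsrp_dbm": -105, "throughput_mbps": 12.5},
]


def to_row_oriented(rows):
    """Row layout: each record is serialized as one unit (message-friendly)."""
    return [json.dumps(row, sort_keys=True) for row in rows]


def to_column_oriented(rows):
    """Columnar layout: all values of a field are stored together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_mean(columns, name):
    """Aggregate one column without touching the others."""
    values = columns[name]
    return sum(values) / len(values)
```

In the columnar layout, an aggregate over a single field reads only that field's values, which is one reason columnar formats suit analytics, while the row layout keeps each record together, which suits message (de)serialization.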
Data applications often access and use hot/warm storage databases (e.g., Redis) for real-time access, while cold event data is stored in lower-cost storage (e.g., cloud storage, HDFS). Depending on the storage characteristics, there are also different types of durability. In the persistent type, the data is not lost even if the cluster fails completely. In the replicated type, the data is not lost even if a limited number of nodes in the cluster fail, and in the transient type, the data is lost in case of failures. Most data processing frameworks require a sharded and persistent control store database that is responsible for storing metadata such as task specifications, task dependencies, or critical system information needed to recover from failures (i.e., for fault tolerance purposes). When a node in the cluster fails and critical information is in danger of being lost, this datastore is used to regenerate the data by re-executing the required tasks, which is known as lineage-based fault tolerance. These specifications are typically stored in a control store database (e.g., ZooKeeper in the Hadoop ecosystem [98] or the global control store in Ray [80], a cluster of Redis databases).
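Lineage-based recovery can be sketched with two dictionaries: a persistent control store that keeps task specifications and dependencies, and a transient object store that keeps results. This is a toy illustration under our own assumptions (task names, in-memory dictionaries, single process); it is not the mechanism of ZooKeeper or of Ray's global control store.

```python
# Persistent control store: task name -> (function, upstream task names).
control_store = {
    "raw": (lambda: [1, 2, 3, 4], []),
    "squared": (lambda xs: [x * x for x in xs], ["raw"]),
    "total": (lambda xs: sum(xs), ["squared"]),
}

# Transient object store: results that may be lost when a node fails.
object_store = {}


def get(name):
    """Return a task result, re-executing its lineage if the result is missing."""
    if name not in object_store:
        func, deps = control_store[name]
        object_store[name] = func(*(get(dep) for dep in deps))
    return object_store[name]


first = get("total")

# Simulate a node failure that wipes two intermediate results.
object_store.pop("squared")
object_store.pop("total")

# The lost results are regenerated from the task specifications.
recovered = get("total")
```

Only the missing results are recomputed; the surviving "raw" result is reused, which is the main cost advantage of lineage over full replication.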
Finally, note that data storage can also be costly as data accumulates over the years. Therefore, it is desirable to store smaller volumes of processed smart data rather than large volumes of raw data, unless retention of the raw data is required by regulation.

VI. DATA MONITORING AND VISUALIZATION FRAMEWORKS
Monitoring and visualization of data play an important role in the world of telecommunications. With proper monitoring and visualization, it is easier to uncover insights and patterns, understand relationships between observations, or describe trends or seasonality in telecommunication data. Data visualization is traditionally used for regular performance reporting to explain and present the results of data analysis or the data itself. It can also serve as an interface for users to run or compile analytics on data processing and analysis frameworks (e.g., queries against loaded data) and visualize the results. Many business decisions are made on dashboard-based pipelines through daily monitoring and interpretation of the data itself. Notification services and alerts, monitoring dashboards using tools like Grafana or Kibana [63], business intelligence dashboards (drill downs, top K results), ad hoc query clients/notebooks like Jupyter, heatmaps and user feedback can all be provided during the data presentation or visualization stage.
Some of the open source tools and frameworks available for data visualization are as follows. For exploring, visualizing and discovering data through dashboard visualization, Kibana from the ELK stack [63] provides a free and open user interface. Grafana is used as an analytics and monitoring tool that allows monitoring of infrastructure, applications, and metrics by connecting to databases. Kiali [99] provides dashboards and service mesh observability. Gephi provides an open graph visualization platform. Dash is a production Python framework for building web applications and data visualization apps using Flask, Plotly.js, and React.js. Apache Superset is an open-source visualization tool developed by Airbnb that is used to visualize analytics results (with chart types such as word counts, heat maps, boxplots, etc.). Apache Superset can also be used for queries with Apache Druid. Apache Druid [61] serves as an analytics database, but can also be used as a dashboard for quick results on complicated analytics tasks. As a data application framework, Streamlit can be used to easily create ML and data science web applications. Metabase is an open source tool for business intelligence purposes.
To monitor data pipelines, the entire infrastructure, or cloud native applications, Prometheus [100], Datadog, and Sentry provide monitoring services for servers, databases, tools, and cloud applications. One of the most popular cloud native monitoring tools is Prometheus. Prometheus can identify the applications to be monitored (either third-party or on-premise) through its service discovery feature, and their metrics can be scraped through Exporters. The scraped data can be stored in local storage to be queried with PromQL or visualized with Grafana. Prometheus also allows sending alerts to various notification systems such as email, chat systems, etc. based on the configured alert rules.
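The scrape, store, and alert cycle described above can be mimicked in a few lines. The sketch below is a conceptual illustration only: it does not use the Prometheus client library, PromQL, or the Prometheus rule syntax, and the exporter callable, the metric name cpu_utilization, and the LocalStore class are hypothetical.

```python
from collections import defaultdict


def scrape(exporter):
    """Pull the current metric values from an exporter (here a plain callable)."""
    return exporter()


class LocalStore:
    """Append-only in-memory time series, one list of samples per metric."""

    def __init__(self):
        self.series = defaultdict(list)

    def append(self, samples):
        for name, value in samples.items():
            self.series[name].append(value)


def alert_firing(store, metric, threshold, for_scrapes):
    """Fire only if the last `for_scrapes` samples all exceed `threshold`."""
    recent = store.series[metric][-for_scrapes:]
    return len(recent) == for_scrapes and all(v > threshold for v in recent)


# Hypothetical exporter that reports one CPU utilization reading per scrape.
cpu_readings = iter([0.55, 0.92, 0.95, 0.97])
exporter = lambda: {"cpu_utilization": next(cpu_readings)}

store = LocalStore()
states = []
for _ in range(4):
    store.append(scrape(exporter))
    states.append(alert_firing(store, "cpu_utilization", 0.9, for_scrapes=3))
```

Requiring several consecutive violations before firing avoids alerts on short spikes, which is the same motivation as a hold duration in alert rules.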
Vendor specific tools: For reports and dashboard outputs, Tableau, Looker, and Mode; for embedded analytics outputs, Sisense; for advanced analytics outputs, Thoughtspot, Outlier Analytics, Anodot, Sisu; for building ML and data science web applications, Plotly Dash; for data visualization, Google's Cloud Datalab, TIBCO Spotfire, MicroStrategy, Zepl, SAP's Lumira, Microsoft's Power BI; for running high-performance queries on petabytes of structured data to create powerful reports and dashboards, Amazon Redshift and Vertica; and for monitoring applications and infrastructure in the cloud, Amazon's CloudWatch, Google's StackDriver, and Microsoft's Azure Monitor can be used.

VII. DATA ORCHESTRATION AND MANAGEMENT FRAMEWORKS
The use and popularity of microservices and cloud/container orchestration frameworks are increasing due to various benefits such as service discovery or easy horizontal scaling. There is no sign that the ecosystem of data engineering and infrastructure is coalescing into a unified, manageable form. Over time, new distributed databases, frameworks, platforms, and libraries will be introduced into the ecosystem. Because of this, data management frameworks are needed to bring everything together with suitable APIs as these systems evolve into more complex structures day by day. For example, stream processing jobs are more akin to microservices and thus require support for managing services and applications, including cluster management, debugging and continuous monitoring. Note that centralized management approaches (e.g., centralized scheduling) can create a bottleneck in the system, which can then only provide a finite amount of throughput (i.e., in terms of tasks processed per second). Therefore, the importance of automated and distributed resource management, scheduling, and orchestration is steadily increasing in the modern data ecosystem. The main goal is to start or manage computational resources, services or containers with a single API call or a set of API calls to reduce operational costs and complexity.
In data applications, all computations consume and produce data which must be orchestrated in a data-aware

A. SCALING
The number of data sources that an organization can ingest and process is growing much faster than the number of resources for data engineering. In addition, the variance of data traffic in production systems can be unexpectedly high at certain times. For these reasons, the system should be scalable without compromising throughput and latency. Scalability is the ability of the system to maintain acceptable performance as the load increases (e.g., high volume, high throughput or high velocity data). Scalability can refer to different dimensions: Processing, Serving or Storage. A scalable architecture means that the system should continue to function smoothly whether 1 user or 1 million users access the application. There are two ways to scale a database:
1) Vertical scaling (or scaling up) is done by improving the Central Processing Unit (CPU), Random Access Memory (RAM), and storage capacity of the existing machine in hardware or by optimizing algorithms and application code in software. This eventually reaches its limits when the amount of data on a single machine grows.
2) Horizontal scaling (or scaling out) is done by adding additional machines to the database cluster. Note that requests in this case not only consume CPU, but also require network resources. In this case, it is important to shard the data so that a single query can be processed on a single machine.
Due to difficulties in horizontal scaling with traditional SQL databases, NoSQL and NewSQL databases with horizontal scaling options have emerged. However, many NoSQL databases relax strong consistency guarantees and the relational model (see Section V for details). The complexity of the scaling process depends on the type of service that the infrastructure provides. There are two types of services:
• In stateless services, scaling is usually done based on user traffic and on-demand, e.g., web server applications. Deploying and scaling stateless services is relatively straightforward.
• In stateful services, consistency of user data across data centers is critical for scaling, e.g., for scaling databases. Deploying and scaling stateful services is more difficult and complex because copies of the database need to be managed in different data centers.
A possible solution to reduce complexity would be to partition the database and enable different replication factors in different locations so that the overall replication factor can be reduced. This would also help reduce traffic between data centers.
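As an illustration of partitioning with a reduced replication factor, the sketch below maps each key to one shard with a stable hash and then places each shard on only a subset of sites. The function names, the site labels, and the consecutive-site placement policy are assumptions made for this example; production databases use their own placement strategies.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard with a stable hash, so one key lives on one shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(shard, sites, replication_factor):
    """Place a shard on `replication_factor` consecutive sites, not on all of them."""
    return [sites[(shard + i) % len(sites)] for i in range(replication_factor)]


sites = ["dc-a", "dc-b", "dc-c", "dc-d"]  # hypothetical data centers
shard = shard_for("subscriber-42", num_shards=len(sites))
placement = replicas_for(shard, sites, replication_factor=2)
```

With four sites and a replication factor of two, each write is shipped to two data centers instead of four, which roughly halves the inter-site replication traffic at the price of lower redundancy.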

B. PARALLELISM
To increase processing capacity, most data engineering pipeline frameworks exploit the potential of parallelism. Parallelism can be applied over either data or tasks, depending on the application. In data parallelism, input data is partitioned to scale processing (e.g., in low-cost HDFS). In task parallelism, multiple tasks (CPUs) are executed over the same data. In event processing, there are various forms of shuffling and data exchange capabilities within parallel computing frameworks. These mainly fall into the following categories:
• Forward or one-to-one: In this case, the data is processed on the same node and there is no data exchange between the nodes of the cluster.
• Broadcast: In this mode, data is broadcast to all nodes because it must be used by the tasks running on each node.
• HashKey: In this case, the data is collected/grouped by key to distribute the data among the worker tasks of the nodes in a cluster.
• Rebalance or random: In this case, random repartitioning is used when not much is known about the data. The goal is to balance the distribution between the nodes to increase the data processing capabilities of the nodes.
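The four data exchange patterns listed above can be sketched as functions that assign records to node indices. This is a single-process illustration under our own naming; the event tuples are hypothetical and no real shuffle over a network takes place.

```python
import hashlib
import random


def forward(records, node):
    """One-to-one: records stay on the node that produced them."""
    return {node: list(records)}


def broadcast(records, num_nodes):
    """Every node receives a full copy of the records."""
    return {n: list(records) for n in range(num_nodes)}


def hash_key(records, num_nodes, key):
    """Records with the same key always land on the same node."""
    out = {n: [] for n in range(num_nodes)}
    for record in records:
        digest = hashlib.sha256(str(key(record)).encode("utf-8")).digest()
        out[int.from_bytes(digest[:4], "big") % num_nodes].append(record)
    return out


def rebalance(records, num_nodes, seed=0):
    """Random repartitioning that evens out the load across nodes."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    out = {n: [] for n in range(num_nodes)}
    for i, record in enumerate(shuffled):
        out[i % num_nodes].append(record)
    return out


events = [("cell-a", 1), ("cell-b", 2), ("cell-a", 3),
          ("cell-c", 4), ("cell-b", 5), ("cell-c", 6)]
by_key = hash_key(events, num_nodes=3, key=lambda e: e[0])
balanced = rebalance(events, num_nodes=3)
```

Hash partitioning preserves key locality but may be skewed if some keys dominate, whereas rebalancing gives even partition sizes but loses key locality.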

C. MICROSERVICES AND DEPLOYMENT
Microservices and microservices-based architectures are among the recent trends that enable flexibility and scalability. Data orchestration and management teams are adopting them to run and manage distributed application components. A service mesh architecture enables microservices to communicate with each other. Typical types of communication between microservices can be divided into three categories: 1) Event-driven communication based on platforms such as a message/event bus, 2) Orchestration based on REST callouts using frameworks such as Apache Camel, 3) Orchestration based on workflow engines such as Apache AirFlow.

D. CI/CD AND IAC
Continuous Integration (CI)/Continuous Delivery (CD) are terms that have been common in the industry for a decade. Platform engineers and DevOps teams are continuously developing infrastructure tools to improve engineer productivity. At the same time, data applications should also be flexible enough to be deployed by CI/CD platforms for testing and development purposes. For data engineering applications, integration can be done in three modes:
1) Data integration: This is related to data movement patterns. It can take place either at the node level or at the cloud level, and ETL is an example of this. Data integration is challenging as it has to deal with heterogeneous data sources with different sampling rates and data generation models.
2) Application integration: This uses APIs such as REST, SOAP, etc. between applications and their internal subscriptions. In application integration, the amount of data exchanged is not as large as in data integration.
3) Event integration: This combines the benefits of application and data integration, where a significant amount of data is also exchanged when events are triggered between applications.
Infrastructure as Code (IaC) enables scalable infrastructure deployment via standardized interfaces such as YAML or JSON files. IaC can be implemented via containers (e.g. via Docker, LXC [102]), container orchestration (e.g. via Kubernetes [103], Docker Swarm [71], Apache Mesos [104]) and infrastructure provisioning (e.g. via Terraform). Two different models for IaC or automation are:
1) Declarative model: This requires the user to define a desired state to be provisioned by Kubernetes. The relationship between the infrastructure and the application is declarative. Applications make declarative requests to the infrastructure (e.g., via YAML, a data serialization language that is easily understood by humans and machines) with implementation details abstracted by the underlying framework (e.g., the Kubernetes cluster). One of the main advantages of this model is that once the plan is created, it is the responsibility of the framework to develop and execute the work plan in a way that is optimized for the complexity of the infrastructure. This enables the transition from the infrastructure as code paradigm to the infrastructure as data paradigm. Any new application can be easily enriched using the declarative tags of the YAML management file.
2) Imperative model: This requires users to specify commands or plans in a specific order to achieve and maintain a desired state, such as the service provided by Apache AirFlow. Although this model provides flexibility, imperative models have scaling issues as the number of components increases and complexity grows exponentially.
* https://github.com/ContainerSolutions/k8s-deployment-strategies, accessed December-2021
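The core of the declarative model is a reconciliation step: the user states only the desired state, and the framework derives the actions needed to reach it. The sketch below illustrates this idea with replica counts per service. It is a simplified illustration with hypothetical service names and action tuples, and it is not the Kubernetes controller implementation.

```python
def reconcile(desired, actual):
    """Derive the actions that move `actual` replica counts toward `desired`."""
    actions = []
    for name, want in desired.items():
        have = actual.get(name, 0)
        if want > have:
            actions.append(("start", name, want - have))
        elif want < have:
            actions.append(("stop", name, have - want))
    for name, have in actual.items():
        if name not in desired:
            # Anything not declared is removed.
            actions.append(("stop", name, have))
    return actions


def apply_actions(actions, actual):
    """Return the new state after executing the derived actions."""
    state = dict(actual)
    for verb, name, count in actions:
        replicas = state.get(name, 0) + (count if verb == "start" else -count)
        if replicas:
            state[name] = replicas
        else:
            state.pop(name, None)
    return state


desired = {"ingest": 3, "dashboard": 1}   # what the user declares
actual = {"ingest": 1, "legacy-etl": 2}   # what is currently running
plan = reconcile(desired, actual)
converged = apply_actions(plan, actual)
```

In the imperative model, by contrast, the user would have to write the three actions in `plan` by hand and keep them correct as the running state changes.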

E. AVAILABLE TOOLS
For highly reliable distributed coordination of multiple machines in a cluster, distributed key-value stores such as Apache Zookeeper [98], etcd, and Consul provide reliable data stores for distributed system access. They can perform functions such as leader elections, fault tolerance for machine failures, network configuration automation, and service discovery. To ensure strong consistency and replication, consensus algorithms and protocols such as Raft and Paxos are used to determine the order in which data is stored and when it becomes visible to users. Data management tools such as Deequ are particularly useful for dealing with corrupt or bad data in large datasets. MLflow [105], an open source ML lifecycle platform, is used to manage the E2E ML lifecycle so that model-based experiments and quality metrics can be managed, tracked, and reproduced. Kubeflow [106] provides a data automation framework for Kubernetes clusters and enables connectivity with a variety of databases and services. This enables a highly modular design. Weights & Biases is another related developer tool for ML. These ML lifecycle management frameworks act as CI/CD tools in the ML domain.
Some of the most popular systems for defining workflows and programmatically scheduling jobs/tasks are Apache Mesos [104] and Apache YARN [44] (e.g., to run Spark jobs using YARN), Apache AirFlow (provides workflow-level abstraction for building data pipelines using Directed Acyclic Graphs (DAGs)), Dagster (for data orchestration), Prefect (for automating data flows and creating, executing and monitoring millions of data workflows and pipelines), Spotify's Luigi, LinkedIn's open source scheduler Azkaban, Apache Oozie [107] (for distributed coordination and scheduling of workflows in Hadoop clusters), Apache TEZ [108] (for building complex directed acyclic graphs of tasks to process data), Apache Ambari (provides a management interface), Kubernetes [103] (provides high flexibility as the dominant container orchestration framework), OpenShift (provides a web console to run tasks directly), Yunikorn (resource scheduler for containerized systems), and Docker Swarm. Apache Calcite [109] is a dynamic data management framework that is used as an intermediary between applications, data stores, and data processing engines. Amundsen (from Lyft), Metacat (from Netflix), and DataHub (from LinkedIn) provide both metadata management and data discovery options to improve the productivity of data professionals interacting with data. Some of these tools (such as Kubernetes) are also important for managing and deploying containers (e.g., container lifecycle and resources, or facilitating application development through container orchestration) in the context of network operations and management in 5G networks.
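The DAG abstraction used by workflow engines can be illustrated with the topological sorter from the Python standard library (Python 3.9 or later). This is a minimal sketch with hypothetical task names; it does not use the Apache AirFlow API, and the tasks are placeholder callables.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract_counters": set(),
    "extract_alarms": set(),
    "clean": {"extract_counters", "extract_alarms"},
    "aggregate": {"clean"},
    "publish_dashboard": {"aggregate"},
}


def run(dag, tasks):
    """Execute tasks in dependency order and return the execution order."""
    order = []
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        # All tasks returned together are independent and could run in parallel.
        for name in sorted(sorter.get_ready()):
            tasks[name]()
            order.append(name)
            sorter.done(name)
    return order


tasks = {name: (lambda: None) for name in dag}
execution_order = run(dag, tasks)
```

A real scheduler would additionally handle retries, timeouts, and distributed workers, which are outside the scope of this sketch.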
Some common IaC and automation tools that can help eliminate the risk of human error, increase the speed of code development (by reliably building, testing, and deploying software), and reduce costs are Ansible, Chef and Puppet for configuration management, Terraform (infrastructure deployment orchestration tool), and Jenkins and GitHub Actions as part of the CI/CD chain. These tools have significantly improved the way applications and workflows are deployed and managed within the infrastructure through configuration (installing packages, configuring servers, and deploying applications to the infrastructure) and orchestration of the infrastructure. Great Expectations provides data quality, documentation, profiling and testing.
Vendor-specific tools: Amazon offers AWS Elastic Beanstalk, Amazon ECS (as a container orchestration framework), AWS CloudFormation as an infrastructure provisioning and IaC tool, CodeDeploy for deployment, and CodePipeline for unit and integration testing and the CI/CD pipeline; Microsoft provides Azure Resource Manager (ARM) and Pipelines for the deployment and management service and the CI/CD pipeline over Azure, and TestPlans for unit and integration testing; Google provides Google Composer as a managed Apache Airflow service on GCP, Google Kubernetes Engine (GKE), Deployment Manager for infrastructure automation, and Cloud Build for deployment, unit and integration testing, and the CI/CD pipeline; Puppet Enterprise offers Puppet for configuration management; and HashiCorp offers Terraform for deploying and managing any cloud, infrastructure or service. Pulumi provides a modern Infrastructure as Code platform for building, deploying, and managing infrastructure in any cloud. For visibility and observability of data pipelines, Unravel, Fiddler and Acceldata can be used. For data modeling and analytical engineering workflows, dbt (Data Build Tool) and LookML can also be used to transform data in data warehouses more efficiently.
Considering all the above descriptions, starting from data connection to data orchestration, Table 5 shows the summary of characteristics of open source frameworks in the data engineering pipeline and the corresponding related works. In addition, some of the most important tools and their connection to the components of the data engineering pipeline framework are described in Fig. 2.

VIII. RELATION WITH DATA SCIENCE FRAMEWORKS
With the advent of DL and new computational workloads, scaling and distributed computing are becoming increasingly important in AI/ML. Data Engineering also helps data science developers build distributed computing frameworks so that they can focus on developing their own AI/ML algorithms instead of dealing with the intricacies of distributed computing. On the other hand, AI/ML based platforms built on Data Science libraries play an important role. For example, in a processing pipeline, data scientists work on defining and preparing the model, and data engineers implement the aspects that serve the model. For this reason, the ability to interact with AI/ML models (export, import, etc.) from data science tools is important.
For developing Reinforcement Learning applications, several frameworks are available: Ray (a fast and simple framework) provides RLlib for building and running distributed, parallel, scalable reinforcement learning-based applications [123]; Stable Baselines provides a set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines; Garage can be used to develop and evaluate reinforcement learning algorithms; Coach is a Python-based reinforcement learning framework that includes implementations of many state-of-the-art algorithms; Tensorforce is a TensorFlow library for applied reinforcement learning; and ChainerRL is a deep reinforcement learning library that includes several state-of-the-art deep reinforcement learning algorithms in Python with Chainer [117] (a flexible DL framework). OpenAI [124] is also developing its own new distributed reinforcement learning algorithms and building its own infrastructure/tools for its applications to achieve the required flexibility and performance.
For Distributed Training, frameworks such as Horovod [125] (along with TensorFlow, Keras, PyTorch, and Apache MXNet) and Distributed TensorFlow can be used. For Model Serving (which takes the trained model and sends predictions or recommendations to specific applications), tools/frameworks such as Clipper [126] (a low-latency prediction serving system for ML that is integrated with client systems), TensorFlow Serving, TorchServe, Ray Serve, or Seldon can be used, depending on the use case that requires low-latency model deployment for large-scale prediction services. For Hyperparameter Search (either via manual search, grid search, Bayesian optimization, evolutionary optimization or random search), Advisor (an open source implementation of Google's Vizier) is used as a hyperparameter tuning system for black-box optimization, while Hyperopt [127] and Tune are used for distributed hyperparameter optimization and scalable hyperparameter tuning, respectively. For NLP, tools such as spaCy [128], Hugging Face [129], AllenNLP [130] and more recently GPT-3 [131] can be used.
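Two of the search strategies mentioned above, grid search and random search, can be sketched in pure Python. The objective below is a toy function that we assume for illustration (its minimum is placed at lr = 0.1 and depth = 4); in practice it would be a validation loss obtained by training a model. None of the named tuning libraries is used here.

```python
import itertools
import random


def objective(lr, depth):
    """Toy validation loss; the location of its minimum is an assumption."""
    return (lr - 0.1) ** 2 + (depth - 4) ** 2


def grid_search(space):
    """Evaluate every combination of the listed values."""
    names = list(space)
    candidates = (dict(zip(names, combo))
                  for combo in itertools.product(*space.values()))
    return min(candidates, key=lambda params: objective(**params))


def random_search(space, trials, seed=0):
    """Evaluate a fixed budget of randomly sampled combinations."""
    rng = random.Random(seed)
    candidates = [{name: rng.choice(values) for name, values in space.items()}
                  for _ in range(trials)]
    return min(candidates, key=lambda params: objective(**params))


space = {"lr": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best_grid = grid_search(space)
best_random = random_search(space, trials=5)
```

Grid search is exhaustive but its cost grows multiplicatively with each added hyperparameter, while random search trades the guarantee of finding the best grid point for a fixed evaluation budget.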
Vendor This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

IX. NETWORK SERVICE MANAGEMENT AND ORCHESTRATION OVERVIEW
Thanks to advances in network services, new applications such as the tactile internet, holographic-type communications and teledriving are expected to emerge in the next decade. Many of these E2E services also require high levels of precision, which has significant implications for the management of these networks and services. Managing and maintaining distributed computing functions in a network environment with elevated levels of service requirements requires hundreds of operations at any given time. This makes the human-centric and standard network service management and operation solutions currently in use inadequate and ineffective. Therefore, automation of network management and service deployment is critical and there is a need for unified network Lifecycle Management (LCM) and orchestration across multiple administrative and technological domains. Network service orchestration enables network operators to connect and configure systems and multiple network elements through an optimized workflow. This enables the delivery of optimal services to users and contributes to automation by coordinating interactions and service flows across multi-domain, multi-layer, multi-vendor networks. Approaches that rely on intelligent automation of network operations (e.g., via the emerging field of Zero Touch network and service management (ZSM) [134]) can make a big difference when multiple network functions need to be owned, maintained, and operated at scale throughout the network.

A. DIFFERENT ASPECTS OF NETWORK SERVICE LCM
Network service LCM addresses the necessary operations to create, deliver, manage and orchestrate network services to meet the diverse needs of end users and enterprises over network infrastructures. The goal is to guarantee autonomic network service assurance and dynamic service delivery. Some of these operations include deploying, provisioning, onboarding, updating on the fly, and storing network functions, ensuring zero downtime, demand-based scaling in an intelligent and cost-efficient manner, suitable orchestration of infrastructure resources, and anomaly detection at run time. These operations are achieved via software-based/virtualized network functions or cloud native microservices deployed across fog/edge/cloud infrastructure.
The ETSI-defined cross-domain E2E network service LCM is divided into three main processes (see [135] and
Some examples of service management include E2E service orchestration (to control the service model and maintain the service catalogue), domain orchestration (to send alerts when changes are made to the catalogue and to request missing entries in the catalogue), ZSM integration fabric (to manage subscriptions, data generation and consumption) and ZSM data services (to store data and provide data persistence).
• Service fulfilment procedure is used to manage E2E service instances from instantiation to termination. The following processes are provided for the provisioning of E2E service instances. (i) Service instantiation (creates an E2E service instance), (ii) Service activation (activates an E2E service instance), (iii) Service configuration (modifies the configuration of an E2E service instance), (iv) Service deactivation (deactivates an E2E service instance), (v) Service decommissioning (removes an E2E service instance and releases all its resources), (vi) Update E2E inventory (keeps up-to-date information about resources and domain service instances).
• Service assurance procedure is used to ensure that E2E service level requirements are met. The following processes are provided to deliver E2E service assurance. (i) Service assurance set-up (assures an E2E service), (ii) Service quality management (manages service quality), (iii) Service problem management (investigates cross domain service problems), (iv) Service assurance teardown (defines procedures to tear down the collection of information).
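The service fulfilment processes listed above can be viewed as operations on a small state machine. The sketch below is our own illustration: the operation names follow the list, but the state names and the set of allowed transitions are assumptions and are not taken from the ETSI specification.

```python
# (current state, operation) -> next state; the allowed transitions are assumed.
TRANSITIONS = {
    ("new", "instantiate"): "instantiated",
    ("instantiated", "activate"): "active",
    ("inactive", "activate"): "active",
    ("active", "configure"): "active",
    ("active", "deactivate"): "inactive",
    ("instantiated", "decommission"): "decommissioned",
    ("inactive", "decommission"): "decommissioned",
}


class ServiceInstance:
    """Tracks the lifecycle state of one E2E service instance."""

    def __init__(self, name):
        self.name = name
        self.state = "new"
        self.history = []

    def apply(self, operation):
        key = (self.state, operation)
        if key not in TRANSITIONS:
            raise ValueError(f"{operation!r} is not allowed in state {self.state!r}")
        self.state = TRANSITIONS[key]
        self.history.append(operation)
        return self.state


service = ServiceInstance("slice-1")  # hypothetical instance name
for operation in ("instantiate", "activate", "configure",
                  "deactivate", "decommission"):
    service.apply(operation)
```

Encoding the transitions as data makes invalid sequences, such as decommissioning an active instance, fail early instead of leaving resources in an inconsistent state.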

B. MANAGEMENT PLATFORMS
In this section, we describe two of the most popular management and orchestration platforms.  [136].

X. NETWORK MANAGEMENT AND ORCHESTRATION IN STANDARDIZATION
In recent years, several standards organizations, independent alliances, and forums have been involved in developing standards for platforms that work with AI and ML. In addition, zero-touch network management and orchestration frameworks, which aim to significantly simplify the tasks performed by humans to manage and orchestrate network slices, are currently being extensively researched by standardization bodies. Standards organizations such as the Open Radio Access Network (O-RAN) alliance, ETSI, and the 3rd Generation Partnership Project (3GPP) are working on embedding intelligence into emerging next generation architectures to efficiently meet the diverse needs of communication network users. In this section, we provide an overview of some of the work that has been done in these SDOs to build an AI/ML platform. A good overview of existing standardization efforts related to AI for 5G systems, as well as some of the identified gaps in standardization, is also given in [137].

A. O-RAN'S AI/ML ARCHITECTURE
The O-RAN alliance aims to define a next-generation radio access network (RAN) infrastructure based on software-defined technology and general-purpose hardware, driven by both intelligence and openness at every layer of the RAN architecture [138]. O-RAN currently offers an attractive solution for creating next-generation multivendor networks that embrace the concepts of programmable, open, collaborative, and intelligent communications. Therefore, AI and ML are the main pillars for the realization of O-RAN. The O-RAN high-level architecture can be divided into two layers, namely Service, Management and Orchestration (SMO) and the radio access site, as shown in Fig. 3 [139]. The radio access entities comprise RAN intelligent controllers (RICs), both near real-time (near-RT) and non real-time (non-RT), the vertically divided control (CP) and user (UP) planes of the central units (CU), as well as open interfaces that interconnect the O-RAN nodes. Both near-RT and non-RT controllers are introduced to extend the existing network functions with more embedded intelligence within the O-RAN architecture. The near-RT RIC is interfaced with a centralized unit control plane (CU-CP) for transmission of signals and configuration messages and with a centralized unit user plane (CU-UP) for data transmission, and can be used for control loops on a time scale in the order of milliseconds. The non-RT RIC is interfaced with the near-RT RIC via interface A1 (for policy management and coordination) and with the CU-CP, the Distributed Unit (DU) and the Radio Unit (RU) via interface O1, and can be used for control loops on a time scale greater than 500 ms [140]. In addition, applications (xApps/rApps) are introduced to be hosted either on the near-RT RIC or on the non-RT RIC depending on how sensitive the applications are to control processes.
* https://osm.etsi.org/, accessed February-2022
In the architecture, different interfaces (O1, A1 and E2) are used depending on the results of the AI/ML algorithms, the actions and the actors. For example, the O1 interface is used to configure Control and Data Units and near-RT RIC for fault and performance management. A1 interface enables non-RT RIC to provide RAN optimization functionalities to near-RT RIC functions. These functionalities include policy management, ML model management, or data enrichment. The E2 interface is used for communication between near-RT RIC and centralized/distributed units of RAN in the O-RAN architecture. AI/ML algorithms can run on top of near-RT RIC (e.g., xApps for energy, resource or beamformer optimization, traffic steering) or the non-RT RICs (e.g., rApps for RAN automation applications such as network deployment, optimization, frequency band selector). These algorithms can also be reconfigured based on data availability, control timescales, network load, and overall mobile operator requirements.
Data pipeline generation via O-RAN architecture: To enable automated and intelligent network functions, the O-RAN architecture has been standardized to include three types of control loops (categorized by the time sensitivity of the required decision-making process) and AI/ML dedicated nodes. The first control loop is responsible for scheduling at the Transmission Time Interval (TTI) level and operates on a time scale of TTI ms or above. The second control loop operates in the near-RT RIC within the range of 10-500 ms. The third control loop operates in the non-RT RIC and makes decisions on a time scale greater than 500 ms (e.g., for policy and orchestration purposes). These control loops can also operate in parallel. Offline/online training and inference can be supported via O-RAN components such as the SMO and the non-RT and near-RT RICs.
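The assignment of a decision to one of the three control loops can be sketched as a simple lookup on its required reaction time. The 10 ms and 500 ms values follow the text; how the exact boundary values are treated, and the use of 10 ms as the upper limit of the TTI-level loop, are assumptions made for this illustration.

```python
def control_loop_for(deadline_ms):
    """Map a required decision time (in ms) to the control loop that can serve it.

    Boundary handling (10 ms and 500 ms inclusive for near-RT) is an assumption.
    """
    if deadline_ms <= 0:
        raise ValueError("deadline must be positive")
    if deadline_ms < 10:
        return "TTI-level scheduling loop"
    if deadline_ms <= 500:
        return "near-RT RIC loop"
    return "non-RT RIC loop"


# Hypothetical use cases and reaction times, for illustration only.
placements = {
    "per-TTI scheduling": control_loop_for(1),
    "traffic steering": control_loop_for(100),
    "policy update": control_loop_for(60_000),
}
```

Such a mapping is what determines whether an AI/ML function is packaged as an xApp on the near-RT RIC or as an rApp on the non-RT RIC.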
In line with recent developments in data engineering, a mapping can be made between the components of O-RAN and the existing data engineering ecosystem. The O1 interface is used for data collectors and preprocessing entities within the service and management orchestrator (e.g., within the Open Network Automation Platform (ONAP) [141]). The O1 interface can be used to transmit RAN metrics associated with the performance of the RAN nodes to the SMO. In addition, non-RT RIC placed in the SMO can enable RAN optimizations using the collected RAN metrics and contextual (external) data. The SMO can later control the RAN and apply configuration changes. Fine-grained data collection is performed via the E2 interface to enable near-RT control and optimization of RAN elements.
FIGURE 4: Overview of the ETSI ENI reference points and the architecture and mapping to the components of the data engineering pipeline.
Data collectors and preprocessing entities may use the data connection frameworks described in Section II. The VNF Event Streaming (VES) collector in O-RAN is used as a telemetry collection interface and supports data gathering from O-RAN. Subsequently, the collected data can be shared with the non-RT RIC using the data ingestion frameworks described in Section III.
AI/ML models trained in AI/ML platforms can later be queried by the non-RT RIC via batch processing or Big Data query engines. The regular AI/ML workflow, including model training, inference and updates, batch data processing, big data query generation, and policy-based guidance of applications/features, can be performed using the data analysis and processing frameworks defined in Section IV. When a near-RT action is required, the ML inference results or policies/intents can be transferred via the A1 interface to the near-RT RIC, where the near-RT RIC can apply streaming, real-time, or interactive analytics to the dataset. Appropriate configurations can then be applied to control or data units via the E2 interface, which establishes communication between the lower RAN modules (RU, CU and DU) and the near-RT RIC and is used for time-sensitive control of the RAN components.
For performance monitoring, performance data from the ML models deployed in either the near-RT or the non-RT RIC can be linked to the data visualization and monitoring frameworks. By monitoring the relevant performance metrics, decisions can be made, such as whether model retraining is required. These update decisions can be triggered either by a rule-based policy (e.g., threshold-based) or by trend analysis approaches.
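The two kinds of retraining triggers mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the monitored metric is taken to be model accuracy, the threshold values and the accuracy history are hypothetical, and the trend is estimated with an ordinary least-squares slope, which is only one of many possible trend analysis approaches.

```python
from statistics import mean
from typing import Sequence


def threshold_trigger(accuracy: Sequence[float], min_accuracy: float) -> bool:
    """Rule-based policy: retrain when the latest metric falls below a floor."""
    return accuracy[-1] < min_accuracy


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of the metric against its sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def trend_trigger(accuracy: Sequence[float], max_drop_per_step: float) -> bool:
    """Trend-based policy: retrain when the metric declines faster than allowed."""
    return trend_slope(accuracy) < -max_drop_per_step


history = [0.95, 0.94, 0.93, 0.91, 0.90]  # hypothetical monitored accuracy
retrain = threshold_trigger(history, 0.85) or trend_trigger(history, 0.01)
print("retraining required:", retrain)
```

In this example the accuracy is still above the assumed floor of 0.85, but the downward trend already exceeds the allowed drop per monitoring step, so the trend-based rule fires earlier than the threshold-based one.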

B. ETSI'S AI/ML ARCHITECTURE
ETSI has several specification groups working on embedding intelligence into network services and management infrastructures. ETSI's Industry Specification Group (ISG) ZSM aims to develop a new horizontal and vertical E2E architectural framework designed for closed-loop 100% automation and optimized for data-driven AI/ML algorithms [134]. An extensive list of requirements for zero-touch management, containing more than 170 topics on autonomic management, has already been defined by the ETSI ZSM group [142]. In ETSI's ISG NFV, the AI/ML platform will eventually be considered as part of the MANO stack. The ETSI Technical Committee (TC) Core Network and Interoperability Testing (INT) is investigating the Generic Autonomic Network Architecture (GANA) for the purposes of autonomic networking [143]. Similarly, the ETSI Experiential Networked Intelligence (ENI) group is working on the application of AI in telecommunication networks to help operators manage infrastructure and provide more resilient services to end users, and has also recently published standards [144]. Experiential learning is learning through experience. The ETSI ENI architecture uses AI techniques and policies driven by context awareness and metadata so that the services provided can be adapted to environmental conditions, user needs and business goals. The main goal is to develop a control loop based on the "observe-orient-decide-act" model. Fig. 4 shows an overview of the ENI architecture and reference points and their corresponding mappings to the presented data engineering components. The API broker in the middle is an optional functional block that acts as a gateway (i.e., translator) and maps to the data connection framework of the data engineering pipeline. Within the ENI system, there are several functional blocks that mainly represent the management and application components connected to the semantic bus.
These functional blocks can be implemented as part of the data analysis and processing and data storage framework of the data engineering pipeline. For example, situation awareness blocks are used to detect events and behaviors in the ENI system and its environment. In the case of high traffic and large amounts of information, the corresponding streaming application used in the data analysis and processing framework is an important differentiating factor. The policy management functional block allows users to create and edit policies so that consistent and scalable decisions can be made about the system behaviour. The model-driven engineering block uses a set of domain models that abstract all concepts related to managing objects in the ENI system. Therefore, both the policy management functional block and the model-driven engineering block can be mapped to the data storage framework in the data engineering pipeline. A comprehensive overview of ETSI ENI on using AI techniques for network management and orchestration can also be found in the references [145], [146].

C. ITU MACHINE LEARNING PIPELINE
International Telecommunication Union (ITU)'s FG-ML5G group is working on an ML pipeline [147]. Fig. 5 shows an example realization of the high-level architecture in an IMT-2020 network and the corresponding mapping. In the pipeline of Fig. 5, there are several nodes for creating ML pipelines:
• source (represented by SRC) is a node that generates data to be used as input to the ML function.
• collector (represented by C) is a node responsible for collecting data from SRC.
• pre-processor (represented by PP) is a node used for pre-processing the data.
• model (represented by M) is an ML model used for prediction (note that training of the model is performed in a sandbox (not shown in Fig. 5)).
• policy (represented by P) represents the control mechanism for improving the operation.
• distributor (represented by D) is responsible for distributing the ML results to the corresponding sinks.
• sink (represented by C) is the target node of the ML output where the action is performed (for inference purposes).
Note that in Fig. 5, some subsets of nodes (e.g., PP, M, P, D) are inside the ML pipeline and are not shown. In Fig. 5, latency-sensitive applications use "ML pipeline 1", while latency-tolerant applications use "ML pipeline 2". Inputs from UE are processed by the ML pipeline represented by arrows 1 -> 2 -> 4 -> "ML pipeline 2", to make predictions for the Core Network (CN) (e.g., MPP-based ML applications). In the ML pipeline represented by arrows 5 -> 4 -> "ML pipeline 2" -> 6, the inputs of CN (as well as some combinations of UE inputs) are combined to make some predictions in the CN so that these actions can be performed by management functions (e.g., actions such as Self Organizing Network (SON)-level decisions at CN or closed-loop decisions about resource allocations in the network).
FIGURE 6: Functional framework of the 3GPP 5G system to support management and network data analytics services and mapping to the components of the data engineering pipeline.
In the ML pipeline represented by arrows 1 -> 3 -> "ML pipeline 1" -> 7, the inputs of UE are used for latency-sensitive applications in the access network, where model hosting and serving are also performed. When mapped to the corresponding data engineering pipeline frameworks, the connection of SRC to the collector (represented by C) is via the data connection module, the collector can be selected from the data ingestion frameworks, and ML pipelines 1 and 2 fall into the category of data analysis and processing frameworks.
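To make the chaining of the pipeline nodes concrete, the following sketch wires toy versions of the source, collector, pre-processor, model, policy, distributor and sink nodes together. Everything in it is a hypothetical stand-in: the cell load samples, the fixed 0.8 overload rule that replaces a trained model, and the "offload" action are assumptions for illustration and are not taken from the ITU specification.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def source() -> Iterable[Record]:
    """SRC: emits raw measurements (hypothetical cell load samples)."""
    for load in [0.2, None, 0.9, 0.95]:
        yield {"cell": "c1", "load": load}


def collector(samples: Iterable[Record]) -> List[Record]:
    """C: gathers the data produced by SRC."""
    return list(samples)


def pre_processor(samples: List[Record]) -> List[Record]:
    """PP: drops records with missing values."""
    return [s for s in samples if s["load"] is not None]


def model(samples: List[Record]) -> List[Record]:
    """M: stand-in for a trained model; flags overload above an assumed 0.8."""
    return [{**s, "overload": s["load"] > 0.8} for s in samples]


def policy(predictions: List[Record]) -> List[Record]:
    """P: turns predictions into control actions."""
    return [{"cell": p["cell"], "action": "offload"}
            for p in predictions if p["overload"]]


def distributor(actions: List[Record],
                sinks: List[Callable[[Record], None]]) -> None:
    """D: delivers each ML result to every registered sink."""
    for action in actions:
        for sink in sinks:
            sink(action)


applied: List[Record] = []   # the sink simply records the actions it receives
distributor(policy(model(pre_processor(collector(source())))), [applied.append])
print(applied)
```

In a deployment, each function would be backed by a framework from the corresponding category (data connection, ingestion, analysis and processing), while the call chain itself stays the same.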

D. 3GPP NETWORK DATA ANALYTICS FUNCTION
The architecture framework for 5G management and orchestration is specified in 3GPP. TS 29.520 in Release 16 is the standardization effort of 3GPP for 5G network automation using ML and data analytics. Within the latest approved 5G specification in 3GPP (Release 15), 3GPP identifies two main building blocks responsible for data analytics [148]: 1) NWDAF (Network Data Analytics Function) collects data from core network functions and provides network data analytics services to other Network Functions (NFs) of the 5G Core that are subscribed as NWDAF consumers [149]. The NWDAF offers two services (called Nnwdaf services). The first is the Nnwdaf_EventsSubscription service, which allows NF service consumers such as PCF and OAM to subscribe to or unsubscribe from various analytics events provided by the NWDAF. The second is the Nnwdaf_AnalyticsInfo service, which is used by NF consumers to request and receive specific analytics from the NWDAF. Hence, through the service-oriented Nnwdaf interface, other NFs can access analytics information. Fig. 6 shows the functional framework for the management and network data analytics services in 3GPP 5G systems and the corresponding mapping to the defined components of the data engineering pipeline. Data connection and data ingestion modules are located within the Service Based Management and Control Interfaces. Note that both the NWDAF and the Management Data Analytics Function (MDAF) provide data analytics. Data analysis and processing frameworks can be deployed in the NWDAF and the MDAF, while data ingestion frameworks are deployed as part of the service-based management and control interfaces. Using data visualization and monitoring tools (e.g., Grafana, Kibana), graphical dashboards can provide charts and notifications on the current operational status of each monitored source as well as analytical results.

E. OTHER ACTIVITIES AND SUMMARY
Other industry alliances, such as GSMA*, BDVA†, and TM-Forum‡, are also working on specifications of AI in larger domains, including telecommunications, and on the corresponding gap analyses.
In summary, recent advances in the standardization bodies have brought their own ideas and proposals for shaping the possible integration options of the AI/ML platform with the network infrastructure. On the other hand, recent technological advances in data engineering are progressing rapidly and the novelties and new functionalities of each framework in data engineering have not yet been sufficiently explored in the telecommunications standardization bodies. Similarly, a clear separation of data engineering pipelines within the proposed architectures has been neglected.

XI. DATA ENGINEERING USE CASES IN NETWORK MANAGEMENT AND ORCHESTRATION
There are several use cases in the telecommunications industry where the data engineering frameworks described above can be applied. Some of the relevant use cases are also discussed within standardization bodies and industry alliances such as ETSI [1], 3GPP [2], ITU [3], and the GSM Association (GSMA) [4]. Some of them are described below.
• ETSI ENI document [1] has classified use cases in four different dimensions: (i) infrastructure management (energy optimization using AI, handling planned peak events), (ii) network operations (intelligent fronthaul management and orchestration, radio coverage and capacity optimization), (iii) service orchestration and management (closed-loop (autonomic) fault management, autonomic performance management, context-aware service experience operation, intelligent network slicing management) and (iv) assurance (network fault identification and prevention, assurance of service requirements).
• ITU document [3] has compiled more than 30 use cases and their requirements. The use cases are divided into five categories: (i) Network slice, service, (ii) User plane, (iii) Applications, (iv) Signaling management, (v) Security. The requirements are divided into three categories: (i) Data collection, (ii) Data storage and processing, (iii) ML applications.
• GSMA report [4] has detailed seven typical use cases for intelligent autonomous networks in China: (i) AI for network planning and deployment, (ii) AI for network maintenance and monitoring, (iii) AI for network optimization and configuration, (iv) AI for service quality measurement and improvement, (v) AI for network energy saving and efficiency improvement, (vi) AI for network security protection, (vii) AI for operational services.
Data engineering solutions can help provide closed-loop automation, self-organizing, self-healing, self-decision making and self-optimizing network solutions for a variety of problems in network management and orchestration, network planning and design, network construction, network optimization, and network operations.
* https://www.gsma.com/artificialintelligence/applied-ai-forum/, accessed December-2021
† https://www.bdva.eu/sites/default/files/AI-Position-Statement-BDVA-Final-12112018.pdf, accessed December-2021
‡ https://www.tmforum.org/ai-data-analytics/, accessed December-2021
The scope of data engineering solutions can be diverse, spanning wireless networks, fixed networks, core networks and data centers. For example, considering that about 2000 parameters need to be optimized in 5G networks [150], network automation using recent advances in data engineering and data science becomes a crucial factor in bypassing the human-based optimization process. In [151], [11], several novel use cases for (wireless) network design using DL and AI capabilities are presented. Below are some examples where data engineering frameworks can enable or influence their functionalities:
• Using data connection frameworks: providing API gateways for network providers.
• Using data ingestion frameworks: real-time monitoring applications for hardware (routers, switches, other network devices), software and security (threat discovery and mitigation, DDoS, etc.), data distribution (multimedia distribution (IPTV, content delivery service, etc.), text messaging service, chatbots (for quick access to inquiries and information, e.g., known faults)), and OSS/BSS-related functionalities (providing real-time information (inventory/assets) to supply retail stores as part of supply chain management, billing services, network fault ticket management, network alarm management).
• Using data processing and analytics frameworks: leveraging AI/ML algorithms for CDR processing for churn analysis, real-time alarm correlation to identify and predict network faults, root cause analysis for network faults, preventive maintenance, anomaly detection, Customer 360 (tracking and analyzing user behaviour/needs and their interactions with channels), inventory/asset tracking as part of B2B products, recommender systems, network analytics, contact tracing, fraud detection, traffic flow prediction, intelligent network and service slices management and configuration, intelligent initial access and handover at RAN, context-aware service experience optimization, intelligent carrier management SD-WAN, SLA path adaptation for network delays, intelligent software rollouts, energy optimizations with AI, policy-driven IP-managed networks, intelligent fronthaul management and orchestration, service requirement assurance, federated learning for privacy awareness, transmission optimization, opportunistic data transmission in vehicular networks, predictive power management, automated scaling of VNFs, automated deployment of network service slices, and automated site design based on coverage and capacity.
• Using data monitoring and visualization frameworks: visualization of transport network equipment to enable data-driven infrastructure decisions, and visualization of service mesh topology to monitor traffic flow and display metrics.
Finally, note that each of the frameworks or their interconnected versions in Table 5 can be deployed at different levels of the network depending on the requirements of the use cases. These can be divided into three levels:
• In node level operation, data is collected, processed or analyzed at individual nodes (e.g., at UEs or device level) and no network connection is established. This is useful for data security and privacy and for reducing latency and complexity (since the data is processed at the device level). However, since the processing is done at the node level, performance limitations in the analysis results are expected.
• In network level operation, data is collected, processed and analyzed within a single domain of the network (e.g., at the RAN or in the core network). This increases data diversity because the data catalogue contains data from multiple nodes in that domain (e.g., from AMF, User Plane Function (UPF), etc. in the 5G core domain), leading to better performance optimizations. On the other hand, data security/privacy and delays are some of the drawbacks of network level processing.
• In global level operation, data collection, processing and analysis are done with complete knowledge of the data sources in the network across different domains (e.g., E2E network service management). One main benefit of this approach is high performance. On the other hand, there are issues related to global data integration and deployment cost in this mode of operation.

XII. GAP ANALYSIS, CHALLENGES AND FUTURE DIRECTIONS
Most telecommunication companies today require more comprehensive solutions that address both the complexity of their infrastructure and the intense needs of their users. Since data engineering technologies are younger than traditional telecommunication services and products, only a few telecommunication infrastructure providers are aware of the capabilities of data engineering solutions. However, there is a growing number of use cases and a growing community and interest in early and rapid adoption of data engineering tools and technologies in the telecommunications world. Some early case studies have highlighted some of the existing gaps and challenges in the adoption of Big Data analytics in the telecommunication industry [152], [153], [154], [155], [156], [157]. As telecommunication providers try to maintain their status quo, they are at risk of being left behind with their products and services in a rapidly changing ecosystem. In this section, we explore the potential gaps, challenges, and future directions in the adoption of data engineering approaches in the telecommunication industry.

A. GAP ANALYSIS
Our survey results show that there are several gaps between developments in the world of data engineering and telecommunications. These can be summarized as follows:
(i) Data engineering framework deployment risks: It is critical for telecommunication infrastructure and service providers to understand the advantages and disadvantages of the technologies currently in use and the emerging cutting-edge technologies in the data engineering world. There are stringent requirements for operational efficiency, availability, reliability, robustness, and stability of telecommunication networks when systems are deployed in production environments. For this reason, the risks associated with the potential deployment of these emerging technologies within a mature and traditional telecommunications infrastructure must be fully assessed. For example, in the case of deploying ML models, the inherent randomness of ML systems may make it difficult to achieve reproducible results or workflows across different experiments [158], which may affect the reliability of the overall deployment process of the data engineering pipeline.
There are several industry players such as Nokia, HPE, Juniper, etc. that already offer network solution products (mostly in the wireless area) that leverage BDA as listed in Table 3 of [32]. However, most of the newer products are not very mature and not fully tested in production environments. Therefore, most of the new AI/ML-based technologies, e.g., advances in DL-based neural network architectures, have not yet been tested in real-world telecommunication applications. This is due to their low Technology Readiness Levels (TRLs), which makes it difficult to assess their potential adoption. One way to anticipate failures or potentially problematic scenarios in data engineering/science projects and their use in production is to monitor their adoption in other industries so that they can be intelligently adopted in the telecommunication domain. For example, some of the challenges described in [158] in developing pipelines for data management, model learning, verification and deployment in various domains such as computer vision, human-in-the-loop neuropathology, etc. may also be useful for telecommunication operators. TRL assessments of some of the most important and representative AI technologies, as listed in [159], can help telecommunication operators better understand the state of the art of a particular technology when planning to adopt it in telecommunication infrastructure. The integration and scaling issues raised in [160] when deploying AI/ML in a cross-organizational context (e.g., between a hospital and a service provider) may be useful for telecommunication operators when integrating AI/ML platforms into their vertical industries [161], [162].
For this reason, it will be valuable for telecommunication providers to have first-hand knowledge of the shortcomings of these platforms, recognize the weaknesses of these systems, adopt the good parts of the most needed frameworks, and learn from past experiences. This will allow them to adapt to the changing landscape of data technology with less operational and technical complexity.
(ii) Operational costs: Introducing new features and capabilities related to emerging use cases within the telecommunication infrastructure is attractive but at the same time can be costly due to operational expenditures. For example, IT and cloud giant Google has shown that deploying ML-enabled systems in the real world incurs huge ongoing maintenance costs due to a variety of tasks such as configuration, data collection, feature extraction, data verification, analytics tools, machine resource management, process management tools, serving infrastructure and monitoring, in addition to the creation of ML code [163]. Moreover, operations units and network engineers working on the day-to-day operation of telecommunication networks should have additional skills such as data modeling, software engineering and system design, ML libraries, etc. On the other hand, data engineers, data scientists, and ML engineers working in the telecommunication world should acquire domain expertise in the functioning of the legacy systems so that accurate modeling of these systems using the data landscape is possible.
(iii) Support for services: Traditional telecommunication infrastructures and vendor-based solutions are mature technologies with advanced enterprise support in case of service outages or activation of new features. On the other hand, many enterprises also rely on "microservices" because they are highly flexible and can be easily developed to meet dynamic business needs. Microservices based design requires constant communication between the various components of these services to keep them in sync with each other.
On the other hand, the data engineering ecosystem is still young and rapidly evolving. Hence, some open source technologies developed within the data engineering ecosystem may have a larger community and advanced ecosystem compared to vendor-based data analytics solutions. For example, Big Data technology providers such as Cloudera/Hortonworks or MapR offer support services and subscription service models for their open source toolboxes. However, deploying these open source data engineering technologies as a telecommunication service may require extensive support from internal teams such as OSS or CRM departments of telecommunication providers. The authors in [164] have shown that providing reliable data science services (e.g., when simply combining open data from different open APIs) can pose several software engineering challenges to enterprises.
(iv) Performance guarantees: Mobile and fixed networks have different SLA guarantees, as mobile networks consist of RAN, transport and core networks. For example, the RAN is prone to interference and complex propagation environments that can lead to unpredictable outcomes. Therefore, in different network scenarios and conditions, algorithms within different layers of the protocol stack (e.g., in the RAN domain, L3 algorithms (load balancing, mobility and session management, etc.) and L1/L2 algorithms (power control, link adaptation, scheduling, etc.)) aim to improve telecommunication-specific KPIs collected from various data sources (flows, logs, streams, databases, etc.) and bring network conditions to a steady state.
At the same time, in the world of data engineering, other KPIs come into play, such as scalability, latency (which measures how close the system is to delivering streaming and messaging in real time; for example, a P95 latency of less than 5 ms means that 95% of all data processing requests should complete in less than 5 ms), input rate (how much data flows from a system like Kafka or Pulsar in one second), and processing rate (which indicates how fast data analysis can be performed). For example, if the input rate is greater than the processing rate, the system lags behind, so scaling within the cluster must be done to handle the greater data load. For this reason, data engineering and telecommunication-specific KPIs are interrelated. However, there is currently no standard way to combine network-specific KPIs with data engineering KPIs. The specification of the common KPIs and the resulting necessary SLAs remains an open research area, depending on the different data-related use cases in such an integrated production system.
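The data engineering KPIs above can be made concrete with a short sketch. It computes a P95 latency with the nearest-rank definition (one of several percentile conventions, chosen here as an assumption), estimates the backlog that builds up when the input rate exceeds the processing rate, and derives the number of workers needed to keep up. All sample values, rates and function names are hypothetical.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def backlog_growth(input_rate: float, processing_rate: float, seconds: float) -> float:
    """Records accumulated over the interval when input outpaces processing."""
    return max(0.0, (input_rate - processing_rate) * seconds)


def workers_needed(input_rate: float, per_worker_rate: float) -> int:
    """Smallest number of equal workers whose combined rate covers the input rate."""
    return math.ceil(input_rate / per_worker_rate)


latencies_ms = [1.0] * 95 + [8.0] * 5          # hypothetical request latencies
p95 = percentile(latencies_ms, 95)
print("P95 latency (ms):", p95, "meets 5 ms target:", p95 < 5.0)

# Hypothetical stream: 12,000 records/s arriving, 10,000 records/s processed
print("backlog after 60 s:", backlog_growth(12_000, 10_000, 60))
print("workers needed at 5,000 records/s each:", workers_needed(12_000, 5_000))
```

Note that the P95 target is met here even though 5% of the requests take 8 ms, which illustrates why a percentile-based SLA has to be read together with the network-specific KPIs it is supposed to protect.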
(v) Customized functionalities: Some of the custom demands from the telecommunication world would be difficult to implement in the data engineering ecosystem, as the convergence of these requirements may be different. Therefore, some use cases may require more effort and investment. Some examples that require complex functionalities in the telecommunication world are response times of less than 1 ms and reliability above 99.99999% in connected autonomous solutions (e.g., drone delivery systems, drone swarms, etc.), guaranteed microsecond delay jitter in industrial automation and robotics solutions [165], [166], and data rates up to 100 Gbps for highly mobile hotspots [167].
(vi) E2E ML lifecycle management: Due to the exponential growth in the use of AI/ML technologies in software and hardware systems, the development and deployment of ML systems currently tends to be rushed, isolated from real-world environments, and without the context of the larger systems or broader products into which they are to be integrated for deployment [158]. For this reason, current ML project lifecycle processes and guidelines do not follow clearly defined processes and testing standards that facilitate the development of high quality and reliable results. This is also true for development in the telecommunications sector, which relies on AI/ML technologies.
In a typical telecommunications system, concerns such as data, experimentation or model management, deployment, reproducibility, and testing & monitoring should be considered depending on the ML platform. In ML project lifecycle management as described in [168], all business requirements and goals of the project must be defined before the project starts. After the business requirements are co-decided and the project objectives are defined, the data collection and preparation phase follows. Then come the feature engineering and model training stages (in the case of DL, these are grouped under the term model training), and model evaluation is performed during the AI/ML training process. After the best model is selected, the model deployment, model serving, model monitoring, and model maintenance stages need to be executed sequentially. Depending on the execution stage, different feedback must also be provided to the previous stages. For example, in the training phase, the model evaluation stage provides feedback to the model training, data collection & processing, and even goal definition stages. Similarly, model maintenance can provide feedback to the model training and data collection & processing stages. A lean Machine Learning Technology Readiness Levels (MLTRL) framework for developing and deploying robust, reliable, and responsible ML systems, as proposed in [158], can also be used in a telecommunications engineering project.
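The sequential stages and the feedback paths described above can be written down as a small data structure, which makes it easy to check that every feedback edge points back to an earlier stage. The stage names follow the text; taking model maintenance as the final stage (as suggested by the feedback description) and the helper function names are assumptions of this sketch.

```python
from typing import Dict, List, Optional

# Stages in execution order (final stage assumed to be model maintenance)
STAGES: List[str] = [
    "goal definition",
    "data collection & processing",
    "feature engineering",
    "model training",
    "model evaluation",
    "model deployment",
    "model serving",
    "model monitoring",
    "model maintenance",
]

# Feedback edges named in the text: stage -> earlier stages it informs
FEEDBACK: Dict[str, List[str]] = {
    "model evaluation": ["model training", "data collection & processing", "goal definition"],
    "model maintenance": ["model training", "data collection & processing"],
}


def next_stage(stage: str) -> Optional[str]:
    """Return the stage that follows `stage`, or None after the last one."""
    idx = STAGES.index(stage)
    return STAGES[idx + 1] if idx + 1 < len(STAGES) else None


def feedback_is_backward() -> bool:
    """Check that every feedback edge targets a stage earlier in the sequence."""
    return all(
        STAGES.index(target) < STAGES.index(src)
        for src, targets in FEEDBACK.items()
        for target in targets
    )


print(next_stage("model evaluation"))   # model deployment
print(feedback_is_backward())           # True
```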
(vii) Data collection and preprocessing: Data collection/gathering must be continuous to keep the resulting ML model up-to-date and compatible with practical infrastructure and systems [169]. Most ML models must function in dynamic data environments in production. For this reason, "concept drifts" (i.e., the degradation of model performance because the data seen in production becomes less similar to the data on which the model was trained) are likely and can affect the accuracy and reliability of the model over time. Therefore, building robust ML models in a changing mobile environment is different and requires active (continuous) learning. For this reason, continuous training has been proposed as part of the MLOps practice to re-train the production model [170], [169] frequently as new data becomes available or model performance degrades.
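One simple way to notice that production data has moved away from the training data is to compare their binned distributions. The sketch below uses the Population Stability Index (PSI), which is a technique chosen here for illustration and is not prescribed by the cited MLOps references; the bin edges, the synthetic samples and the alarm threshold of 0.25 (a common rule of thumb) are likewise assumptions.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def bin_shares(values: Sequence[float], edges: Sequence[float],
               eps: float = 1e-6) -> List[float]:
    """Share of samples per bin; out-of-range values are clamped to the edge bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = min(max(bisect_right(edges, v) - 1, 0), len(counts) - 1)
        counts[idx] += 1
    # eps avoids log(0) when a bin is empty
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(expected: Sequence[float],
                               actual: Sequence[float],
                               edges: Sequence[float]) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    e = bin_shares(expected, edges)
    a = bin_shares(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
training = [i / 100 for i in range(100)]            # hypothetical training feature
production = [0.5 + i / 200 for i in range(100)]    # hypothetical shifted traffic

psi = population_stability_index(training, production, edges)
print("PSI:", round(psi, 3), "retrain:", psi > 0.25)  # 0.25 is an assumed threshold
```

A check of this kind could feed the continuous training loop: when the index crosses the chosen threshold, the retraining pipeline is scheduled with the most recent data.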
Uniform and homogeneous data collection from all components of the network (NFV, IoT, 5G, etc.) and the data discovery process remain a significant gap between practice and ongoing efforts in both standardization and framework development. Not all vendor-provided functionality is standards-compliant or has clear interfaces for acquiring data for analytics services. At the same time, obtaining real data can be time consuming and complex, especially when it comes to obtaining an accurate data set from a variety of sources (data streams, logs, databases, etc.) and extracting useful information from it. Data may be dirty, not easily accessible (e.g., proprietary) or not available at certain times within production systems. In addition, the sampling frequency of the system and the temporal stationarity of the distribution of the sampled data are also critical to data collection and must be investigated depending on the application.
In parallel with data collection, data stored in Data Lakes needs to be cleaned and categorized, as it may be incomplete, not correctly normalized or labelled, and noisy. These data should also be prepared for further analysis in the data engineering pipeline applications (e.g., in data processing and analysis frameworks for AI/ML algorithms). Some of the preprocessing activities are handling missing data (imputation of missing values for numeric and categorical data), data scaling (e.g., min-max scaler, standard scaler, max abs scaler, robust scaler, power transformer, quantile transformer, normalizer, etc.*), handling outlier data, transforming data types, dimensionality reduction, identifying numerical and categorical features, encoding categorical features, feature engineering/selection, and sampling tasks, to name a few. These are other missing dimensions in the current data engineering architectures for network management and orchestration. These aspects are critical, as missing values or incorrectly populated datasets can lead to inconsistent analysis results.
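A few of the preprocessing activities listed above are sketched below using only the Python standard library, so that the arithmetic behind each step is visible; in practice a library such as scikit-learn would typically be used instead. The function names and the sample values (throughput figures and radio access technology labels) are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values: Sequence[float]) -> List[float]:
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(categories: Sequence[str]) -> List[List[int]]:
    """Encode a categorical feature as indicator columns (sorted label order)."""
    labels = sorted(set(categories))
    return [[1 if c == label else 0 for label in labels] for c in categories]


throughput_mbps = [10.0, 20.0, 30.0]        # hypothetical numeric feature
rat = ["4G", "5G", "4G"]                    # hypothetical categorical feature

print(min_max_scale(throughput_mbps))       # [0.0, 0.5, 1.0]
print(standard_scale(throughput_mbps))
print(one_hot(rat))                         # [[1, 0], [0, 1], [1, 0]]
```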
As a general rule, less than 1% missing data is trivial to handle, 1-5% missing data can be manageable, 5-15% missing data requires sophisticated methods to handle, and more than 15% can seriously affect any kind of interpretation [171]. Some of the most popular solutions are either removing the data (which leads to information loss and biased assessments) or using some advanced imputation techniques and maximum likelihood methods [172]. Multiple imputation using chained equations (MICE) [173] and factor analysis of mixed data (FAMD) [174] are some of the commonly used imputation techniques for hybrid missing data sets (e.g., in both categorical and continuous data). Recent developments and solutions using DL such as DataWig (which trains a neural network based classifier to predict the missing values) can also be used for scenarios with many missing observations [175]. As the authors note in [176], while there are many approaches that deal with missing values, they are mostly designed for matrices only. However, in many real-world applications, the data is not only available in numeric format, but may also be in textual form or as an image. Another main problem with the above traditional approaches is that rare values that are common in heavy tailed real-world datasets cannot be accurately identified with the trained models [177].
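The rule of thumb on missing-data ratios can be expressed as a small helper that measures the share of missing values in a column and maps it to the four bands above. The handling of the band boundaries, the labels and the sample KPI column are assumptions of this sketch, and the mean imputation shown at the end is only a simple stand-in for the more advanced techniques (MICE, FAMD, DataWig) named in the text.

```python
from statistics import mean
from typing import List, Optional, Sequence


def missing_ratio(column: Sequence[Optional[float]]) -> float:
    """Fraction of entries that are missing (represented as None)."""
    if not column:
        raise ValueError("empty column")
    return sum(v is None for v in column) / len(column)


def severity(ratio: float) -> str:
    """Map a missing-data ratio to the bands of the rule of thumb."""
    if ratio < 0.01:
        return "trivial"
    if ratio <= 0.05:
        return "manageable"
    if ratio <= 0.15:
        return "needs sophisticated methods"
    return "interpretation at risk"


def impute_mean(column: Sequence[Optional[float]]) -> List[float]:
    """Replace missing entries with the mean of the observed ones."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    return [fill if v is None else v for v in column]


kpi = [10.0, None, 30.0, 20.0]              # hypothetical KPI samples
ratio = missing_ratio(kpi)
print(ratio, severity(ratio))               # 0.25 interpretation at risk
print(impute_mean(kpi))                     # [10.0, 20.0, 30.0, 20.0]
```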
In particular, mobile data collected from network devices is often subject to redundancy, loss, mislabeling or class imbalance, and thus needs to be preprocessed before it can be used directly for training. Compared to the settings in which traditional ML and DL methods process missing values in batch mode, telecommunication systems are more dynamic, and more missing data arrives per unit of time due to the nature of networks and wireless communication infrastructure. Therefore, special care must be taken when processing dynamic and fast telecommunication traffic.
* https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing, accessed: July 2021
(viii) Model training with real data: The goal of model training is to optimize and rapidly converge model parameters to optimal and consistent values given a set of training data. However, developing a model to meet specific product requirements may require combinatorial search for parameters, variables, etc., which can become difficult as the model becomes more complex. Training models for production systems can be costly due to a lack of either data or properly labeled data (especially for supervised models that require data augmentation, e.g., labeling large amounts of data, expert experience). In such scenarios, simulated data can be used to train the models, but the gap between practice and theory, i.e., the lack of adequate knowledge transfer from simulation to real conditions, can be a major problem.
Training requires observing a wide range of scenarios and, in the case of network management and operations, generating different network configurations and scenarios for training purposes that can potentially disrupt network operations. For example, in reinforcement learning applications, the agent must interact with the environment as it tries different actions and receives feedback to improve the outcome based on its actions. During this process, the agent makes several mistakes while learning, which requires a large number of steps to converge to an optimal or near-optimal solution. For this reason, the training of the agent is not performed in the real infrastructure, since errors can have serious consequences for the networks (failures, false alarms, downtime, etc.). Instead, a simulator can be built that mimics the real network environment, and the agent is trained offline in this simulator environment. However, the simulator must meet high fidelity requirements because the real network may be different from the network used for training. Moreover, the agent cannot operate appropriately if the discrepancy between the real and simulated environments is large [178].
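
To illustrate offline training against a simulator, the following sketch uses an epsilon-greedy agent on a toy environment. This is a bandit-style simplification rather than a full reinforcement learning algorithm. The class `SimulatedNetwork`, the three hypothetical configurations and their success probabilities are all assumptions. The agent makes many poor choices while exploring, which is harmless here but would not be on a live network.

```python
import random

class SimulatedNetwork:
    """Toy simulator: each action (a hypothetical configuration) succeeds with a fixed probability."""
    def __init__(self, success_prob, seed=0):
        self.success_prob = success_prob
        self.rng = random.Random(seed)

    def step(self, action):
        # Reward 1.0 if the configuration "worked" in the simulator, else 0.0.
        return 1.0 if self.rng.random() < self.success_prob[action] else 0.0

def train_epsilon_greedy(env, n_actions, steps=5000, epsilon=0.1, seed=1):
    """Learn a running-mean reward estimate per action by trial and error."""
    rng = random.Random(seed)
    value = [0.0] * n_actions
    count = [0] * n_actions
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)                        # explore
        else:
            action = max(range(n_actions), key=lambda a: value[a])   # exploit
        reward = env.step(action)
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]
    return value

simulator = SimulatedNetwork([0.2, 0.5, 0.9])
learned_values = train_epsilon_greedy(simulator, n_actions=3)
best_action = max(range(3), key=lambda a: learned_values[a])
```

If the success probabilities of the real network differ substantially from those assumed in the simulator, the learned estimates no longer rank the actions correctly, which is the fidelity problem described above.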
In the case of ever-changing mobile network environments, ML models should have the ability to learn continuously (active learning discussed in Section XI. B bullet no. (xviii)) or perform transfer learning [179]. Transfer learning can accelerate the training process when the conditions in the mobile network change significantly. Basically, transfer learning aims to reduce the amount of training data required to learn a task, by reusing the feature extraction layers learned on other datasets [179]. In other words, it enables the rapid transfer of knowledge from pre-trained models to other types of datasets. Transfer learning can be used to improve the performance of models that learn with limited data. For example, if software viruses are spreading rapidly in the network, the anomaly detection model or antivirus software detection model built into the network equipment should be able to respond to these attacks in a timely manner with the limited information available.
In transfer learning, a model learnt in a particular environment (e.g. cellular Base Stations (BSs) operating at low-bands, < 1 GHz) can be transferred and adapted to another network node operating in a different environment (e.g. cellular BSs operating at mid-bands (1-6 GHz) or high-bands (> 20 GHz)). This is technically referred to as frequency-based transfer learning in [180]. When the system dynamics in the environment change, e.g., due to a device malfunction, a different, previously unknown terrain, different frequencies, etc., the result is a completely different data set than the one previously used for training. Therefore, the previously trained model must be trained again in this new scenario/at this new frequency, which is inefficient because all these datasets must be acquired again.
Transfer learning approaches applied in wireless communication aim to tune the existing models with a small amount of data in the changed environment [181]. For example, frequency-based transfer learning transfers the models trained on different frequencies to the target frequency, while scene-based transfer learning transfers the models trained on different scenes to target scenes that use the same frequency [180]. The authors in [180] also showed that both frequency-based and scene-based transfer learning models can predict path loss with small errors by using limited data of the new environment and learning the regularities between path loss and scenario information in detail.
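
The idea of tuning an existing model with a small amount of data from the changed environment can be sketched with a deliberately simple stand-in. This is not the method of [180]. It assumes a linear path loss model PL = a + b log10(d), in which the slope b learned on a source band is reused and only the intercept a is re-estimated from three target-band samples. All numerical values are synthetic, noise-free assumptions.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def fine_tune_intercept(slope, xs, ys):
    """Reuse the pre-trained slope; re-estimate only the intercept from a few target samples."""
    return sum(y - slope * x for x, y in zip(xs, ys)) / len(xs)

def rmse(a, b, xs, ys):
    """Root-mean-square error of the line y = a + b * x on the given samples."""
    return math.sqrt(sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Source band: plenty of data. Target band: three samples and a shifted intercept.
src_x = [math.log10(d) for d in range(10, 1000, 10)]
src_y = [70.0 + 35.0 * x for x in src_x]
tgt_x = [math.log10(d) for d in (50, 200, 800)]
tgt_y = [82.0 + 35.0 * x for x in tgt_x]

a_src, b_src = fit_line(src_x, src_y)              # pre-training on the source band
a_tgt = fine_tune_intercept(b_src, tgt_x, tgt_y)   # adaptation with limited target data

error_without_transfer = rmse(a_src, b_src, tgt_x, tgt_y)
error_with_transfer = rmse(a_tgt, b_src, tgt_x, tgt_y)
```

In a neural model the analogue would be to freeze the feature extraction layers and retrain only the final layers on the target data.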
(ix) Model deployments in real-world environments: Comprehensive monitoring of system behavior and taking automatic actions (without direct human intervention) are crucial for higher system reliability in the long run. On the other hand, deploying models in production is still not an easy task. There are pre-deployment, deployment, and non-technical challenges during the deployment and during the operation of ML models in practice [182]. According to the authors in [183], the main challenges are mainly related to model integration (operational support, code and model reuse, software engineering anti-patterns and mixed team dynamics), model monitoring (feedback loops, outlier detection, custom design tools) and model updating (concept drift and continuous delivery). Note that in a production environment dozens or hundreds of models may be running simultaneously. Therefore, the developed ML models should be monitored, alerted and automatically recovered in case of failures to achieve a certain service level goal.
Depending on the area where improvements are needed in terms of intelligence, use case requirements and static/dynamic characteristics of the environment, the update frequency of the ML model may also vary. For example, choosing an optimal threshold in auto-encoder-based neural networks (e.g., in anomaly detection applications) is important to achieve a good trade-off in certain metrics such as precision and recall [184]. Thus, if a new training/validation dataset is created frequently, the optimal threshold and the corresponding model must also be updated frequently to cope with changes in the state of the outside world. In the case of dynamic network environments (e.g., RAN algorithms using L1 to L3 transmission parameters, modulation and coding schemes, resource allocations, etc.), the model and corresponding hyperparameters should also be updated frequently (fast time scales on the order of seconds/milliseconds), as the ML models can degrade or exhibit biases and user behaviour may change over time [185]. In network optimization, the model update frequency can be on the order of hours/days/weeks (e.g., hyper-parameters for SON algorithms). However, for network design, this update frequency can be on the order of weeks/months (e.g., when deploying a new cell in a given geographic region) (see Figure 1 of [185] for more information on the main areas of performance improvement). For this reason, depending on the scenario considered, an appropriate feedback loop for the deployment of the model is also required to achieve good and timely results.
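
The threshold selection mentioned above can be sketched as follows, assuming that the anomaly score is an auto-encoder reconstruction error and that a labeled validation set is available. The F1 score is used here as one possible way of balancing precision and recall. The function names and the sample values are illustrative assumptions.

```python
def precision_recall(errors, labels, threshold):
    """Flag a sample as anomalous when its reconstruction error exceeds the threshold."""
    tp = sum(e > threshold and y == 1 for e, y in zip(errors, labels))
    fp = sum(e > threshold and y == 0 for e, y in zip(errors, labels))
    fn = sum(e <= threshold and y == 1 for e, y in zip(errors, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_threshold(errors, labels):
    """Pick the candidate threshold with the highest F1 score on a validation set."""
    best, best_f1 = None, -1.0
    for t in sorted(set(errors)):
        p, r = precision_recall(errors, labels, t)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best, best_f1 = t, f1
    return best, best_f1

# Hypothetical validation set: reconstruction errors and ground-truth labels (1 = anomaly).
val_errors = [0.1, 0.2, 0.15, 0.9, 0.8, 0.3]
val_labels = [0, 0, 0, 1, 1, 0]
threshold, f1 = best_threshold(val_errors, val_labels)
```

Whenever a new validation set is produced, `best_threshold` has to be run again, so the threshold update frequency follows the dataset refresh frequency.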
As the frequency of ML model usage in a service provider increases, many models need to be supported either sequentially or concurrently by a model server. Several deployment options are available, such as A/B testing (one set of data is used by one model and the remaining is used by another model) [186], ensembling (combine multiple models to get a stronger model) [187] or cascading [188] (make predictions based on a model, e.g., detect anomalies within the infrastructure or find the root cause). Depending on the requirements and the complexity of the deployment pattern, different options for the selection of algorithms, architectures, tools, etc. need to be defined. For example, unsupervised learning algorithms may offer lower latency and cost savings at the expense of performance degradation in data analysis and processing compared to supervised learning algorithms. As an alternative solution, the developed ML models for production systems can also be embedded in the operating system kernel and provided as a system service.
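
Two of the deployment options named above, A/B testing and ensembling, can be sketched in a few lines. The hash-based traffic split and the majority vote are common choices but are assumptions here, not prescriptions of [186] or [187]. The request identifiers and labels are hypothetical.

```python
import hashlib
from collections import Counter

def ab_route(request_id, share_a=0.5):
    """Deterministically assign a request to model 'A' or 'B' by hashing its identifier."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32   # value in [0, 1)
    return "A" if bucket < share_a else "B"

def majority_vote(predictions):
    """Ensembling by majority vote over the labels predicted by several models."""
    return Counter(predictions).most_common(1)[0][0]

# The same request always reaches the same model, which keeps the comparison consistent.
assignment = ab_route("subscriber-0001")
combined = majority_vote(["anomaly", "normal", "anomaly"])
```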
(x) New architecture for event driven applications: Traditional telecommunication systems and their legacy applications are based on an application architecture that uses APIs [189]. A common gap in the traditional telecommunication network architectures is that they were not originally designed for event-driven applications, although recent efforts on Service Based Architecture (SBA) in the 5G core network have shown some tendencies towards their deployment [190]. On the other hand, data-driven architectures for telecommunication systems are based on using huge amounts of data and transferring them to an AI/ML platform for large scale analytics [14]. However, this may disregard some of the already existing application capabilities, such as enterprise integration capability or agility. For this reason, a balance is needed between a data-driven architecture (which specializes in transferring large amounts of data between applications) and an application architecture (which ensures that the functionality of one application is executed in response to a request from another application).
The general approach in the industry is to move to stateful, event-driven and event-time-aware processing using the concepts of events, streams, producers, and consumers. Many industries have already started to move from a monolithic architecture to a microservice architecture for scalability and maintainability reasons [191]. Event-Driven Architecture (EDA) has already proven itself in the cloud and IT communities and will become the software architecture paradigm of choice in the telecommunication domain in the coming years. Together with the introduction of the concept of SBA in the 5G core [192], it is expected that it will soon be used in the architecture of mobile networks.
An event is a change of state or an update in the system. EDA uses a sequence of events to trigger and control communication between microservices (i.e., decoupled services). It is particularly suitable for applications based on microservices interconnected by fast asynchronous events [193]. In an event-driven system, there are collections of independent services between which there is no direct coupling. The data schema is the only dependency between them. This increases the resilience (since failures in a service do not escalate) and extensibility (easy addition of new independent services to the existing systems, e.g., notification service) of the systems. Therefore, an EDA can successfully provide streaming, Pub/Sub, and Push patterns, while web services with REST/HTTP, API gateways, cronjobs, RabbitMQ, Kafka or data at rest with a Data Lake cannot. As a result, enterprises are starting to adapt to EDAs or event sourcing.
There are several ways in which events captured in real-time can be useful for data analysis (e.g., by correlating events with other introduced features, recent incidents, etc.). For example, in most streaming frameworks (e.g., Spark Streaming), window operations are performed as each message is received by the streaming processing framework (i.e., in processing time). However, the exact way to customize window operations is to support more advanced event time windows and perform computations based on event time, i.e., when the event was created [54]. Another good example of the application of EDA in telecommunication systems would be a Network Management System (NMS), where critical events can be responded to quickly in order to mitigate the problems in the network. The general workflow associated with EDA for this example could be as follows: (i) The NMS detects an anomaly and publishes an Anomaly-Detected event. (ii) The Root Cause Service subscribes to the event, processes it and computes the location and root cause of the anomaly. (iii) The Root Cause Service then publishes the RootCause event. (iv) The Region Support Service subscribes to this event and sends a notification to personnel in that region explaining the root cause of the problem.
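
The four-step NMS workflow above can be sketched with a minimal in-memory publish/subscribe bus. The `EventBus` class, the topic names, the event fields and the placeholder root-cause logic are assumptions for illustration. A production system would use a real broker with asynchronous delivery, whereas this sketch calls the handlers synchronously.

```python
from collections import defaultdict

class EventBus:
    """Minimal in-memory publish/subscribe bus (a stand-in for a real broker)."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in list(self._subscribers[topic]):
            handler(event)

bus = EventBus()
notifications = []

def root_cause_service(event):
    # Steps (ii) and (iii): consume the anomaly, then publish a RootCause event.
    # Placeholder analysis: a real service would correlate alarms and topology.
    bus.publish("RootCause", {"region": event["region"],
                              "cause": "link failure on " + event["node"]})

def region_support_service(event):
    # Step (iv): notify the personnel of the affected region.
    notifications.append("[" + event["region"] + "] " + event["cause"])

bus.subscribe("AnomalyDetected", root_cause_service)
bus.subscribe("RootCause", region_support_service)

# Step (i): the NMS publishes an anomaly event.
bus.publish("AnomalyDetected", {"region": "north", "node": "router-7"})
```

The services know only the topic names and the event schema, not each other, so a further subscriber can be attached to `RootCause` without touching the existing ones.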
To tackle this complexity in the management and control plane of the telecommunications infrastructure, there are several automation tools that can manage the network service management lifecycle (e.g., Open Source MANO (OSM)*, Open Network Automation Platform (ONAP)†, Cloudify‡, etc.). However, they also require complex MANO procedures. For this reason, specialized skills (e.g., network virtualization, cloud services, etc.) are required to fully exploit their application potential and seize the opportunity to develop new and innovative value-added services. At the same time, these new technologies also bring their own specific challenges and obstacles when it comes to integration with data engineering frameworks. Therefore, the tools and libraries selected from the data engineering ecosystem should be well integrated with the broader telecommunication infrastructure systems based on the use cases and requirements. The support of the data engineering ecosystem or community for high quality tools and adoption of the latest technologies into the telecommunication infrastructure are also crucial in this process.
(xii) Licensing: In parallel with ecosystem integration, the licensing gaps for hybrid deployment types need to be further explored. Many of the open source tools for data engineering are licensed under Apache 2.0, which does not imply vendor-specific licensing. On the other hand, legacy telecommunication infrastructures are based on various vendor-specific equipment. Avoiding vendor dependency helps enterprises to develop their own customized services and explore new opportunities and business goals. The gap in the interplay between open source and vendor-locked system deployments is an ongoing issue and needs to be further explored.
(xiii) Synchronization aspects: In a traditional telecommunication system, one task may orchestrate multiple calls to internal or external services. Telecommunications systems require strict synchronization between multiple components of the data engineering platform and the telecommunications infrastructure. If synchronous orchestration of services fails, the entire service flow fails. Some of these synchronization requirements also arise during data collection, model and hyperparameter updates when multiple actions need to be performed by the AI/ML platform between the interconnected network domains (e.g., joint actions performed on both the core network and the transport networks). For example, after the data connection and ingestion phases are completed, the extracted and transformed data must be continuously synchronized with the original data sources. However, since data sources can be heterogeneous and change dynamically over time, a data source may be out of sync and out of date at the time of integration. This can lead to discrepancies in data schema and definition and cause problems in synchronizing these heterogeneous data sources. The development of such solutions for telecommunication networks in the field of data engineering is still an open research area.
* https://osm.etsi.org/, accessed: December-2021
† https://www.onap.org/, accessed: December-2021
‡ https://cloudify.co/, accessed: December-2021
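
One small part of this problem, detecting that a source schema has drifted away from the schema assumed at ingestion time, can be sketched as follows. The representation of a schema as a dictionary from field name to type name, and the example fields, are assumptions. Real pipelines would typically rely on a schema registry.

```python
def schema_diff(expected, observed):
    """Compare two schemas given as {field_name: type_name} dictionaries."""
    added = {k: observed[k] for k in observed.keys() - expected.keys()}
    removed = {k: expected[k] for k in expected.keys() - observed.keys()}
    changed = {k: (expected[k], observed[k])
               for k in expected.keys() & observed.keys()
               if expected[k] != observed[k]}
    return {"added": added, "removed": removed, "changed": changed}

def in_sync(expected, observed):
    """True when no field was added, removed or retyped."""
    return not any(schema_diff(expected, observed).values())

# Hypothetical schema at ingestion time versus the schema currently emitted by the source.
ingested = {"cell_id": "int", "rsrp_dbm": "float"}
current = {"cell_id": "str", "rsrp_dbm": "float", "band": "str"}
drift = schema_diff(ingested, current)
```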
(xiv) Lack of rigorous methodology in networking: Throughout its evolution, networking has evolved both scientifically and through trial and error and configuration based deployments in real systems. As a result, there are complex interaction patterns among the components of telecommunication networks. For example, in most cases, the network is designed to be distributed and each node (router, switch, gateway, etc.) observes only a portion of its environment. This also makes it difficult to apply conventional approaches/algorithms from computer vision or Natural Language Processing (NLP) (which also use standardized datasets such as the MNIST (Modified National Institute of Standards and Technology) database for handwritten digits or the ImageNet database, etc.) for direct comparisons of learning or inference algorithms to the complex networking systems. Therefore, a more rigorous and scientific approach is required when designing AI/ML systems in the area of complex and large-scale telecommunication systems.
(xv) Hybrid approach to data operations: Note that all of the above analytics frameworks including data collection, data analysis, data monitoring, data visualization, etc. can be performed either at the device, network or global level as described in Section XI. However, depending on the use case, a hybrid approach may also be required, comprising a flexible and distributed analytics architecture where some necessary data processing is performed at the device level and/or some partial processing is performed at the network level.
Distributing some of the functionalities of these frameworks across these levels can help improve network performance by reducing bandwidth overhead or network latency (which can be helpful for real-time applications). In [204], the authors have shown the benefits of such a hybrid approach to reduce the cost of data communication while ensuring that the accuracy of decision making for IoT networks does not significantly decrease. A flexible placement strategy of different data analytics modules that can be dynamically selected, combined or switched to achieve the best I/O performance is also explored in [205]. The benefits of data orchestration of use case-based analytics for 5G scenarios are proposed in [206].
In the case of such a hybrid approach, a different data analysis setup can be created for different use cases (e.g., mMTC, eMBB or URLLC in 5G network slices). In a network slicing setup where data should not flow outside the industrial site (e.g., a factory), the edge computing paradigm can be enabled. In this edge computing setup, for example, AI-based image processing for quality inspections can be performed at the edge instead of in cloud servers to reduce traffic and eliminate critical I/O performance bottlenecks. On the other hand, other network slices can continue to run their analytics modules on central servers. In the IoT data processing scenario, computationally intensive data training and inference generation can be performed at a global level (e.g., in the cloud). At the same time, data generated locally at the device level (e.g., from sensors) can be transformed and aggregated locally to save transmission energy and increase data protection while maintaining global accuracy as much as possible.
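
The local aggregation step can be sketched as follows. It assumes that the global level only needs the count, mean, minimum and maximum of a metric, in which case each device can send a four-number summary instead of its raw samples without any loss of accuracy. The function names and sample values are illustrative.

```python
def summarize(samples):
    """Device-level step: reduce a non-empty list of raw samples to a small, mergeable summary."""
    return {"count": len(samples), "sum": sum(samples),
            "min": min(samples), "max": max(samples)}

def merge(summaries):
    """Network/global-level step: combine per-device summaries into global statistics."""
    total = sum(s["count"] for s in summaries)
    return {"count": total,
            "mean": sum(s["sum"] for s in summaries) / total,
            "min": min(s["min"] for s in summaries),
            "max": max(s["max"] for s in summaries)}

# Hypothetical sensor readings held on two devices.
device_a = [1.0, 2.0, 3.0]
device_b = [4.0, 6.0]
global_view = merge([summarize(device_a), summarize(device_b)])
```

Statistics that cannot be merged exactly in this way (e.g., medians) would require approximate sketches or the transfer of more data, which is where the trade-off between communication cost and accuracy appears.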
Another example of functionality distribution is via federated learning. In federated learning, with limited interaction between nodes in the network, local construction of AI/ML models can be instantiated within a single component/node of the network. At a later stage, these small models at the individual distributed nodes can be sent back to a network level coordinator to build a global model and view of the network domain. Finally, the global models can be sent back to the local devices/nodes to improve performance [207].
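
The three stages described above (local training, aggregation at a coordinator, redistribution of the global model) can be sketched with sample-weighted averaging of model parameters. The linear model, the learning rate, the number of rounds and the two synthetic node datasets are assumptions chosen to keep the example small. They are not taken from [207].

```python
def local_update(weights, data, lr=0.1, epochs=20):
    """One node: a few gradient-descent steps for y = w0 + w1 * x on local data only."""
    w0, w1 = weights
    for _ in range(epochs):
        g0 = sum((w0 + w1 * x - y) for x, y in data) / len(data)
        g1 = sum((w0 + w1 * x - y) * x for x, y in data) / len(data)
        w0, w1 = w0 - lr * g0, w1 - lr * g1
    return (w0, w1)

def federated_average(local_weights, sample_counts):
    """Coordinator: average the local models, weighted by each node's sample count."""
    total = sum(sample_counts)
    dim = len(local_weights[0])
    return tuple(sum(w[i] * n for w, n in zip(local_weights, sample_counts)) / total
                 for i in range(dim))

def federated_round(global_weights, node_datasets):
    """One round: broadcast the global model, train locally, aggregate the results."""
    updates = [local_update(global_weights, d) for d in node_datasets]
    return federated_average(updates, [len(d) for d in node_datasets])

# Two nodes observe different input ranges of the same relation y = 1 + 2x.
node_a = [(x, 1.0 + 2.0 * x) for x in (0.0, 0.25, 0.5)]
node_b = [(x, 1.0 + 2.0 * x) for x in (0.5, 0.75, 1.0)]

global_w = (0.0, 0.0)
for _ in range(300):
    global_w = federated_round(global_w, [node_a, node_b])
```

Only model parameters travel between the nodes and the coordinator. The raw samples in `node_a` and `node_b` never leave their node.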
(xvi) Practical aspects versus system complexity: Another problem that is usually overlooked in the design of data systems is the increase of system complexity in practical systems, e.g., algorithms that are data hungry (increased amount of data) and computationally demanding (excessive use of CPU, RAM or storage capacity in servers). Indeed, deep neural network architectures require complex structures and in many cases provide powerful results (e.g., high classification accuracy) that represent a trade-off between accuracy and computational cost (e.g., computational cost for inference, time for hyperparameter optimization). However, despite their high model performance metrics demonstrated in particular in the fields of NLP and computer vision, they also require a significant amount of computational resources and power (e.g., Convolutional Neural Networks (CNN) which rely on operators such as convolution, rectified linear unit (ReLU), pooling and classification) and larger systems such as multicore CPUs and GPUs for fast and accurate performance computation [208].
For this reason, in some use cases, the deployment of deep neural networks, especially on embedded and mobile devices (e.g., training a complex image classification model using local data on resource-constrained (in terms of energy and capacity) mobile devices) may be either expensive or not possible. Therefore, very deep neural networks may not be suitable for these scenarios. Instead, lightweight architectures should be chosen, even though they compromise some performance metrics (e.g., accuracy) and are less suitable for complex tasks. This trade-off should be considered especially in resource-constrained smart environments [209]. As a solution, some advanced techniques and toolboxes can be used to deploy these complex DL models in mobile network applications (e.g., while compensating for small performance degradations) [10]. On the other hand, in some cases, e.g., when exploring tabular datasets, tree ensemble algorithms such as XGBoost can outperform deep neural network models in terms of accuracy, inference efficiency, and optimization time, as shown in [210], which also needs to be considered before increasing model complexity.
At the same time, note that adding new and more sophisticated data components can also slow down the entire data engineering pipeline in practical systems. In addition, new systems or components may poorly represent uncertainty, and may lack transparency and trust. Therefore, it is important to weigh the technical pros and cons of purely research-based solutions when designing the entire data engineering pipeline in practical real-world systems.
(xvii) Cloud vs. on-premise infrastructure: When designing a data engineering pipeline, the different deployment options (e.g., cloud (public), on-premise (private), or hybrid cloud) and the corresponding trade-offs should be thoroughly analyzed. First of all, there are several advantages to using cloud services. For example, cloud services offer high availability, easy scalability, resilience, cost reductions, and easy accessibility when a product reaches a higher level. On the other hand, building an on-premise infrastructure can ensure privacy, security and regulatory compliance for mission-critical services.
From a cost perspective, iteratively processing data in the data engineering pipeline (e.g., ML-based data analytics and processing frameworks) and running applications 24/7 in the cloud can be expensive compared to on-premise solutions. For this reason, in some scenarios, enterprises may be interested in taking advantage of both private and public clouds. Hybrid options can leverage the different features and characteristics of multiple platforms as well as traditional on-premise resources. For example, if the data load in one of the frameworks in the data engineering pipeline explodes, additional public resources in the cloud can be helpful until the data load levels drop back below a certain threshold. Hybrid options can also be beneficial for high availability and disaster recovery scenarios [211]. Day-to-day production systems can be maintained on-premise while a backup or recovery environment can be moved to the cloud to provide agility in a disaster recovery scenario.
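
The threshold-based use of public resources described above can be sketched as a simple placement rule. The function name, the 80% burst threshold and the abstract load units are assumptions. A real deployment would also account for data transfer cost and for hysteresis around the threshold.

```python
def place_load(load, on_prem_capacity, burst_threshold=0.8):
    """Split a workload between on-premise and public-cloud resources.

    Load above burst_threshold * on_prem_capacity is sent to the cloud.
    Once the load drops back below that limit, the cloud share returns to zero.
    """
    limit = burst_threshold * on_prem_capacity
    return {"on_prem": min(load, limit), "cloud": max(0.0, load - limit)}

normal_day = place_load(load=50, on_prem_capacity=100)
load_spike = place_load(load=120, on_prem_capacity=100)
```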
(xviii) Computing resources for training in wireless networks: Wireless networks also have their own challenges, such as uncertainties in the environment (e.g., dynamic channel, security, congestion, interference, connectivity, network expansion, etc.), limited resources (e.g., transmit power, spectrum) or hardware constraints (e.g., computational power) that make training models difficult [35]. Mobile data is dynamic, distributed over a large geographic area, exhibits changing patterns over time, and has inherent characteristics associated with human mobility, location topology, local culture (e.g., events, festivals), etc. For example, the spatio-temporal behaviour of residents may differ significantly depending on the time of day or week [212]. Some of the devices (e.g., mobile devices) also have limited hardware capacities and cannot train complex ML/DL models with large datasets.
In complex and large architectures and environments such as 5G, powerful hardware and software are required to support both training and inference (as data volume and quality become increasingly important) if intelligence is to be built on top of the network infrastructure, as described in survey paper [10] and the articles referenced therein. Therefore, computational and time resources for training processes need to be considered when learning with large datasets especially in wireless applications where patterns change over time [212], [213]. When model training is performed with large distributed datasets on central servers, additional communication and storage costs are incurred and the solution does not scale. An elegant solution is to perform model execution on distributed nodes while ensuring good performance on local data and reducing the load on central servers (e.g., federated learning on wireless networks [214]). To stabilize the training process and accelerate convergence, the optimization process can also be updated as conditions change [215].
Finally, Table 6 provides a summary of the gap analysis described above.

B. CHALLENGES
To reap the benefits of integrating data engineering ecosystem solutions at different layers of the telecommunication network infrastructure, for both telecommunication providers and users, several challenges need to be overcome. Some of the challenges in putting together a data pipeline architecture are related to the following issues:
(i) Inter-working between different programming languages, tools, computation runtimes: Developers and data engineers use a variety of tools and programming languages (Python, Java, Scala, R, Julia, SAS, etc.). Using multiple languages often increases the cost of effective testing and leads to difficulties in transferring responsibility to others. For example, some message queuing systems such as RabbitMQ [50] or Kafka [46] can be implemented in Java, some other data modules such as Apache Spark are written and work best in the Scala programming language (there is also support for Java and Python), most ML algorithms are better supported by the Python programming language and libraries, and user web applications can be written in the C# programming language.
For this reason, the field of data tools and systems is inherently heterogeneous, diverse and fragmented, as multiple workflows are involved in the process of creating data engineering pipelines. Therefore, supporting multiple languages and decoupling the components of the data engineering pipeline can be critical to reducing the overall complexity of the system and accommodating heterogeneity. Furthermore, it is desirable that entire layouts of existing frameworks in a data engineering pipeline are language-independent and provide software abstractions. In summary, integration with other systems is an ever-growing area and requires overarching tools when data connections between different frameworks are required.
(ii) Choosing the right toolset: Big Data can be categorized under "7 Vs": volume, velocity, variety, variability, veracity, visualization and value [216]. So, depending on the use case and the different industry requirements, either one or more of these Vs may be important. When selecting the appropriate data engineering tools from a variety of design solutions, the important parameters to optimize may be velocity and veracity for URLLC applications (e.g., telemedicine and autonomous driving), volume for eMBB applications (e.g., remote metering), or veracity of data for mMTC applications. At the same time, to manage the complex workflows and the needs of different stakeholders demanding various network services, a comprehensive list of tools, platforms and frameworks should be used based on the different characteristics of the data sources and the requirements of the data processing.
For example, in data ingestion and transformation, Apache Storm [55] can be used for high volume real-time data, Apache NiFi can be used for medium volume real-time data, and Sqoop can be used for batch data with low latency requirements. In addition, extensive comparisons of some of the latest message queueing systems (e.g., Kafka, RabbitMQ, RocketMQ, ActiveMQ, and Pulsar) have shown that Kafka can be used for higher throughput, RabbitMQ is more suitable for lower latency, while RocketMQ can provide both low latency and high quality of service for applications and services [68]. Some tools, such as Apache Druid, only allow querying a single data set, so joining with multiple other data sources is not possible. Since in such scenarios it is not optimal to combine all data sources into a single data source, e.g., due to the nature of the different services producing data, other custom tools such as Presto can be used for these purposes.
Table 6 (continued). Summary of the gap analysis.

| Gap | Description |
|---|---|
| (x) New architecture for event driven applications | A balance between a data-driven architecture (that transfers large amounts of data between applications) and an application architecture (that ensures the functionality of one application is executed in response to a request from another application). |
| (xi) Ecosystem integration | The tools and libraries selected from the data engineering ecosystem should be well integrated with the broader telecommunication infrastructure based on the use cases and requirements. |
| (xii) Licensing | The gap between the interplay of open source and vendor-locked systems deployments is an ongoing issue and needs to be further explored. |
| (xiv) Lack of rigorous methodology in networking | A more rigorous and scientific approach is required when designing AI/ML systems in the area of complex and large-scale telecommunication systems. |
| (xv) Hybrid approach to data operations | Depending on the use case, a hybrid approach is required, comprising a flexible and distributed analytics architecture where some necessary data processing is performed at the device level and/or some partial processing is performed at the network level. |
| (xvi) Practical aspects versus system complexity | Consideration of a trade-off between accuracy and computational cost (e.g., computational cost for inference, time for hyperparameter optimization) when deploying data engineering solutions for telecom specific use cases. |
| (xvii) Cloud vs. on-premise infrastructure | The different deployment options (e.g., cloud (public), on-premise (private), or hybrid cloud) and the corresponding trade-offs should be thoroughly analyzed when deploying data engineering pipelines. |
| (xviii) Computing resources for training in wireless networks | Computational and time resources for training processes need to be considered when learning with large datasets especially in wireless applications where patterns change over time. |

As another example, when developing a streaming application, there is an inherent trade-off between data quality and data speed. To provide a fault-tolerant and scalable system with exactly-once guarantees, various platforms such as Spark's Structured Streaming and Delta Lake can be used. For out-of-order data processing, Flink's data stream processing is an ideal candidate. Some OLAP solutions designed for Big Data, such as ClickHouse, are only designed for fast queries over large data sets and do not support real-time record-by-record ingestion. Only after integration with a streaming platform such as Kafka is real-time data streaming possible, allowing ClickHouse to act as a message consumer. Therefore, depending on the required reliability of the request (either streaming or batch) and possible trade-offs in performance, data engineers need to choose different tools.
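
What out-of-order processing means in practice can be illustrated with a small event-time windowing sketch. Events are assigned to tumbling windows by the timestamp at which they were created, not by their arrival order. The tuple layout `(event_time, payload)` and the 10-second window are assumptions. Watermarks, late-data policies and state management, which engines such as Flink provide, are deliberately left out.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per window using the event timestamp, not the arrival order."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Events listed in arrival order; the third one (created at t=7) arrives late.
arrivals = [(3, "a"), (12, "b"), (7, "c"), (15, "d")]
per_window = tumbling_window_counts(arrivals, window_seconds=10)
```

With processing-time windows, the late event would be counted in whichever window happened to be open when it arrived. With event time it is attributed to the window starting at 0, where it belongs.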
Given all these different options, it can be difficult to find a suitable set of tools for building a data pipeline. The choice depends on numerous factors, such as the analysis results of the pros and cons of the tools or the understanding of their suitability for the use cases under consideration. Ideally, the selected tools should not be tied to a specific vendor, should be supported by a large community, should have clear documentation, should be easy to integrate with the rest of the platform, and should be independent of various software, including cloud services and third-party vendors.
(iii) Support for containerization: There is a growing need for support for containerization to build flexible, service-oriented and cloud-native applications [217]. The general trend is to build services using infrastructures such as Kubernetes clusters (to enable production-grade container orchestration). A production system would run multiple machines, each with hundreds of containers that can be restarted, rescheduled or terminated at any time.
As an example of a scale-out architecture, one container-based microservice can be exposed through REST APIs over Hypertext Transfer Protocol (HTTP), another can be accessed using Protobuf and gRPC, and a third with real-time streaming requirements can expose its microservice via WebSocket APIs. Therefore, using frameworks such as Kubernetes to deploy containers/microservices provides flexibility in deployment, ease of automation, movement, and scaling.
On the other hand, although modern open source projects such as Pulsar, Spark or Flink provide native support for Kubernetes, there are still many components in the Hadoop ecosystem that have not moved away from YARN or do not provide standard support (e.g., Kafka). For example, automatically resizing stream processing jobs in a container (scaling up/down, scaling out/in) depending on lags or other performance parameters is currently a challenging problem.
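To make the lag-driven resizing problem concrete, the following is a minimal sketch of a proportional scaling rule. The function name, the parameters, and the idea of a fixed per-replica lag target are assumptions made for illustration; they are not taken from any specific stream processor or orchestrator.

```python
import math


def desired_replicas(total_lag: int, target_lag_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Return a replica count that keeps the per-replica lag near the target."""
    if target_lag_per_replica <= 0:
        raise ValueError("target_lag_per_replica must be positive")
    # Proportional rule: enough replicas so each handles at most the target lag.
    wanted = math.ceil(total_lag / target_lag_per_replica)
    # Clamp to the allowed range to avoid scaling to zero or without bound.
    return max(min_replicas, min(max_replicas, wanted))
```

Such a rule ignores practical difficulties such as state redistribution and rebalancing pauses when a stateful streaming job is rescaled, which is part of why the problem remains challenging.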
(iv) Lack of a unified framework for data processing and analysis: In a general data engineering pipeline, online and offline data processing are handled in separate pipelines, each using different computing engines such as Kafka, Spark Streaming, Flink, Hive, Map-Reduce, etc. However, this can add maintenance overhead for enterprise development teams. Inside a telecommunication operator, a variety of data analytics nodes and tools are deployed in various sub-units to perform customer experience management, quality of service management, revenue assurance, or user/marketing analytics. Integrating all these separate analytics nodes with traditional systems (e.g., with data visualization/notification applications for reporting, with network management and orchestration tools for service automation) in a single framework is a difficult task.
In data engineering, some frameworks such as Spark or Flink can provide a single, unified data engineering pipeline solution for both online real-time and offline data. However, to generalize application development, some other computational patterns such as distributed training, model serving, streaming, distributed data processing, distributed reinforcement learning, etc. need to be implemented as libraries in addition to these frameworks. Although there are frameworks that provide a unique set of abstractions and unified APIs for both batch and stream processing jobs, an advanced data engineering pipeline cannot be consolidated into a single unified framework that performs general distributed computation, online multi-stream processing, window operations, stateful analysis and DL simultaneously.
For example, Kafka may outperform Spark Streaming at data ingestion, while additional data processing such as multi-stream joins or generating additional features for online and offline data is more easily performed with Spark than with Kafka. Similarly, support for some libraries (e.g., a current DL framework) may be excluded from the mainstream development process due to lack of resources, suitable use cases or interest in the community (e.g., because industry needs are not yet mature enough) and because it is very time-consuming to append a new framework to the overall AI/ML stack.
(v) Use of multiple AI/ML frameworks: In many organizations, it is common to use multiple systems and frameworks for different workloads. For example, in data storage, a data lake, many data warehouses, custom specialty databases for graphs, streaming, time-series databases, etc. are common practice. At the same time, some of the emerging areas like DL are advancing very quickly and, depending on the task at hand, different DL frameworks can be more effective. While it is easy to experiment with a new framework, it is very expensive to add production support for each new DL library. In cross-domain applications where many of these different computational frameworks or patterns need to be combined, serious challenges can arise. For example, reinforcement learning or some online learning applications require processing data streams as well as training and deploying models, which may exceed the limits of special-purpose integrated systems. This, of course, increases complexity.
In practice, one way to overcome this problem is to find a way to connect the different frameworks together to create applications that are independent of any particular framework. Another way is to build a new system from scratch that supports the functionalities of these frameworks with simple APIs for new algorithms, creating general-purpose systems. However, both approaches have their own pros and cons. For example, when merging different systems, it is not efficient to move data between frameworks, which can lead to additional overhead and inflexibility (e.g., inferences drawn by the system cannot be updated frequently due to model update difficulties). In addition, the learning curve of all these different frameworks can be steep.
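The first option, writing applications that do not depend on any particular framework, can be sketched with a small adapter interface. Everything below is illustrative: `Predictor`, `LinearModelAdapter` and `ThresholdRuleAdapter` are hypothetical names, and the two adapters are simple stand-ins for wrappers around models trained in different frameworks.

```python
from typing import List, Protocol, Sequence


class Predictor(Protocol):
    """Minimal framework-neutral interface the application codes against."""

    def predict(self, features: Sequence[float]) -> float: ...


class LinearModelAdapter:
    """Stand-in for a wrapper around a model trained in some framework."""

    def __init__(self, weights: Sequence[float], bias: float) -> None:
        self._weights = list(weights)
        self._bias = bias

    def predict(self, features: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(self._weights, features)) + self._bias


class ThresholdRuleAdapter:
    """Stand-in for a second backend: a hand-written rule on the first feature."""

    def __init__(self, threshold: float) -> None:
        self._threshold = threshold

    def predict(self, features: Sequence[float]) -> float:
        return 1.0 if features[0] > self._threshold else 0.0


def score_all(model: Predictor, rows: Sequence[Sequence[float]]) -> List[float]:
    """Application code depends only on Predictor, not on a concrete backend."""
    return [model.predict(row) for row in rows]
```

The adapter hides the backend, but it does not remove the data movement cost between frameworks noted above; it only keeps that cost out of the application logic.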
On the other hand, designing and developing a new system from scratch and moving to a new general-purpose system can require a great deal of engineering effort for new application development processes. Despite these challenges, many organizations today are developing their own internal data platforms consisting of a variety of open-source tools and frameworks, rather than relying on closed proprietary systems.
(vi) Data security and privacy: Legal compliance, encryption, key management and data governance & integrity are the main pillars of data security and privacy. If not managed properly, a large data set distributed across an enterprise can cause major headaches for data owners in terms of security, authentication, authorization and information integrity. With strict regulatory requirements (e.g., the GDPR (General Data Protection Regulation) in Europe) preventing data from being moved to the cloud, many organizations are looking for and investing in tools that can allow only authorized individuals to manage sensitive data on-premises. At the same time, data sources cannot always be trusted, which can lead to gaps in the system. For this reason, ensuring data security is also crucial in model training and validation. The accuracy and integrity of the data set must be ensured by avoiding data collection from faulty or compromised network nodes/users to protect against unfavorable data sets.
All data breaches must be detected as soon as possible. For this reason, data stream processing is ideal for developing security applications that can respond immediately. For example, in a typical enterprise, bot and scraper detection or access monitoring are important requirements that can be met with available stream processing platforms that provide state management and checkpointing capabilities. Real-time data should also comply with privacy regulations, similar to data stored in data warehouses, data lakes, or traditional data stores. Some companies such as Confluent already offer new connectors, such as the Privitar Kafka connector, which improves the value of streaming data assets without compromising user privacy *. Data stream processors such as Splunk DSP can also mask sensitive data.
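As an illustration of masking sensitive fields in a stream before the records reach downstream consumers, the following minimal sketch replaces selected fields with a salted hash. The field names (`msisdn`, `imsi`) and the salting scheme are assumptions chosen for the example; this is not the mechanism of any particular connector or stream processor, and a salted hash is pseudonymization rather than full anonymization.

```python
import hashlib
from typing import Any, Dict, Iterable, Iterator

SENSITIVE_FIELDS = ("msisdn", "imsi")  # hypothetical field names


def mask_record(record: Dict[str, Any], salt: str) -> Dict[str, Any]:
    """Return a copy of record with sensitive fields replaced by a salted hash."""
    masked = dict(record)  # leave the caller's record untouched
    for field in SENSITIVE_FIELDS:
        if field in masked:
            digest = hashlib.sha256((salt + str(masked[field])).encode("utf-8"))
            masked[field] = digest.hexdigest()[:16]
    return masked


def mask_stream(records: Iterable[Dict[str, Any]], salt: str) -> Iterator[Dict[str, Any]]:
    """Apply mask_record lazily to every record of a stream."""
    for record in records:
        yield mask_record(record, salt)
```

Because the same input and salt always give the same token, masked records can still be joined or counted per subscriber without exposing the raw identifier.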
(vii) Event streaming support: Traditional event streaming in OSS uses various protocols such as Simple Network Management Protocol (SNMP) for routers and service gateways, gRPC and protobuf (Google's binary Protocol Buffers format) for telemetry, or syslog events for soft switches for monitoring purposes. However, these various vendor-specific protocols also bring some challenges, including complexity in real-time analysis and multiple data semantics and naming conventions across different device types and data sources. In BSS systems, there are also various challenges related to different systems for broadband, mobile and fixed services, technology stacks (fiber, copper, 4G/5G, etc.), and other Value Added Services (VASs). Several BSS system components need to be extended to include services such as recommendations, augmented reality, payment integration, etc., and integration with legacy middleware components of CRM systems (ETL, Enterprise Service Bus (ESB)) also needs to be done.
* https://www.confluent.io/hub/privitar/privitar-kafka-connector, accessed December-2021
Together with data ingestion frameworks, these disparate data streams can be normalized to a common schema facilitating real-time analysis and display of the global network infrastructure. Moreover, data ingestion frameworks can be used to achieve asynchronous communication between components and interoperability between different service providers of OSS and BSS systems. On the other hand, streaming support for model serving purposes is an important feature when selecting data processing frameworks. Although most model serving applications are based on REST, it is not desirable to use REST inside streaming applications and make many calls outside the execution environment. For this reason, new libraries (such as Flink Tensorflow [218]) are gradually emerging that can support streaming model serving.
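Normalizing disparate event streams to a common schema can be sketched as a set of small per-source parsers that all produce the same record type. The `NetworkEvent` fields and the input shapes assumed for SNMP and syslog below are hypothetical and far simpler than real vendor formats.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class NetworkEvent:
    """Common schema shared by all sources (field choice is an assumption)."""
    source: str   # originating protocol, e.g. "snmp" or "syslog"
    device: str
    metric: str
    value: float


def from_snmp(raw: Dict[str, str]) -> NetworkEvent:
    # Assumed input shape: {"agent": ..., "name": ..., "value": ...}
    return NetworkEvent("snmp", raw["agent"], raw["name"], float(raw["value"]))


def from_syslog(line: str) -> NetworkEvent:
    # Assumed line shape: "<host> <metric>=<value>"
    host, payload = line.split(" ", 1)
    metric, value = payload.split("=", 1)
    return NetworkEvent("syslog", host, metric, float(value))
```

Once every source is mapped to one record type, downstream analysis and dashboards no longer need to know which protocol or vendor produced an event.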
(viii) Architecture decisions: Various data architecture decisions for streaming and batch processing involve trade-offs, and organizations tend to choose the option that offers more flexible scaling and lower operational overhead with high availability and reliable performance. Some important considerations when choosing a data architecture are scalability, operability (which is more difficult with stream processing jobs due to potential lags), bridging both offline and online scenarios (especially useful for applications using active learning), and ease of data access and movement (due to the inherent semantic differences in the data). Different data architectures are available depending on the use case and SLAs. Table 7 summarizes the descriptions and drawbacks of these different available architectures.
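Among the architectures summarized in Table 7, the Kappa approach replays stored data as a stream so that one codebase serves both historical and live data. A minimal sketch of that idea follows; the event fields (`cell`, `bytes`) and function names are hypothetical, and an in-memory list stands in for the stored table.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator, List


def running_totals(events: Iterable[dict]) -> Iterator[Dict[str, int]]:
    """Single processing function used for both replayed and live events."""
    totals: Dict[str, int] = {}
    for event in events:
        totals[event["cell"]] = totals.get(event["cell"], 0) + event["bytes"]
        yield dict(totals)  # snapshot after each event


def replay_then_follow(table: List[dict], live: Iterable[dict]) -> Iterator[dict]:
    """Turn a stored table into a stream, then continue with live events."""
    return chain(iter(table), live)
```

By contrast, a Lambda-style design would keep a separate batch job for the stored table and a streaming job for the live events, which is the source of the two-codebase maintenance drawback.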
(ix) Operational complexity: Selecting data engineering systems that can work in a single system reduces operational complexity. However, no single platform, system or compute runtime can handle the entire underlying heterogeneous data engineering infrastructure. This is because there are different data types (big data or small data, graph or log data, etc.) and access patterns (streaming or parallel) in the data landscape or ecosystem. Also, the introduction of new technologies requires new, well-trained people who can handle the sheer growth of the technology stack.
However, managing and deploying data tools is getting easier by the day. The latest data engineering tools greatly abstract and simplify workflows, allowing data engineers to focus on selecting the simplest and most cost-effective solutions that deliver the greatest value to the business. As a result, these incremental developments in the data tooling landscape are expected to significantly reduce the operational complexity of deploying future data architectures.

Table 7

Architecture option: Lambda
Description & advantages:
- Suggested by Nathan Marz and used to support systems that require both streaming and batch pipelines [219].
- The aim is for streaming systems to help batch systems get as close as possible to real-time.
- Enables reliable streaming pipelines to be established.
Drawbacks:
- May cause maintenance problems because two separate codebases (one for streaming and another for batch) must be kept consistent during software updates and fixes.
- Increased complexity and cost, as well as inaccuracy due to potential loss or duplication of data.
- May cause system integration and data centralization difficulties.
- Not suitable for use cases requiring correct and low-latency results.

Architecture option: Kappa
Description & advantages:
- Suggested by Jay Kreps; addresses some of the challenges and limitations of the Lambda architecture.
- Replays the data from a structured data source into a stream (e.g., turning tables into unbounded streams) to provide unified processing capabilities.
- Unifies both batch and streaming pipelines under a unified codebase and facilitates the use of batch and streaming data to drive business innovation.
Drawbacks:
- Requires a fast stream processing engine.
- Not efficient storage for large data sets.
- Less efficient processing capabilities for batch.
- Limited cloud-native support.

(x) Batch computing challenges: Most frameworks that rely on batch data ETL using SQL or SQL-like functionality can be difficult to integrate when complex logic is required compared to simple low-level programming. Batch computing queries become problematic when resources are limited because aggregations are not additive when new elements are added to the computed results. As a result, batch computing can also be inflexible and difficult to manage, which can lead to errors.
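One way to read the statement about non-additive aggregations is that a finished result, such as a mean, cannot simply be combined with the result for newly arrived data; the batch has to be recomputed unless a mergeable intermediate state is kept. The sketch below illustrates this with a mean stored as a (sum, count) pair. It is an interpretation for illustration, not a description of any particular batch engine.

```python
from typing import List, Tuple

State = Tuple[float, int]  # (sum of values, number of values)


def summarize(batch: List[float]) -> State:
    """Reduce a batch to a mergeable intermediate state."""
    return (sum(batch), len(batch))


def merge(a: State, b: State) -> State:
    """Combine two states without revisiting the raw data."""
    return (a[0] + b[0], a[1] + b[1])


def mean(state: State) -> float:
    return state[0] / state[1]


old_batch = [10.0, 20.0, 30.0]
new_batch = [100.0]

# Averaging the two finished means gives the wrong answer ...
naive = (mean(summarize(old_batch)) + mean(summarize(new_batch))) / 2
# ... while merging the intermediate states gives the true overall mean.
correct = mean(merge(summarize(old_batch), summarize(new_batch)))
```

Aggregations such as medians or distinct counts have no such compact exact state, which is where recomputation over the full data set becomes costly under limited resources.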
When using third-party solutions to improve efficiency, utilization, and performance, some dependencies and versions (e.g., jobs that depend on Spark or Hive versions) can be difficult to integrate due to heterogeneity in workloads (e.g., analytic, transactional), infrastructure (e.g., cloud, on-premises), deployments (e.g., Kubernetes, bare-metal nodes, custom Platform as a Service (PaaS)) and data pipeline environment, increasing operational overhead. To overcome these challenges, frameworks such as Kubernetes with its containerized approach can be used.
(xi) System stability: System stability depends on both data consistency and the longevity of the systems used. If the same data engineering pipeline cluster is used by multiple use cases, the entire cluster may become unstable during sudden traffic spikes. For example, a throughput-intensive application may impact or slow down the data availability of another application or service if the pipelines are not adequately planned.
Solving data consistency problems caused by multiple systems is a difficult engineering challenge. Data inconsistencies can lead to data loss or duplication of data. Bringing these events to a consistent state requires additional effort from the data and operations engineering teams. As a platform for data orchestration and management, the multitenancy support in Kubernetes aims to enable such isolation and fair resource sharing between multiple use cases so that their workloads can be reliably shared in a single cluster. However, this approach should also be extended to the E2E components of the entire data engineering pipeline.
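Duplication, one of the inconsistencies mentioned above, is often handled by making consumers idempotent. The following minimal sketch drops events whose identifier has already been seen; the `event_id` field is an assumption, and the unbounded in-memory set is only suitable for illustration (a production system would bound or persist this state).

```python
from typing import Dict, Iterable, Iterator, Set


def deduplicate(events: Iterable[Dict], key: str = "event_id") -> Iterator[Dict]:
    """Yield each event once, based on a unique identifier field."""
    seen: Set[str] = set()
    for event in events:
        if event[key] in seen:
            continue  # duplicate delivery, e.g. after a retry
        seen.add(event[key])
        yield event
```

Deduplication addresses duplicates only; preventing data loss additionally requires durable storage and acknowledgement on the producer and broker side.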
(xii) Data sharing: Many telecommunication service providers struggle to understand and identify what data should be shared, while ensuring regulatory compliance and defining/understanding the standardized interfaces for data sharing. Implementing a secure and distributed approach to data sharing between different network nodes is critical. In future networks, for example, many telecommunication service providers will have to share part of their infrastructure with each other. This will also force them to share critical infrastructure-related information for better network management and orchestration [220].
Traditional APIs used for data sharing can perform poorly (a slow-working API can be a bottleneck for the service) or converting legacy services to an API-based service can be costly. Monolithic databases where every user retrieves the data can lead to a single point of failure and scalability issues. Big Data transfers can also have consistency issues due to the longer duration of data transfers when data is shared.
To address these data sharing issues in a scalable way and improve the quality of data sharing with third-party vendors or internal departments of an organization, investments can be made in building standardized interfaces for accessing relevant data and event-based applications and architectures for complex events. Another possible solution for data sharing is to integrate the latest developments in blockchain-based systems into telecommunication networks [221] so that intelligence and data can be shared between the owners of the individual network domains in a secure and reliable manner.
(xiii) Coexistence with non-AI/ML capable systems: In a traditional telecommunications infrastructure, not all deployed equipment will be intelligent. In some cases, the AI/ML-enabled systems will need to interact with non-AI/ML-enabled systems. For example, in some situations, some of the UEs/edge devices may be used for model training while others act as normal mobile/edge devices. In this scenario, the AI/ML platform should be able to distinguish AI/ML-enabled nodes. This can help AI/ML nodes to participate in distributed model training or model serving and guard against unexpected interventions (e.g., interference, traffic, congestion, etc.) by non-AI/ML nodes.
Moreover, human-motivated actions or misleading behaviours performed on nodes not operated by AI/ML may negatively affect the learning process of AI/ML-enabled nodes. On the other hand, in some cases, especially during the autonomous learning process (e.g., during the exploration phase of reinforcement learning algorithms), the actions performed by AI/ML-enabled devices are unreliable. To avoid unexpected behaviour of telecommunication systems in this case, non-AI/ML-enabled nodes can be used until the learning process is successfully completed.
(xiv) Small dataset: One of the components for building a data engineering pipeline is the data storage ecosystem, in which HDFS plays a key role. However, HDFS also brings practical limitations when storing a large number of small files. Applications that rely on HDFS (e.g., Spark jobs) slow down because they spend most of their time on I/O operations instead of on data processing or analysis. For this reason, the size of files that need to be stored in databases, or the amount of data that can be sent to a web service that feeds the data into the database, must be adjusted accordingly via configuration parameters. A possible solution to this problem in Hadoop would be to store data in SequenceFile format, where each small file is stored in a larger single file.
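The idea behind packing many small files into one larger container file can be sketched with a simple length-prefixed key/value layout, where the key is the original file name and the value is its content. This is only an analogy written with the Python standard library; it is not the actual Hadoop SequenceFile binary format, and the function names are hypothetical.

```python
import io
import struct
from typing import BinaryIO, Dict, Iterator, Tuple

_HEADER = ">II"  # big-endian lengths of key and value


def pack(files: Dict[str, bytes], out: BinaryIO) -> None:
    """Append each small file as a (name, content) record to one container."""
    for name, content in files.items():
        key = name.encode("utf-8")
        out.write(struct.pack(_HEADER, len(key), len(content)))
        out.write(key)
        out.write(content)


def unpack(src: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Read the records back in the order they were written."""
    header_size = struct.calcsize(_HEADER)
    while True:
        header = src.read(header_size)
        if len(header) < header_size:
            return  # end of container
        key_len, value_len = struct.unpack(_HEADER, header)
        name = src.read(key_len).decode("utf-8")
        yield name, src.read(value_len)
```

For example, writing to an `io.BytesIO()` buffer with `pack` and then reading it with `unpack` after `seek(0)` returns the original name/content pairs, while the storage layer sees a single large object instead of many small ones.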
(xv) Testing distributed systems: Testing a distributed system with multiple components operating in spatially separated locations is usually difficult and complicated. In a typical data engineering workflow used to develop a system, determining where the system fails requires additional investigative work as the number of components involved in the data pipeline grows. For a typical system, several proven types of traditional software testing must be performed before the system is put into production. Similar procedures can be used when testing data engineering pipelines. These procedures are: (i) unit testing (to test a small part or subset of the functionality of a data engineering component), (ii) regression testing (to reproduce previously found bugs and verify their fixes), (iii) integration testing (to test the system when individual data engineering components are integrated with each other), (iv) E2E testing (to test the full functionality of a data engineering system in a staging environment) and (v) stress testing (to test the scalability limits of the data engineering system on a large scale, e.g., number of users supported, data traffic support, number of commands executed, etc.).
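As an example of the first procedure, unit testing, the sketch below tests one small pipeline transformation in isolation using Python's standard `unittest` module. The transformation `drop_invalid_kpis` and the field name `throughput_mbps` are hypothetical and chosen only for illustration.

```python
import unittest
from typing import Dict, List, Optional


def drop_invalid_kpis(rows: List[Dict[str, Optional[float]]]) -> List[Dict[str, Optional[float]]]:
    """Keep only rows with a non-missing, non-negative throughput value."""
    return [
        row for row in rows
        if row.get("throughput_mbps") is not None and row["throughput_mbps"] >= 0
    ]


class DropInvalidKpisTest(unittest.TestCase):
    def test_removes_missing_and_negative_values(self) -> None:
        rows = [
            {"throughput_mbps": 12.5},
            {"throughput_mbps": None},
            {"throughput_mbps": -1.0},
            {},
        ]
        self.assertEqual(drop_invalid_kpis(rows), [{"throughput_mbps": 12.5}])

    def test_keeps_zero(self) -> None:
        self.assertEqual(drop_invalid_kpis([{"throughput_mbps": 0.0}]),
                         [{"throughput_mbps": 0.0}])
```

Saved in a module, the test case can be run with `python -m unittest`; integration, E2E and stress tests would exercise the same function together with the surrounding ingestion and storage components.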
(xvi) Lack of standardization: There are many data-driven telecommunication use cases in the standardization community, but no implementation details for real-world telecommunication network environments (e.g., in 5G and beyond). This can lead to significant challenges in building robust data engineering pipelines when extending enterprise building blocks. Although some strategies are presented in [222], they are limited to procedures of BDA techniques in IT with limited applications in the network domain. Extensive standardization efforts are therefore needed to generalize these concepts for use by a large community.
(xvii) Backlogged data pipelines: A common challenge for all pipelines is the delays that occur in the ingestion pipeline. Note that in telecommunication networks, especially in 5G and beyond mobile networks, SLAs are very strict when time-critical AI/ML-based decisions need to be made. Thus, if any component of the data pipeline fails, it can lead to serious SLA misses. In networks, for example, the packet transmission times are on the order of milliseconds and the inference time of the developed models should be an order of magnitude shorter, otherwise there will be overhead as traffic increases.
(xviii) Limited data availability: Unavailability of sufficient data for AI/ML training and model building purposes is a major challenge in almost all industrial use cases [223]. Every time new data emerges, it also brings new knowledge and hence needs to be integrated into the training process. This is also true for the telecommunication industry. At the same time, recent advances such as semi-supervised learning, federated learning or active learning can help to create larger training datasets or to introduce new knowledge into the training process.
For a good summary of existing ML approaches that work with limited data, see [224]. Among these algorithms, semi-supervised learning aims to train ML models that use both labelled and unlabelled data [225]. These methods combine a large amount of unlabelled data with a comparatively small amount of labelled data to achieve an optimal result. For example, the labelling of unlabelled examples can be done by a semi-supervised algorithm based on their proximity to known labelled examples. The main advantage of this approach is that it generates additional labelled data that can be used to train the ML model. Therefore, semi-supervised learning is particularly beneficial for scenarios where more training data is needed.
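The proximity-based labelling idea can be sketched as follows: each unlabelled point receives the label of its nearest labelled neighbour, but only if that neighbour is close enough. The distance threshold, the Euclidean metric and the function name are assumptions for illustration; practical semi-supervised methods are considerably more elaborate.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def pseudo_label(labelled: List[Tuple[Point, str]],
                 unlabelled: List[Point],
                 max_distance: float) -> List[Tuple[Point, str]]:
    """Assign the nearest labelled neighbour's label to nearby unlabelled points."""
    result: List[Tuple[Point, str]] = []
    for x in unlabelled:
        nearest_point, nearest_label = min(
            labelled, key=lambda pair: math.dist(pair[0], x)
        )
        # Skip points that are too far from any known example to label safely.
        if math.dist(nearest_point, x) <= max_distance:
            result.append((x, nearest_label))
    return result
```

The newly labelled pairs can then be added to the training set, which is how the approach generates additional labelled data.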
The concept of federated learning aims to distribute the copies of the ML algorithm to the distributed sites/devices where the data is kept, perform the training iterations locally, and finally send the computational results (e.g., updated neural network weights) to the central repository to update the main algorithm [214]. The main advantage is that the data remains with the owner and the algorithms can still be trained on the distributed data. Active learning aims to reduce the amount of data required for human labelling [226]. In this learning method, a query-based strategy is used to select the most informative examples that a human operator can label. Once new examples are labelled, the ML model is updated based on the newly labelled examples and this process is repeated to train the model and improve performance.
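The aggregation step of federated learning, in which the central repository combines locally computed results, can be sketched as a weighted average of the clients' weight vectors. Weighting each client by its number of local samples is an assumption taken from the commonly used federated averaging scheme; the text above does not prescribe a specific aggregation rule.

```python
from typing import List, Sequence, Tuple


def federated_average(updates: List[Tuple[Sequence[float], int]]) -> List[float]:
    """Combine (weights, n_samples) pairs from clients into one global weight vector."""
    total_samples = sum(n for _, n in updates)
    if total_samples == 0:
        raise ValueError("at least one client must report samples")
    dim = len(updates[0][0])
    return [
        sum(weights[i] * n for weights, n in updates) / total_samples
        for i in range(dim)
    ]
```

Only the weight vectors and sample counts travel to the aggregator; the raw training data never leaves the client, which is the property highlighted above.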
In telecommunications, unlabeled instances can be selected for active labeling. For example, when it is difficult to obtain enough labelled network fault data to find the root cause of faults in cellular networks, an active learning strategy can be used [227], [228]. In semi-supervised learning cases, auto-encoder based approaches can be used to find the root cause of faults in cellular networks [184]. Finally, federated learning in wireless communication allows each UE to build local federated learning models based on their local measurements and send them to BSs to build a global federated learning model [229].
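A query-based selection strategy for active labelling can be sketched with uncertainty sampling for a binary classifier: the instances whose predicted probability is closest to 0.5 are sent to the human operator first. The choice of uncertainty sampling and the function name are assumptions for illustration; the cited works may use different query strategies.

```python
from typing import List, Sequence


def most_uncertain(probabilities: Sequence[float], k: int) -> List[int]:
    """Return indices of the k instances the model is least sure about."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),  # 0.5 means maximal uncertainty
    )
    return ranked[:k]
```

After the operator labels the selected instances, the model is retrained and the selection is repeated on the remaining unlabelled pool.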
Finally, Table 8 provides a summary of the challenges described above.

Table 8
(i) The field of data tools and systems is inherently heterogeneous, diverse and fragmented, as multiple workflows are involved in the process of creating data engineering pipelines.
(ii) Choosing the right toolset: It can be difficult to find a suitable set of tools for building a data pipeline given different use case requirements.
(iii) Support for containerization
(vi) Data security and privacy: A large data set distributed across an enterprise can cause major headaches for data owners in terms of security, authentication, authorization, privacy and information integrity.
(vii) Event streaming support: Traditional event streaming in OSS relies on various vendor-specific protocols, which also bring their own challenges.
(xii) Data sharing: Implementing a secure and distributed approach to data sharing between different network nodes is challenging.
(xiii) Coexistence with non-AI/ML capable systems: When AI/ML-enabled systems need to interact with non-AI/ML-enabled systems, the AI/ML platform should be able to distinguish AI/ML-enabled nodes.
(xiv) Small dataset: The size of files that need to be stored in databases, or the amount of data that can be sent to a web service that feeds the data into the database, must be adjusted accordingly.
(xv) Testing distributed systems: In a typical data engineering workflow used to develop a system, determining where the system fails requires additional investigative work as the components involved in the data pipeline grow larger.
(xvi) Lack of standardization: There are many data-driven telecommunication use cases in the standardization community, but no implementation details for real-world telecommunication network environments.
(xvii) Backlogged data pipelines: Delays that occur in the ingestion pipeline can cause problems in SLAs in telecommunication networks when time-critical AI/ML-based decisions need to be made.
(xviii) Limited data availability: Unavailability of sufficient data for AI/ML training and model building purposes is a major challenge.

C. FUTURE DIRECTIONS AND ROAD AHEAD
Today, we have enormously large datasets, increased computing power (GPUs, cloud, etc.), extensive open source software tools and increased industry investment, as well as a large community developing new data science/engineering applications and services. Similarly, data applications are attracting a large number of users and the data engineering ecosystem has the potential to support an even larger number of users. The involvement of the telecommunication industry in the data value chain would provide strategic business value to telecommunication infrastructure and service providers. For this reason, telecommunication providers are looking forward to interacting with and benefiting from the data engineering ecosystem more frequently, as it is open source, royalty-free and community-driven.
At the same time, there is still much to be done to develop, deploy, operate, debug/test and extend data applications within the telecommunication infrastructure. It is critical to identify the applications, services, and products that will benefit most from transformations of data engineering in the networks and IT organizations of telecommunications providers. When designing a data engineering pipeline in an enterprise based on use case requirements, both telecommunication and data engineering experts should be consulted, as their perspectives on each use case differ and shared ideas can be of great benefit.
Data infrastructure is already undergoing a significant architectural change [230]. Traditional data warehouses are moving from on-premise to cloud-based data warehouses to increase scalability, wide expansion, flexibility and ease of use (e.g., e-commerce data migration case to Google BigQuery *), and next-generation data lakes are beginning to include more ACID-like features and interactive SQL query capabilities (e.g., Presto [85]). More flexible and consistent Extract-Load-Transform (ELT) pipelines are taking the place of traditional ETL processes (e.g., dbt †). In the area of data management and orchestration, several hundred data pipelines are orchestrated using dataflow automation tools (e.g., Airflow, Dagster ‡). Tools like Superset help provide self-service insights (reports, dashboards, etc.) and are also accessible to non-technical users. Our survey results show that telecommunication providers can move beyond the traditional boundaries of telecommunication networks (e.g., RAN or OSS/BSS operations, etc.) to reap the benefits of deploying data engineering frameworks on an evolving data infrastructure. They can leverage the power of data engineering systems deployed in a distributed, scalable and optimized architecture for their own business needs.
This would also result in lower Operating Expenditures (OPEX) and Capital Expenditures (CAPEX), simplify network deployment and management, and ensure high customer satisfaction in addition to improved value-added services.
For optimized network management and orchestration, the coexistence of the data engineering frameworks described above with traditional systems at different layers of the network infrastructure is critical. The integration of the tools and frameworks of the data engineering ecosystem should serve as a complement to the traditional systems. For example, if the deployed data processing and analysis framework is not able to adequately handle the dynamic changes in the network environment, existing non-AI/ML-based solutions (e.g., predefined, hand-crafted, rule-based approaches that do not consider model-based approaches and take reactive actions based on human experience) can meet the new requirements. This coexistence is also important for security reasons. Enabling such hybrid approaches can help ensure rapid response in production environments.
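The hybrid approach can be sketched as a simple dispatcher that uses the model's decision only when the model is sufficiently confident and otherwise falls back to the existing rule-based action. The confidence threshold and the function name are hypothetical; how confidence is obtained depends on the model in use.

```python
from typing import Tuple


def choose_action(ml_action: str, ml_confidence: float,
                  rule_action: str, min_confidence: float = 0.8) -> Tuple[str, str]:
    """Return (origin, action), preferring the ML decision only when confident."""
    if ml_confidence >= min_confidence:
        return ("ml", ml_action)
    # Fall back to the predefined rule when the model is unsure.
    return ("rule", rule_action)
```

Recording the origin of each decision makes it possible to measure how often the fallback is triggered, which can indicate when the model no longer tracks changes in the network environment.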
Typically, individual service technologies in the telecommunications world are implemented by multiple vendors, and devices running on their networks are usually locked-in and expensive. Compared to telecommunication infrastructure, the data engineering infrastructure is young, innovative, and growing rapidly. Data engineering technologies and platforms are evolving and improving at a rapid pace. In addition, most of the innovative and disruptive technologies being developed in the data engineering community are being released as open source. For this reason, telecommunication systems must be prepared to adopt and deal with these new data engineering technologies rather than remain in legacy systems that cannot take advantage of the data. For example, starting with a simple, small E2E data engineering pipeline in a production environment, rather than working on a more complex data pipeline, can help to avoid numerous mistakes, detect errors early, and solve integration issues with traditional telecommunication infrastructure. In addition, AI/ML solutions do not have to be found for every task and every problem. In many cases, simple solutions such as rule-based systems instead of ML systems can also help to find an intermediate solution. These results can be used later and iterated step by step to collect more data needed for more complex data engineering solutions. Moreover, the functionality of the data engineering pipeline can be progressively extended during this process through a series of iterations. For new and untested frameworks, it makes sense for organizations to use public cloud resources first (e.g., AWS, GCP or Azure) and then move to on-premises resources once a stable definition of the workload pipeline is in place.
* https://cloud.google.com/blog/products/data-analytics/e-commercedata-warehouse-migration, accessed December-2021
† https://www.getdbt.com/, accessed December-2021
‡ https://dagster.io/, accessed December-2021
AI/ML algorithms are predicted to be integrated into telecommunication networks in the next decade. There is a growing number of use cases for real-time data, and the systems that process this data should be mature enough to meet the requirements of telecommunication systems. Network management and orchestration based on data engineering can be used to track evolving traffic patterns, user behaviour, etc., and take these trends into account in the planning and operational phases. It is important to choose a modular system that can cover multiple use cases.
As a starting point, there is no need to reinvent the wheel, as there is a good chance that an existing tool/framework can support initial efforts to integrate AI/ML systems into telecommunication-specific applications (both in IT and network). Familiarity with the data engineering ecosystem and tools/frameworks, diversity of expertise across technologies, and integration skills in bringing disparate pieces together into new telecommunication applications will be very useful for network-focused product and service development teams. Moving forward, a few useful questions to consider are:
• How easily can a new framework or approach be integrated and tested at large scale in the telecommunication infrastructure?
• How accurately can the impact of new changes/updates in the data engineering pipeline be measured in the telecommunication infrastructure to avoid system complexity, poor resource utilization, or degradation of the KPIs for a particular service?
• Does the improvement of a framework in the pipeline affect or degrade other components in the data engineering pipeline and telecommunication infrastructure while keeping maintenance tasks at a low level?
• How does the addition of each framework in the data engineering pipeline in an integrated environment impact an organization's policy standards (e.g., data storage, compliance and regulations, security, network, operations, management, data traffic flow, servers, workloads, legacy applications, or reporting)?
• How quickly can the network engineers of the telecommunication world and the data engineers of the data engineering world be brought together to accelerate the process?
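As one possible way to approach the question of measuring KPI degradation after a pipeline change, the following minimal sketch compares KPI samples collected before and after an update and flags those that worsen beyond a tolerance. The KPI names, the 5% tolerance, and the sample values are hypothetical and chosen only for illustration; a production check would likely need statistical testing over longer observation windows.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical KPI definitions: True means "higher is better".
HIGHER_IS_BETTER: Dict[str, bool] = {
    "throughput_mbps": True,
    "latency_ms": False,
}


def relative_change(baseline: List[float], candidate: List[float]) -> float:
    """Relative change of the candidate mean with respect to the baseline mean."""
    base = mean(baseline)
    return (mean(candidate) - base) / base


def degraded_kpis(
    baseline: Dict[str, List[float]],
    candidate: Dict[str, List[float]],
    tolerance: float = 0.05,
) -> List[str]:
    """Return the KPIs whose mean worsened by more than `tolerance`."""
    flagged = []
    for name, higher_is_better in HIGHER_IS_BETTER.items():
        change = relative_change(baseline[name], candidate[name])
        # Normalize the sign so that a positive value always means "worse".
        worse = -change if higher_is_better else change
        if worse > tolerance:
            flagged.append(name)
    return flagged


# Illustrative samples before and after a hypothetical pipeline update.
before = {"throughput_mbps": [100, 102, 98], "latency_ms": [10, 10, 10]}
after = {"throughput_mbps": [99, 101, 100], "latency_ms": [12, 11, 13]}

if __name__ == "__main__":
    print(degraded_kpis(before, after))
```

Here the mean throughput is unchanged while the mean latency rises by 20%, so only the latency KPI is flagged.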

XIII. CONCLUSIONS
The data engineering ecosystem will inevitably play an important role in next-generation network management and orchestration systems. In this tutorial paper, we highlight the recent advances in data engineering-based networks to meet the needs of network management and orchestration. We first provide a comprehensive analysis of existing frameworks and platforms, and then focus on recent standardization activities. Finally, we discuss the gaps, challenges, and future directions in building a data engineering-oriented networking system for telecommunication networks. Our tutorial analysis shows that data engineering frameworks can be used for a variety of purposes, ranging from data ingestion to data visualization, enabling telecommunication network operators to leverage the data generated by their users, environment, or network equipment.