Part of: Official Google Cloud Certified Professional Data Engineer Study Guide (Wiley Data and Cybersecurity)

Designing Data Pipelines


Chapter Abstract:

This chapter reviews high-level data pipeline design patterns, along with some variations on those patterns, and how GCP services such as Cloud Dataflow, Cloud Dataproc, Cloud Pub/Sub, and Cloud Composer are used to implement data pipelines.

A data pipeline is an abstraction that captures the idea that data flows from one stage of processing to another. Data pipelines are modeled as directed acyclic graphs (DAGs). A graph is a set of nodes linked by edges; in a directed graph, each edge points from one node to another, and in an acyclic graph no path leads from a node back to itself, so data always flows forward through the pipeline. A pipeline may have multiple nodes in a single stage. For example, a data warehouse that extracts data from three different sources would have three ingestion nodes.

Not all pipelines have all stages. A pipeline may ingest audit log messages, transform them, and write them to a Cloud Storage file without ever analyzing them; most of those log messages may never be viewed, but they must be stored in case they are needed. Log messages written to storage without reformatting or other processing would not need a transformation stage at all. Transformation is the process of mapping data from the structure used in the source system to the structure used in the storage and analysis stages of the pipeline.
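The ingest, transform, and store stages described above can be sketched as a small Apache Beam pipeline, the programming model that Cloud Dataflow executes. This is a minimal illustration, not an example from the chapter: the bucket paths, the field names, and the to_storage_record helper are hypothetical placeholders, and the same pipeline would only run on Cloud Dataflow if submitted with the appropriate runner and project options.

    # Minimal sketch of an ingest -> transform -> store pipeline in Apache Beam.
    # Bucket paths, field names, and the record schema are hypothetical.
    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    def to_storage_record(line: str) -> str:
        """Transform stage: map a raw audit-log line to the storage-side structure."""
        entry = json.loads(line)
        # Keep only the fields the downstream stages need (hypothetical schema).
        return json.dumps({
            "timestamp": entry.get("timestamp"),
            "principal": entry.get("principalEmail"),
            "method": entry.get("methodName"),
        })


    def run():
        # Pass --runner=DataflowRunner, --project, and --region here to run on Cloud Dataflow.
        options = PipelineOptions()
        with beam.Pipeline(options=options) as pipeline:
            (
                pipeline
                # Ingestion node: read raw log lines from Cloud Storage.
                | "Ingest" >> beam.io.ReadFromText("gs://example-bucket/raw/audit-*.json")
                # Transformation node: reshape each record for storage.
                | "Transform" >> beam.Map(to_storage_record)
                # Storage node: write the reshaped records back to Cloud Storage.
                | "Store" >> beam.io.WriteToText("gs://example-bucket/curated/audit")
            )


    if __name__ == "__main__":
        run()

Each labeled step ("Ingest", "Transform", "Store") corresponds to a node in the pipeline's DAG, and the pipe operators define the directed edges between them; dropping the "Transform" step would yield the storage-only variant described above.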
Page(s): 61 - 88
Copyright Year: 2020
Edition: 1