Google Dataflow

  • Author: Ronald Fung

  • Creation Date: 9 June 2023

  • Next Modified Date: 9 June 2024


A. Introduction

Dataflow is a fully managed service for running stream and batch data processing pipelines. It provides automated provisioning and management of compute resources and delivers consistent, reliable, exactly-once processing of your data. Dataflow supports many data processing use cases, including stream analytics, real-time AI, sensor and log data processing, and other workflows involving data transformation.


B. How is it used at Seagen

Seagen can use Google Cloud Dataflow to process large amounts of data in a scalable and efficient way. Here are some steps to get started with Google Cloud Dataflow:

  1. Create a Google Cloud account: Seagen can sign up for a Google Cloud account and create a project in the Google Cloud Console. This gives them access to Google Cloud Dataflow and other Google Cloud services.

  2. Create a Dataflow job: Seagen can create a Dataflow job in the Google Cloud Console, which represents a data processing pipeline. They can specify the job name, pipeline options, and other job settings.

  3. Configure the data pipeline: Seagen can configure the pipeline to read data from Azure (for example, Azure Blob Storage or Azure Data Lake Storage) and process it with Google Cloud Dataflow. The pipeline can read a variety of data formats, such as CSV, JSON, or Parquet files, and apply a variety of processing transforms, such as filtering, aggregating, or joining (see the pipeline sketch after these steps).

  4. Run the Dataflow job: Seagen can run the Dataflow job, using the Google Cloud Console or the Dataflow API. They can monitor the job progress, view the job logs, and troubleshoot any issues that arise.

  5. Analyze the output: Seagen can analyze the output of the Dataflow job, using different tools and services, such as BigQuery or Google Cloud Storage. They can visualize the data using different charts and graphs, and derive insights from the data to inform their research.
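
Below is a minimal sketch of steps 2 through 4 as an Apache Beam pipeline (the SDK Dataflow jobs are written with) submitted to the Dataflow runner. The project ID, region, bucket paths, and the CSV filter logic are all illustrative placeholders, not actual Seagen resources.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Job settings (step 2). All names below are placeholders.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",
        region="us-central1",
        temp_location="gs://my-bucket/temp",
        job_name="example-csv-filter",
    )

    # Pipeline definition (step 3): read CSV lines, drop rows with an
    # empty second column, and write the survivors back out.
    with beam.Pipeline(options=options) as p:  # leaving the block runs the job (step 4)
        (
            p
            | "ReadCSV" >> beam.io.ReadFromText(
                "gs://my-bucket/input/*.csv", skip_header_lines=1)
            | "ParseRows" >> beam.Map(lambda line: line.split(","))
            | "KeepNonEmpty" >> beam.Filter(lambda row: len(row) > 1 and row[1] != "")
            | "FormatOutput" >> beam.Map(lambda row: ",".join(row))
            | "WriteResults" >> beam.io.WriteToText("gs://my-bucket/output/results")
        )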

Overall, by using Google Cloud Dataflow, Seagen can process large amounts of data in a scalable and efficient way and derive insights that help accelerate their research. With its support for many data sources, its library of processing transforms, and its managed infrastructure, Google Cloud Dataflow is a strong choice for teams that need to process large volumes of data quickly and efficiently.


C. Features

Autoscaling of resources and dynamic work rebalancing

Minimize pipeline latency, maximize resource utilization, and reduce processing cost per data record with data-aware resource autoscaling. Data inputs are partitioned automatically and constantly rebalanced to even out worker resource utilization and reduce the effect of “hot keys” on pipeline performance.
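
As a concrete illustration, autoscaling is controlled through pipeline options. A minimal sketch; the worker counts below are illustrative, not recommendations:

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        autoscaling_algorithm="THROUGHPUT_BASED",  # scale workers with throughput and backlog
        num_workers=2,                             # initial worker count
        max_num_workers=50,                        # upper bound for the autoscaler
    )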

Flexible scheduling and pricing for batch processing

For processing with flexibility in job scheduling time, such as overnight jobs, flexible resource scheduling (FlexRS) offers a lower price for batch processing. These flexible jobs are placed into a queue with a guarantee that they will be retrieved for execution within a six-hour window.
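
FlexRS is enabled with a single pipeline option. A minimal sketch, with the other required job options omitted for brevity:

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        flexrs_goal="COST_OPTIMIZED",  # allow delayed, discounted batch execution;
                                       # "SPEED_OPTIMIZED" favors an earlier start
    )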

Ready-to-use real-time AI patterns

Dataflow’s ready-to-use patterns enable real-time AI capabilities, allowing pipelines to react in real time to large streams of events. Customers can build intelligent solutions ranging from predictive analytics and anomaly detection to real-time personalization and other advanced analytics use cases.
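
A hedged sketch of one such streaming pattern: read events from Pub/Sub, score each one, and publish anomalies to an alert topic. The topic names are placeholders, and the threshold check stands in for a real model call (for example, Beam’s RunInference transform or a Vertex AI endpoint):

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Streaming mode; project, region, and other job options omitted for brevity.
    options = PipelineOptions(runner="DataflowRunner", streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                topic="projects/my-gcp-project/topics/events")  # placeholder topic
            | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            # Toy anomaly "score"; a real pipeline would invoke a model here.
            | "ScoreEvent" >> beam.Map(lambda e: {**e, "anomaly": e.get("value", 0) > 100})
            | "KeepAnomalies" >> beam.Filter(lambda e: e["anomaly"])
            | "Encode" >> beam.Map(lambda e: json.dumps(e).encode("utf-8"))
            | "WriteAlerts" >> beam.io.WriteToPubSub(
                topic="projects/my-gcp-project/topics/alerts")  # placeholder topic
        )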


D. Where Implemented

LeanIX


E. How it is tested

Testing Google Cloud Dataflow involves ensuring that the data processing pipeline works correctly and that the output is as expected. Here are some steps to test Google Cloud Dataflow:

  1. Create a test data source: Create a test data source that mimics the production data source as closely as possible, including the data format, schema, and metadata.

  2. Create a test data processing pipeline: Create a test data processing pipeline that mimics the production pipeline as closely as possible, including the pipeline transforms, options, and parameters.

  3. Run the data processing pipeline: Run the test data processing pipeline, using the Google Cloud Console or the Dataflow API. Monitor the pipeline progress, view the pipeline logs, and troubleshoot any issues that arise.

  4. Verify the output: Verify the output of the test pipeline using tools and services such as BigQuery or Google Cloud Storage. Ensure that the output is as expected and consistent with the input data in terms of format, schema, and metadata (a unit-test sketch follows this list).

  5. Repeat the process: Repeat the process as needed, creating additional test data sources and processing pipelines to test different data formats or to simulate different data processing scenarios.
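
Here is a minimal sketch of steps 1 through 4 as a unit test, using Beam’s built-in testing utilities. The parse-and-filter transforms mirror the hypothetical pipeline sketched in section B; assert_that runs inside the pipeline and fails the test if the output differs from the expectation:

    import apache_beam as beam
    from apache_beam.testing.test_pipeline import TestPipeline
    from apache_beam.testing.util import assert_that, equal_to

    def test_filter_pipeline():
        # Step 1: an in-memory test source mimicking the production CSV rows.
        test_rows = ["1,alpha", "2,", "3,gamma"]
        with TestPipeline() as p:  # steps 2-3: build and run the test pipeline
            output = (
                p
                | "CreateTestData" >> beam.Create(test_rows)
                | "ParseRows" >> beam.Map(lambda line: line.split(","))
                | "KeepNonEmpty" >> beam.Filter(lambda row: len(row) > 1 and row[1] != "")
                | "FormatOutput" >> beam.Map(lambda row: ",".join(row))
            )
            # Step 4: verify the output matches the expected records.
            assert_that(output, equal_to(["1,alpha", "3,gamma"]))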

Overall, by thoroughly testing Google Cloud Dataflow, users can ensure that their data processing pipeline is reliable, scalable, and capable of handling large volumes of data. Users can also reach out to Google Cloud support for help with any technical challenges they encounter.


F. 2023 Roadmap

????


G. 2024 Roadmap

????


H. Known Issues

While Google Cloud Dataflow is a reliable and scalable data processing system, there are some known issues that users may encounter. Here are some of the known issues for Google Cloud Dataflow:

  1. Performance issues: Users may encounter performance issues with Google Cloud Dataflow, such as slow job execution times or high resource utilization. These issues can often be resolved by optimizing the data processing pipeline, such as using the appropriate transforms or adjusting the pipeline options.

  2. Data consistency issues: Users may encounter data consistency issues with Google Cloud Dataflow, such as data corruption or data loss. These issues can often be resolved by using durable data sources and implementing data validation and error handling mechanisms, such as routing malformed records to a dead-letter output (see the sketch after this list).

  3. Resource allocation issues: Users may encounter resource allocation issues with Google Cloud Dataflow, such as insufficient resources or resource contention. These issues can often be resolved by using the appropriate resource allocation policies, such as dynamic resource allocation or preemptible VMs.

  4. Monitoring and debugging issues: Users may encounter monitoring and debugging issues with Google Cloud Dataflow, such as incomplete logs or inaccurate metrics. These issues can often be resolved by using the appropriate monitoring and debugging tools, such as Cloud Monitoring and Cloud Logging (formerly Stackdriver) or Cloud Trace.

  5. Billing and cost issues: Users may encounter billing and cost issues with Google Cloud Dataflow, such as unexpected charges or incorrect usage reports. These issues can often be resolved by reviewing usage reports and monitoring billing statements in the Google Cloud Console.
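
For item 2, a common error-handling mechanism is the dead-letter pattern: records that fail parsing are tagged and written aside instead of failing the whole job. A minimal sketch, with placeholder paths and JSON parsing standing in for real validation logic:

    import json

    import apache_beam as beam
    from apache_beam.pvalue import TaggedOutput

    class ParseOrDeadLetter(beam.DoFn):
        DEAD_LETTER = "dead_letter"

        def process(self, line):
            try:
                yield json.loads(line)  # well-formed records go to the main output
            except ValueError:
                # Malformed records are diverted for later inspection.
                yield TaggedOutput(self.DEAD_LETTER, line)

    with beam.Pipeline() as p:
        results = (
            p
            | "ReadLines" >> beam.io.ReadFromText("gs://my-bucket/input/*.json")  # placeholder
            | "Parse" >> beam.ParDo(ParseOrDeadLetter()).with_outputs(
                ParseOrDeadLetter.DEAD_LETTER, main="parsed")
        )
        (results.parsed
         | "FormatGood" >> beam.Map(json.dumps)
         | "WriteGood" >> beam.io.WriteToText("gs://my-bucket/output/good"))
        (results.dead_letter
         | "WriteBad" >> beam.io.WriteToText("gs://my-bucket/output/dead_letter"))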

Overall, while these issues may affect some users, Google Cloud Dataflow remains a reliable and scalable data processing service that is widely used. By monitoring their Dataflow usage and reviewing usage reports and logs, users can keep their data processing resources secure and accessible and ensure they pay only for the resources they use. Users can also reach out to Google Cloud support for help with known issues or other technical challenges they encounter.


[x] Reviewed by Enterprise Architecture

[x] Reviewed by Application Development

[x] Reviewed by Data Architecture