Google Dataproc

  • Author: Ronald Fung

  • Creation Date: 9 June 2023

  • Next Modified Date: 9 June 2024


A. Introduction

Dataproc is a fully managed, highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and more than 30 other open source tools and frameworks. Typical uses are data lake modernization, ETL, and secure data science at scale, integrated with the rest of Google Cloud. Google markets it as doing this at a fraction of the cost of alternatives.


B. How is it used at Seagen

Seagen can use Google Cloud Dataproc to process large amounts of data in a scalable and efficient way. Here are some steps to get started with Google Cloud Dataproc:

  1. Create a Google Cloud account: Seagen creates a Google Cloud account, and a project with the Dataproc API enabled, in the Google Cloud Console. This gives access to Google Cloud Dataproc and other Google Cloud services.

  2. Create a Dataproc cluster: Seagen creates a Dataproc cluster, which is a managed Hadoop or Spark cluster, in the Google Cloud Console. The cluster name, region, machine types, and number of workers are specified at creation time.

  3. Configure the cluster: Seagen configures the cluster to read data from Azure Blob Storage or Azure Data Lake Storage and process it on Dataproc. Processing can use tools such as Hadoop, Spark, or Hive, and the input data can be in formats such as CSV, JSON, or Parquet.

  4. Run the Dataproc job: Seagen submits the job through the Google Cloud Console or the Dataproc API, then monitors its progress, reviews the job logs, and troubleshoots any issues that arise.

  5. Analyze the output: Seagen writes the job output to a service such as Google Cloud Storage or BigQuery, then queries and visualizes it there to derive insights that inform research.
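
Steps 2 to 4 can also be scripted with the `google-cloud-dataproc` Python client. The following is a minimal sketch, not Seagen's actual configuration: the project ID, region, cluster name, machine types, bucket path, and Azure storage account are all placeholder assumptions. Reading Azure paths additionally assumes the `hadoop-azure` connector has been made available on the cluster.

```python
"""Sketch: build Dataproc cluster and job requests (placeholder values throughout)."""


def build_cluster(project_id: str, cluster_name: str, workers: int = 2) -> dict:
    """Return a cluster definition in the shape the Dataproc API expects."""
    return {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            # Machine types are assumptions; size them for the real workload.
            "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
            "worker_config": {"num_instances": workers, "machine_type_uri": "n2-standard-4"},
        },
    }


def build_pyspark_job(cluster_name: str, main_file_uri: str,
                      azure_account: str, azure_key: str) -> dict:
    """Return a PySpark job definition that carries an Azure storage account key.

    The property is the hadoop-azure (ABFS) account-key setting; the
    `spark.hadoop.` prefix forwards it to the Hadoop configuration.
    """
    key_property = f"spark.hadoop.fs.azure.account.key.{azure_account}.dfs.core.windows.net"
    return {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": main_file_uri,
            "properties": {key_property: azure_key},
        },
    }


def run(project_id: str, region: str, cluster: dict, job: dict):
    """Create the cluster, submit the job, and wait for both (needs credentials)."""
    from google.cloud import dataproc_v1  # pip install google-cloud-dataproc

    options = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    clusters = dataproc_v1.ClusterControllerClient(client_options=options)
    jobs = dataproc_v1.JobControllerClient(client_options=options)

    clusters.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    ).result()
    return jobs.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    ).result()


if __name__ == "__main__":
    # Only builds and prints the requests; call run(...) to execute them.
    cluster = build_cluster("example-project", "example-cluster")
    job = build_pyspark_job("example-cluster", "gs://example-bucket/jobs/etl.py",
                            "exampleaccount", "<storage-account-key>")
    print(cluster["config"]["worker_config"])
    print(job["pyspark_job"]["main_python_file_uri"])
```

In a real deployment the storage account key would come from a secret store rather than a literal, and the job script itself would read paths of the form `abfss://<container>@<account>.dfs.core.windows.net/<path>`.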

Overall, Google Cloud Dataproc lets Seagen process large amounts of data in a scalable and efficient way and derive insights that help accelerate research. It supports multiple processing tools and can be driven from the Google Cloud Console or the Dataproc API, which makes it a good fit for large-scale batch processing.


C. Features

Fully managed and automated big data open source software

Serverless deployment, logging, and monitoring let you focus on your data and analytics, not on your infrastructure. Google cites a reduction of up to 54% in the total cost of ownership (TCO) of Apache Spark management, and says that integration with Vertex AI Workbench lets data scientists and engineers build and train models 5X faster than with traditional notebooks. The Dataproc Jobs API makes it easy to incorporate big data processing into custom applications, while Dataproc Metastore eliminates the need to run your own Hive metastore or catalog service.

Containerize Apache Spark jobs with Kubernetes

Dataproc on Kubernetes runs Apache Spark jobs on Google Kubernetes Engine (GKE) clusters, which provides job portability and isolation.

Enterprise security integrated with Google Cloud

When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos by adding a Security Configuration. The Google Cloud security features most commonly used with Dataproc include default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK).
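
As an illustration of the Security Configuration described above, the sketch below builds the security-related part of a cluster definition as a plain dictionary. The field names follow the Dataproc cluster resource (`security_config.kerberos_config` and `encryption_config`); the function name, key path, and password URI are hypothetical placeholders.

```python
def secure_cluster_config(kms_key: str, kerberos_password_uri: str) -> dict:
    """Return the security-related part of a Dataproc cluster config."""
    return {
        # Hadoop Secure Mode: Dataproc sets up Kerberos on the cluster.
        "security_config": {
            "kerberos_config": {
                "enable_kerberos": True,
                # Cloud Storage URI of the KMS-encrypted root principal password.
                "root_principal_password_uri": kerberos_password_uri,
                # KMS key used to decrypt that password.
                "kms_key_uri": kms_key,
            }
        },
        # CMEK: encrypt the cluster's persistent disks with a customer-managed key.
        "encryption_config": {"gce_pd_kms_key_name": kms_key},
    }


if __name__ == "__main__":
    # Placeholder resource names, for illustration only.
    config = secure_cluster_config(
        "projects/example-project/locations/us-central1/keyRings/example-ring/cryptoKeys/example-key",
        "gs://example-bucket/kerberos/root-password.encrypted",
    )
    print(sorted(config))
```

The returned dictionary would be merged into the `config` section of the cluster definition passed at cluster creation.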

The best of open source with the best of Google Cloud

Dataproc lets you keep the open source tools, algorithms, and programming languages that you use today and apply them to cloud-scale datasets. At the same time, Dataproc has out-of-the-box integration with the rest of the Google Cloud analytics, database, and AI ecosystem. Data scientists and engineers can quickly access data and build data applications by connecting Dataproc to BigQuery, Vertex AI, Cloud Spanner, Pub/Sub, or Data Fusion.


D. Where Implemented

LeanIX


E. How it is tested

Testing Google Cloud Dataproc involves ensuring that the cluster and the processing jobs are working correctly, and that the output is as expected. Here are some steps to test Google Cloud Dataproc:

  1. Create a test data source: Create a test data source that mimics the production data source as closely as possible, including the data format, schema, and metadata.

  2. Create a test Dataproc cluster: Create a test Dataproc cluster that mimics the production cluster as closely as possible, including the cluster configuration, machine types, and other settings.

  3. Configure the cluster: Configure the test Dataproc cluster to read data from the test data source and process it using the appropriate processing tool, such as Hadoop or Spark. Ensure that the cluster is properly configured and that the processing jobs are running correctly.

  4. Run the processing job: Run the test processing job, using the Google Cloud Console or the Dataproc API. Monitor the job progress, view the job logs, and troubleshoot any issues that arise.

  5. Verify the output: Verify the output of the test processing job where it is written, for example in BigQuery or Google Cloud Storage. Ensure that the output matches the expected results in terms of format, schema, row counts, and metadata.

  6. Repeat the process: Repeat the process as needed, creating additional test data sources and processing jobs to test different data formats or to simulate different processing scenarios.
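
The output check in step 5 can be automated. The following is a minimal sketch in plain Python, assuming the job output has already been loaded as a list of dictionaries (for example from JSON lines in Cloud Storage); the function name, column names, and sample rows are hypothetical.

```python
from typing import Iterable


def check_output(rows: Iterable[dict], expected_schema: dict, expected_rows: int) -> list:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    count = 0
    for index, row in enumerate(rows):
        count += 1
        # Columns that the schema requires but the row lacks.
        missing = expected_schema.keys() - row.keys()
        if missing:
            problems.append(f"row {index}: missing columns {sorted(missing)}")
        # Type check for present, non-null values.
        for column, expected_type in expected_schema.items():
            value = row.get(column)
            if value is not None and not isinstance(value, expected_type):
                problems.append(
                    f"row {index}: {column} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    if count != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {count}")
    return problems


if __name__ == "__main__":
    schema = {"sample_id": str, "reading": float}
    output = [
        {"sample_id": "S1", "reading": 0.42},
        {"sample_id": "S2", "reading": "n/a"},  # wrong type on purpose
    ]
    for problem in check_output(output, schema, expected_rows=2):
        print(problem)  # prints: row 1: reading is str, expected float
```

Returning a list of problems rather than raising on the first one lets a test run report every mismatch in a single pass.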

Overall, by thoroughly testing Google Cloud Dataproc, users can ensure that their data processing cluster is reliable, scalable, and capable of handling large volumes of data. Additionally, users can reach out to Google Cloud support for help with any technical challenges they may encounter.


F. 2023 Roadmap

????


G. 2024 Roadmap

????


H. Known Issues

While Google Cloud Dataproc is a reliable and scalable data processing system, there are some known issues that users may encounter. Here are some of the known issues for Google Cloud Dataproc:

  1. Performance issues: Users may encounter performance issues with Google Cloud Dataproc, such as slow job execution times or high resource utilization. These issues can often be resolved by optimizing the cluster configuration, such as using the appropriate machine types or adjusting the cluster settings.

  2. Data consistency issues: Users may encounter data consistency issues with Google Cloud Dataproc, such as data corruption or data loss. These issues can often be avoided by writing results to durable storage such as Cloud Storage rather than cluster-local HDFS, which is deleted with the cluster, and by implementing data validation and error handling in the jobs.

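  3. Resource allocation issues: Users may encounter resource allocation issues with Google Cloud Dataproc, such as insufficient resources or resource contention. These issues can often be resolved with appropriate resource allocation settings, such as Dataproc autoscaling policies or Spark dynamic resource allocation, or by adding preemptible VMs as secondary workers for extra capacity. Preemptible VMs can be reclaimed at any time, so they suit fault-tolerant work only.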

  4. Monitoring and debugging issues: Users may encounter monitoring and debugging issues with Google Cloud Dataproc, such as incomplete logs or inaccurate metrics. These issues can often be resolved by using the appropriate monitoring and debugging tools, such as Cloud Logging, Cloud Monitoring (together formerly Stackdriver), or Cloud Trace.

  5. Billing and cost issues: Users may encounter billing and cost issues with Google Cloud Dataproc, such as unexpected charges or incorrect usage reports. These issues can often be resolved by reviewing usage reports and monitoring billing statements in the Google Cloud Console.
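
The resource allocation options in item 3 can be expressed as configuration. The sketch below is illustrative only: the helper names and instance counts are assumptions, while `secondary_worker_config` is the Dataproc cluster field for secondary workers and the `spark.dynamicAllocation.*` keys are standard Spark properties.

```python
def with_secondary_workers(cluster: dict, count: int) -> dict:
    """Return a copy of a cluster definition with preemptible secondary workers added."""
    config = dict(cluster.get("config", {}))
    config["secondary_worker_config"] = {
        "num_instances": count,
        "preemptibility": "PREEMPTIBLE",
    }
    return {**cluster, "config": config}


def dynamic_allocation_properties(min_executors: int, max_executors: int) -> dict:
    """Return Spark properties that bound dynamic executor allocation for a job."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


if __name__ == "__main__":
    base = {"cluster_name": "example-cluster",
            "config": {"worker_config": {"num_instances": 2}}}
    print(with_secondary_workers(base, 4)["config"]["secondary_worker_config"])
    print(dynamic_allocation_properties(2, 20))
```

The Spark properties would go in the `properties` map of a job definition, and the cluster definition would be passed at cluster creation. Capping `maxExecutors` is one way to keep a single job from starving others on a shared cluster.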

Overall, while these issues may affect some users, Google Cloud Dataproc remains a reliable and scalable data processing system. By monitoring Dataproc usage and reviewing usage reports and logs, users can confirm that their clusters are healthy and that they pay only for the resources they use. Google Cloud support can help with these or any other technical issues.


[x] Reviewed by Enterprise Architecture

[x] Reviewed by Application Development

[x] Reviewed by Data Architecture