Google Dataprep

  • Author: Ronald Fung

  • Creation Date: 9 June 2023

  • Next Modified Date: 9 June 2024


A. Introduction

Dataprep by Trifacta enables you to explore, combine, and transform diverse datasets for downstream analysis.

Within an enterprise, the data required for key decisions typically resides in various silos. It comes in different formats and data types, is often inconsistent, and may need to be restructured for different audiences. All of this work must be done before you can begin extracting information valuable to the organization.

Data preparation (or data wrangling) has been a constant challenge for decades, and that challenge has only amplified as data volumes have exploded.


B. How is it used at Seagen

Seagen can use Google Cloud Dataprep to prepare and clean their data before processing it using other Google Cloud services, such as Google Cloud Dataproc or BigQuery. Here are some steps to get started with Google Cloud Dataprep:

  1. Create a Google Cloud account: Seagen can create a Google Cloud account in the Google Cloud Console. This will give them access to Google Cloud Dataprep and other Google Cloud services.

  2. Create a Dataprep project: Seagen can enable Dataprep on a Google Cloud project from the Google Cloud Console. That project then serves as the workspace for data preparation. They can specify the project name, region, and other project settings.

  3. Import the data: Seagen can import data into Dataprep from a variety of sources, in formats such as CSV, JSON, or Excel. They can specify the data source location, format, and schema.

  4. Prepare the data: Seagen can prepare the data using Dataprep’s intuitive interface and powerful data preparation tools. They can clean the data, remove duplicates, transform the data, and enrich the data with additional information.

  5. Export the data: Seagen can export the prepared data to other Google Cloud services, such as Google Cloud Storage or BigQuery, for further processing. They can specify the export format, destination, and other export settings.

Overall, Google Cloud Dataprep lets Seagen clean, validate, and enrich its data before handing it to other Google Cloud services for further processing. Its support for many data sources, its broad set of transformations, and its visual interface make it well suited to preparing large amounts of data quickly.
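To make steps 3 to 5 concrete, the following is a minimal sketch of an import, clean, and export pass written in plain Python. It is not Dataprep code and does not call any Dataprep API. The function names `prepare_rows` and `to_csv` and the sample data are hypothetical, chosen only to illustrate the kind of cleaning (trimming values, dropping empty rows, removing duplicates) that a recipe would typically perform.

```python
import csv
import io


def prepare_rows(csv_text):
    """Trim whitespace, drop empty rows, and remove exact duplicates.

    Assumes no row has more fields than the header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    cleaned = []
    for row in reader:
        # Strip surrounding whitespace; treat missing fields as empty.
        normalized = {key: (value or "").strip() for key, value in row.items()}
        if not any(normalized.values()):
            continue  # the row is entirely empty
        fingerprint = tuple(normalized.items())
        if fingerprint in seen:
            continue  # an identical row was already kept
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned


def to_csv(rows, fieldnames):
    """Serialize the prepared rows back to CSV text for export."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical input: padded values, one duplicate row, and one empty row.
SAMPLE = "id,name\n1, Alice \n1,Alice\n,\n2,Bob\n"
prepared = prepare_rows(SAMPLE)
exported = to_csv(prepared, ["id", "name"])
```

In Dataprep itself these operations would be recipe steps built in the interface. The sketch only shows the intended effect on the data.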


C. Features

Predictive transformation

Dataprep uses a proprietary inference algorithm to interpret the data transformation intent behind a user’s data selection. It then automatically generates a ranked set of suggestions, along with patterns that match the selection.

Rich transformations

Leverage hundreds of transformation functions to turn your data into the asset you want. With a click of a mouse, apply aggregation, pivot, unpivot, joins, union, extraction, calculation, comparison, condition, merge, regular expressions, and more.

Optimized processing throughput

Dataprep automatically selects the best underlying Google Cloud processing engine to transform the data as fast as possible. Based on data locality and volume, Dataprep uses BigQuery (in-place ELT transforms), Dataflow, or, for small volumes, its own in-memory engine.

Active profiling

Explore your data through interactive visual distributions that assist in discovery, cleansing, and transformation. Visual representations help interpret large volumes of data, and Dataprep’s profiling techniques present key statistical information in a dynamic, easy-to-consume format.

Data quality rules

Data quality rules suggest data quality indicators to monitor and remediate the accuracy, completeness, consistency, validity, and uniqueness of the data, ensuring that you have a comprehensive view of the cleanliness of your data.
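As an illustration of what such indicators measure, the sketch below computes completeness, validity, and uniqueness for a single column in plain Python. The function `quality_report`, the sample rows, and the simplified email pattern are assumptions made for this example. Dataprep defines and evaluates its rules in its own interface, and its exact formulas may differ.

```python
import re


def quality_report(rows, column, pattern):
    """Return simple quality ratios for one column.

    completeness: share of rows with a non-empty value
    validity:     share of non-empty values matching the pattern
    uniqueness:   share of non-empty values that are distinct
    """
    values = [row.get(column) for row in rows]
    present = [value for value in values if value not in (None, "")]
    valid = [value for value in present if re.fullmatch(pattern, value)]
    return {
        "completeness": len(present) / len(values) if values else 0.0,
        "validity": len(valid) / len(present) if present else 0.0,
        "uniqueness": len(set(present)) / len(present) if present else 0.0,
    }


# Hypothetical sample: one duplicate, one malformed value, one empty value.
sample_rows = [
    {"email": "a@example.com"},
    {"email": "b@example.com"},
    {"email": "a@example.com"},
    {"email": "not-an-email"},
    {"email": ""},
]
# Deliberately simple email pattern, for illustration only.
report = quality_report(sample_rows, "email", r"[^@\s]+@[^@\s]+\.[^@\s]+")
```

For this sample, four of five values are present, three of those four are valid, and three of those four are distinct.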

Collaboration

In team environments, it is often helpful to have multiple users work on the same assets, or to copy good-quality work as a template for others. Dataprep enables users to collaborate on the same flow objects in real time or to create copies for others to use for independent work.

Comprehensive connectivity

In addition to standard connectivity to BigQuery, Cloud Storage, Microsoft Excel, and Google Sheets, you can enrich your self-service analytics with hundreds of data sources such as Salesforce, Oracle, Microsoft SQL Server, MySQL, PostgreSQL, and many more.

Data pipeline orchestration

Schedule and automate your data preparation jobs by chaining them together in sequential and conditional order. Alert users of success or failure, and trigger external tasks (such as Cloud Functions). Leverage comprehensive APIs to integrate Dataprep as part of an enterprise’s end-to-end solution.
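The idea of chaining jobs in sequential and conditional order, with alerts on failure, can be sketched as follows. This is a generic illustration in plain Python, not the Dataprep plan feature or its API. The function `run_plan`, the step names, and the stand-in tasks are all hypothetical.

```python
def run_plan(steps, notify):
    """Run steps in order; each step runs only if its condition holds.

    steps:  list of (name, task, condition) tuples, where condition
            receives the results recorded so far.
    notify: callable that receives an alert message on failure.
    """
    results = {}
    for name, task, condition in steps:
        if not condition(results):
            results[name] = "skipped"
            continue
        try:
            task()
            results[name] = "success"
        except Exception as error:
            results[name] = "failed"
            notify(f"{name} failed: {error}")
    return results


def load_task():
    """Stand-in for a job that succeeds."""


def transform_task():
    """Stand-in for a job that fails."""
    raise RuntimeError("quota exceeded")


alerts = []
plan = [
    ("load", load_task, lambda results: True),
    ("transform", transform_task, lambda results: results["load"] == "success"),
    ("publish", load_task, lambda results: results["transform"] == "success"),
]
plan_results = run_plan(plan, alerts.append)
```

Here the failed transform step produces one alert, and the publish step is skipped because its condition depends on the transform succeeding.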

Enterprise-scale operationalization

Adopt a continuous deployment practice with recipe import/export across editions and versions, flow parameters, custom configuration for Dataflow or BigQuery, performance tuning, and advanced APIs to automate software development life cycles and monitoring.

Common data types

Transform structured or unstructured datasets stored in CSV, JSON, or relational table formats, as well as SaaS application data, at any size from megabytes to petabytes.

Pattern matching

Utilize columnar pattern matching to identify data patterns of interest to you and to surface them in the interface for use in building your recipes. Additionally, in your recipe steps, you can apply regular expressions or Dataprep patterns to locate patterns and transform the matching data in your datasets.
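As a rough illustration of locating a pattern and transforming the matching data, the sketch below uses a Python regular expression to standardize phone numbers. It uses Python's `re` syntax rather than Dataprep's own pattern syntax. The pattern, the function `standardize_phone`, and the sample values are assumptions for this example.

```python
import re

# Three digits, three digits, four digits, with optional separators
# and optional parentheses around the area code.
PHONE = re.compile(r"\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})")


def standardize_phone(value):
    """Return the first phone number found as NNN-NNN-NNNN, or None."""
    match = PHONE.search(value)
    if match is None:
        return None
    return "-".join(match.groups())


# Hypothetical column values in mixed formats.
raw_values = ["(206) 555-0100", "206.555.0100", "call 2065550100 now", "n/a"]
standardized = [standardize_phone(value) for value in raw_values]
```

Values without a match come back as `None`, which makes them easy to flag for manual review.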

Standardization

Group values by similarities based on spelling or language-independent pronunciation and create standardized clusters of consistent values.
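One common way to group values by spelling similarity is key-collision clustering: each value is reduced to a normalized key, and values that share a key form a cluster. The sketch below shows that technique in plain Python. It is an assumption that this resembles what Dataprep does internally. The source does not describe the algorithm, and pronunciation-based grouping is not covered here. All function names are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Normalize a value: strip accents, lowercase, sort unique tokens."""
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(set(tokens)))


def cluster(values):
    """Group values whose fingerprints collide."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


def standardize(values):
    """Replace each value with the most frequent spelling in its cluster."""
    canonical = {
        key: max(members, key=members.count)
        for key, members in cluster(values).items()
    }
    return [canonical[fingerprint(value)] for value in values]


# Hypothetical column with inconsistent spellings of the same city.
cities = ["Seattle, WA", "seattle wa", "WA Seattle", "Bothell, WA", "Seattle, WA"]
standardized_cities = standardize(cities)
```

Choosing the most frequent spelling as the canonical value is a design choice of this sketch. A reviewer could just as well pick the canonical value by hand for each cluster.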

Sampling

For performance optimization, Dataprep automatically generates one or more samples of the data for display and manipulation in the client application. However, you can easily change the size of samples, the scope of the sample, and the method by which the sample is created.
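The difference between sampling methods can be illustrated with two simple strategies: taking the first rows of a dataset, and drawing a uniform random sample in a single pass (reservoir sampling). This is a generic sketch in plain Python. The function names are hypothetical, and the source does not say how Dataprep implements its samples.

```python
import itertools
import random


def first_rows_sample(rows, size):
    """Take the first `size` rows: fast, but biased toward the file start."""
    return list(itertools.islice(rows, size))


def random_sample(rows, size, seed=0):
    """Uniform random sample of `size` rows in one pass over the data."""
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(rows):
        if index < size:
            reservoir.append(row)
        else:
            # Keep the new row with probability size / (index + 1).
            slot = rng.randint(0, index)
            if slot < size:
                reservoir[slot] = row
    return reservoir


head = first_rows_sample(range(100), 5)
sample = random_sample(range(1000), 10, seed=42)
```

A first-rows sample is cheap but may miss values that only appear later in the data, which is one reason to change the sampling method when the default sample looks unrepresentative.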

Advanced security

Dataprep extends existing security controls with individual data access control. It uses a combination of Google IAM roles and BigQuery, Cloud Storage, and Google Sheets access rights to determine access.


D. Where Implemented

LeanIX


E. How it is tested

Testing Google Cloud Dataprep involves ensuring that the data preparation steps are working correctly, and that the output is as expected. Here are some steps to test Google Cloud Dataprep:

  1. Create a test data source: Create a test data source that mimics the production data source as closely as possible, including the data format, schema, and metadata.

  2. Create a test Dataprep project: Create a test Dataprep project that mimics the production project as closely as possible, including the project configuration, region, and other settings.

  3. Import the data: Import the test data into Dataprep, using the same data source location, format, and schema as the production data.

  4. Prepare the data: Prepare the test data using the same data preparation steps as the production data, such as cleaning, transforming, and enriching the data.

  5. Verify the output: Verify the output of the test data preparation steps where it lands, for example in Google Cloud Storage or BigQuery. Ensure that the output matches the expected format, schema, and metadata, and that its contents are consistent with the input data.

  6. Repeat the process: Repeat the process as needed, creating additional test data sources and preparation steps to test different data formats or to simulate different data preparation scenarios.

Overall, by thoroughly testing Google Cloud Dataprep, users can ensure that their data preparation pipeline is reliable, scalable, and capable of handling large volumes of data. Additionally, users can reach out to Google Cloud support for help with any technical challenges they may encounter.
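The verification in step 5 could be partly automated. The sketch below checks an exported CSV against an expected column list, row count, and key uniqueness. It is written in plain Python under the assumption that the output has been downloaded as CSV text. The function `verify_output` and the sample data are hypothetical, and a real check would read from wherever the job publishes its results.

```python
import csv
import io


def verify_output(csv_text, expected_columns, expected_row_count, key_column):
    """Return a list of problems found; an empty list means the check passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    actual_columns = list(reader.fieldnames or [])
    rows = list(reader)
    problems = []
    if actual_columns != list(expected_columns):
        problems.append(
            f"columns {actual_columns} do not match {list(expected_columns)}"
        )
    if len(rows) != expected_row_count:
        problems.append(f"expected {expected_row_count} rows, found {len(rows)}")
    if key_column in actual_columns:
        keys = [row[key_column] for row in rows]
        if len(set(keys)) != len(keys):
            problems.append(f"duplicate values in key column {key_column!r}")
    return problems


# Hypothetical outputs: one as expected, one with a duplicated row.
good_output = "id,name\n1,Alice\n2,Bob\n"
bad_output = "id,name\n1,Alice\n1,Alice\n"
good_problems = verify_output(good_output, ["id", "name"], 2, "id")
bad_problems = verify_output(bad_output, ["id", "name"], 3, "id")
```

Returning a list of problems rather than raising on the first one lets a test run report every mismatch at once.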


F. 2023 Roadmap

????


G. 2024 Roadmap

????


H. Known Issues

While Google Cloud Dataprep is a reliable and powerful data preparation tool, there are some known issues that users may encounter. Here are some of the known issues for Google Cloud Dataprep:

  1. Performance issues: Users may encounter performance issues with Google Cloud Dataprep, such as slow data processing times or high resource utilization. These issues can often be resolved by optimizing the project configuration, such as using the appropriate machine types or adjusting the project settings.

  2. Data consistency issues: Users may encounter data consistency issues with Google Cloud Dataprep, such as data corruption or data loss. These issues can often be resolved by using the appropriate data sources, such as durable storage systems, and implementing data validation and error handling mechanisms.

  3. Resource allocation issues: Users may encounter resource allocation issues with Google Cloud Dataprep, such as insufficient resources or resource contention. These issues can often be resolved by using the appropriate resource allocation policies, such as dynamic resource allocation or preemptible VMs.

  4. Connectivity issues: Users may encounter connectivity issues with Google Cloud Dataprep, such as network errors or authentication failures. These issues can often be resolved by configuring the appropriate network settings and ensuring that the user has the appropriate permissions.

  5. Billing and cost issues: Users may encounter billing and cost issues with Google Cloud Dataprep, such as unexpected charges or incorrect usage reports. These issues can often be resolved by reviewing usage reports and monitoring billing statements in the Google Cloud Console.

Overall, while these issues may affect some users, Google Cloud Dataprep remains a widely used data preparation tool. By monitoring their Dataprep usage and reviewing usage reports and logs, users can catch problems early and confirm that they are paying only for the resources they use. Users can also reach out to Google Cloud support for help with any known issues or other technical challenges they may encounter.


[x] Reviewed by Enterprise Architecture

[x] Reviewed by Application Development

[x] Reviewed by Data Architecture