Azure Data Factory

  • Author: Ronald Fung

  • Creation Date: 1 June 2023

  • Next Modified Date: 1 June 2024


A. Introduction

In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage systems. However, on its own, raw data doesn’t have the proper context or meaning to provide meaningful insights to analysts, data scientists, or business decision makers.

Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud service that’s built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.

Usage scenarios

For example, imagine a gaming company that collects petabytes of game logs that are produced by games in the cloud. The company wants to analyze these logs to gain insights into customer preferences, demographics, and usage behavior. It also wants to identify up-sell and cross-sell opportunities, develop compelling new features, drive business growth, and provide a better experience to its customers.

To analyze these logs, the company needs to use reference data such as customer information, game information, and marketing campaign information that is in an on-premises data store. The company wants to utilize this data from the on-premises data store, combining it with additional log data that it has in a cloud data store.

To extract insights, the company hopes to process the joined data by using a Spark cluster in the cloud (Azure HDInsight) and to publish the transformed data into a cloud data warehouse such as Azure Synapse Analytics, so that it can easily build reports on top of it. The company wants to automate this workflow and to monitor and manage it on a daily schedule. It also wants to execute the workflow when files land in a blob store container.

Azure Data Factory is the platform that solves such data scenarios. It is the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.

Additionally, you can publish your transformed data to data stores such as Azure Synapse Analytics for business intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized into meaningful data stores and data lakes for better business decisions.
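
As an illustration of the pipeline concept described above, the following is a minimal sketch of authoring and publishing a simple copy pipeline with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, dataset, and pipeline names are hypothetical placeholders, and the exact model constructors can vary slightly between SDK versions.

    # Minimal sketch: author a pipeline with one copy activity and publish it.
    # Assumes the azure-identity and azure-mgmt-datafactory packages are installed
    # and that the referenced datasets already exist in the factory.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink
    )

    subscription_id = "<subscription-id>"          # placeholder
    resource_group = "rg-data-platform"            # hypothetical resource group
    factory_name = "adf-game-analytics"            # hypothetical data factory

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    # Copy raw game logs from a landing dataset into a staging dataset.
    copy_logs = CopyActivity(
        name="CopyGameLogs",
        inputs=[DatasetReference(type="DatasetReference", reference_name="RawGameLogs")],
        outputs=[DatasetReference(type="DatasetReference", reference_name="StagedGameLogs")],
        source=BlobSource(),
        sink=BlobSink(),
    )

    pipeline = PipelineResource(activities=[copy_logs])
    adf_client.pipelines.create_or_update(
        resource_group, factory_name, "IngestGameLogs", pipeline)

The same pipeline can equally be built visually in ADF Studio; the SDK route is sketched here because pipeline definitions created this way are easy to version-control and review.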


B. How it is used at Seagen

As a biopharma research company using Microsoft Azure, you can use Azure Data Factory to create and manage data pipelines that move and transform data between different sources and destinations. Here are some ways you can use Azure Data Factory:

  1. Data integration: Azure Data Factory allows you to integrate data from multiple sources, such as databases, files, and cloud storage services, into a single location, making it easier to manage and analyze.

  2. Data transformation: Azure Data Factory enables you to transform data using a variety of built-in and custom transformations, allowing you to standardize, clean, and prepare data for analysis.

  3. Data movement: Azure Data Factory can move data between on-premises and cloud-based data sources, allowing you to leverage the scalability and flexibility of the cloud while maintaining control over your data (see the linked-service sketch after this list).

  4. Orchestration: Azure Data Factory provides a centralized location for managing data pipelines, allowing you to monitor and manage data flows across your organization.

  5. Integration with Azure services: Azure Data Factory integrates with other Azure services, such as Azure Data Lake Storage, Azure Synapse Analytics, and Azure Databricks, allowing you to easily move and transform data across your Azure environment.

  6. Improved productivity: Azure Data Factory can improve productivity by reducing the time and effort required to move and transform data, allowing your team to focus on more important tasks.

  7. Security: Azure Data Factory provides built-in security features, such as role-based access control and integration with Azure Active Directory, ensuring that your data is properly secured and protected.
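
To illustrate items 1 and 3 (data integration and data movement), the sketch below registers two linked services: an on-premises SQL Server reached through a self-hosted integration runtime, and an Azure Storage account in the cloud. All names, connection strings, and the integration runtime are hypothetical placeholders, and authentication details are deliberately simplified.

    # Minimal sketch: register connections (linked services) that a copy pipeline
    # can use to move data between an on-premises database and cloud storage.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        LinkedServiceResource, SqlServerLinkedService, AzureStorageLinkedService,
        IntegrationRuntimeReference, SecureString
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, factory = "rg-research-data", "adf-research"   # hypothetical names

    # On-premises source, reached through an already-registered self-hosted
    # integration runtime named "SelfHostedIR" (hypothetical).
    onprem_sql = LinkedServiceResource(properties=SqlServerLinkedService(
        connection_string="Server=onprem-sql01;Database=ReferenceData;Integrated Security=True;",
        connect_via=IntegrationRuntimeReference(
            type="IntegrationRuntimeReference", reference_name="SelfHostedIR"),
    ))
    adf_client.linked_services.create_or_update(rg, factory, "OnPremReferenceDb", onprem_sql)

    # Cloud destination: an Azure Storage account (key shown as a placeholder;
    # a managed identity or Key Vault reference is preferable in practice).
    storage = LinkedServiceResource(properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>")))
    adf_client.linked_services.create_or_update(rg, factory, "ResearchStorage", storage)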

Overall, Azure Data Factory provides a powerful and flexible tool for creating and managing data pipelines, making it easier for your biopharma research team to integrate, transform, and move data across your organization. By leveraging the security, scalability, and performance of the service, you can streamline data workflows, improve productivity, and gain insights from your data more effectively.


C. Features

Azure Data Factory is a cloud-based data integration service that allows you to create and manage data pipelines that move and transform data between various sources and destinations. Here are some of the key features of Azure Data Factory:

  1. Data integration: Azure Data Factory enables you to integrate data from multiple sources, including on-premises and cloud-based data sources, into a single location, making it easier to manage and analyze.

  2. Data transformation: Azure Data Factory provides a variety of built-in and custom transformations that allow you to transform data, including data cleaning, data type conversions, and data aggregations.

  3. Data movement: Azure Data Factory can move data between on-premises and cloud-based data sources, allowing you to leverage the scalability and flexibility of the cloud while maintaining control over your data.

  4. Orchestration: Azure Data Factory provides a centralized location for managing data pipelines, allowing you to monitor and manage data flows across your organization.

  5. Integration with Azure services: Azure Data Factory integrates with other Azure services, such as Azure Data Lake Storage, Azure Synapse Analytics, and Azure Databricks, allowing you to easily move and transform data across your Azure environment.

  6. Improved productivity: Azure Data Factory can improve productivity by reducing the time and effort required to move and transform data, allowing your team to focus on more important tasks.

  7. Security: Azure Data Factory provides built-in security features, such as role-based access control and integration with Azure Active Directory, ensuring that your data is properly secured and protected.

  8. Flexible scheduling: Azure Data Factory enables you to schedule data pipelines to run at specific times or intervals, allowing you to automate data workflows and processes (a scheduling sketch follows this list).

  9. Monitoring and alerts: Azure Data Factory provides monitoring and alerting capabilities, allowing you to monitor the performance and health of your data pipelines and receive alerts when issues occur.

  10. Custom code: Azure Data Factory allows you to create custom code to extend the functionality of the service, using languages such as Python, .NET, and Java.
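
As a small illustration of feature 8 (flexible scheduling), the sketch below attaches a daily schedule trigger to the hypothetical IngestGameLogs pipeline from the earlier example, again using the azure-mgmt-datafactory Python SDK. Names are placeholders, and in older SDK releases the final call is triggers.start rather than triggers.begin_start.

    # Minimal sketch: create and start a daily schedule trigger for a pipeline.
    from datetime import datetime, timedelta
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
        TriggerPipelineReference, PipelineReference
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, factory = "rg-research-data", "adf-research"   # hypothetical names

    # Run the pipeline once a day, starting shortly after the trigger is created.
    recurrence = ScheduleTriggerRecurrence(
        frequency="Day", interval=1,
        start_time=datetime.utcnow() + timedelta(minutes=15), time_zone="UTC")
    trigger = TriggerResource(properties=ScheduleTrigger(
        recurrence=recurrence,
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                type="PipelineReference", reference_name="IngestGameLogs"))],
    ))
    adf_client.triggers.create_or_update(rg, factory, "DailyIngest", trigger)

    # Triggers are created in a stopped state and must be started explicitly.
    adf_client.triggers.begin_start(rg, factory, "DailyIngest").result()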

Overall, Azure Data Factory provides a powerful and flexible tool for creating and managing data pipelines, making it easier to integrate, transform, and move data across your organization. By leveraging the security, scalability, and performance of the service, you can streamline data workflows, improve productivity, and gain insights from your data more effectively.


D. Where Implemented

LeanIX


E. How it is tested

Testing Azure Data Factory involves verifying that the data pipelines you have created and configured are working as expected. Here are some steps you can take to test Azure Data Factory:

  1. Verify configuration: Verify that Azure Data Factory is properly configured and integrated with your Azure account and resources.

  2. Test data integration: Test Azure Data Factory by creating data pipelines that integrate data from multiple sources, such as databases, files, and cloud storage services, and verifying that data is properly integrated.

  3. Test data transformation: Test the data transformation capabilities of Azure Data Factory by creating data pipelines that transform data using built-in and custom transformations, and verifying that data is properly transformed.

  4. Test data movement: Test the data movement capabilities of Azure Data Factory by creating data pipelines that move data between on-premises and cloud-based data sources, and verifying that data is properly moved.

  5. Test orchestration: Test the orchestration capabilities of Azure Data Factory by monitoring and managing data pipelines across your organization, and verifying that data flows are properly orchestrated (see the run-monitoring sketch after this list).

  6. Test integration with Azure services: Test the integration capabilities of Azure Data Factory by integrating it with other Azure services, such as Azure Data Lake Storage, Azure Synapse Analytics, and Azure Databricks, and verifying that data can be moved and transformed effectively.

  7. Assess productivity: Evaluate the productivity benefits of Azure Data Factory by measuring whether the service reduces the time and effort required to move and transform data compared with your existing processes, freeing your team to focus on higher-value tasks.

  8. Test security: Test the security capabilities of Azure Data Factory by ensuring that data is properly secured and protected, and that access is controlled through role-based access control and integration with Azure Active Directory.
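
A simple way to automate several of the checks above (data movement, orchestration, monitoring) is to trigger a pipeline run and assert on its final status. The following is a minimal sketch using the azure-mgmt-datafactory Python SDK; the resource group, factory, and pipeline names are hypothetical placeholders.

    # Minimal sketch: run a pipeline and verify that it completes successfully.
    import time
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, factory, pipeline = "rg-research-data", "adf-research", "IngestGameLogs"  # hypothetical

    # Kick off a run and poll its status until it reaches a terminal state.
    run = adf_client.pipelines.create_run(rg, factory, pipeline, parameters={})
    while True:
        status = adf_client.pipeline_runs.get(rg, factory, run.run_id).status
        if status not in ("Queued", "InProgress"):
            break
        time.sleep(30)

    assert status == "Succeeded", f"Pipeline run finished with status {status}"

A check like this can be wrapped in a test suite and run against a non-production data factory as part of a deployment process.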

Overall, testing Azure Data Factory involves verifying that the data pipelines you have created and configured are working as expected. By testing Azure Data Factory, you can ensure that you are effectively using the service to move and transform data across your organization, and that you are benefiting from the security, scalability, and performance it provides.


F. 2023 Roadmap

????


G. 2024 Roadmap

????


H. Known Issues

Like any software or service, Azure Data Factory has known issues and limitations that users should be aware of. Here are some of them:

  1. Limited customization: Azure Data Factory has limited customization options, which can limit the ability of users to configure the service to their specific needs.

  2. No built-in storage: Azure Data Factory is an orchestration and integration service rather than a data store; it does not provide persistent storage of its own, so data shared across multiple pipelines must be persisted in external stores.

  3. Limited monitoring and logging: Azure Data Factory's built-in monitoring retains pipeline run history for a limited period (around 45 days); longer-term retention and deeper troubleshooting require routing diagnostic logs to Azure Monitor or Log Analytics.

  4. Cost: Azure Data Factory can be expensive for users with limited budgets, particularly if they manage large volumes of data or use the service frequently.

  5. Security and compliance concerns: Users must ensure that they are properly securing and protecting their data when using Azure Data Factory, particularly when handling sensitive data or data subject to regulatory compliance requirements.

  6. Limited data transformation capabilities: Azure Data Factory has limited data transformation capabilities compared to other ETL tools, which can limit the ability of users to transform data in complex ways.

  7. Limited data source support: Azure Data Factory may not support all data sources, which can limit the ability of users to integrate data from certain sources.

Overall, while Azure Data Factory offers a powerful and flexible tool for creating and managing data pipelines, users must be aware of these known issues and take steps to mitigate their impact. This may include configuring the service to meet the specific needs of their data, monitoring its performance and cost to confirm that it remains a good fit, and integrating it into existing workflows so that it is used effectively. By taking these steps, users can ensure that they are using Azure Data Factory effectively to manage their data pipelines and benefiting from the security, scalability, and performance it provides.


[x] Reviewed by Enterprise Architecture

[x] Reviewed by Application Development

[x] Reviewed by Data Architecture