Building a Data Pipeline with Azure Functions

Herve Roggero
Sep 24, 2018

This blog post discusses how Enzo Data Pipelines was built on top of Azure Functions, a serverless environment, to address scalability objectives. The architecture presented in this blog was implemented as a technology made available as part of Enzo Online (https://portal.enzounified.com).

You can learn more and even try Enzo Data Pipelines by reading a lab that guides you through the steps of implementing your own data pipeline: https://bit.ly/2xmmj9S. For more information contact hroggero@enzounified.com.

Data Pipeline Considerations

Building data pipelines involves key design objectives, including read and write strategies, a decoupled staging environment, and replay logic for Change Data Capture (CDC) scenarios. While read and write strategies are mostly concerned with system API abstraction, decoupling the write strategy from the read strategy through the staging environment provides an independent write mechanism that can scale without affecting the source system.

For a more in-depth overview of data pipeline considerations, please see my previous post on Scalable Cloud Data Synchronization.

Data Pipeline Implementation with Azure Functions

Now that we have established the key foundational capabilities of a data pipeline, let’s see how we can use Azure Functions and other services to implement it. The following diagram shows the conceptual implementation, with SharePoint Online as the source system and three destination systems:

The Enzo component in the middle of the diagram represents the actual data pipeline engine, which includes the read strategy, the data store, the write strategies and the replay logic. As you can see, the destination systems are very different in nature: a cloud service, a relational database and a messaging service. Because each destination is decoupled and has its own write strategy, the architecture is extensible: a new destination can be added without changing the existing ones.

Why Azure Functions?

Azure Functions are used to implement a serverless runtime of the data pipeline engine, so that if multiple systems must be read or written to at the same time, multiple instances of the function will be started. This means that the data pipeline engine itself does not need to worry about multi-threading concerns.

In the above example, since the data will be sent to three distinct destination systems, up to three instances of the Azure Function will be running at the same time. The serverless environment eliminates a key scalability concern: managing concurrency. If N source systems must be read at the same time, and P destination systems must be written to at the same time, all with various error/retry logic and timeouts, N + P instances of the Azure Function can be executed in parallel automatically.
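To make the N + P idea concrete, here is a minimal sketch in Python. It is not the Enzo implementation: the system names and the run_strategy function are hypothetical, and a local thread pool merely stands in for the Azure Functions runtime starting separate instances. The point is that the strategy code itself contains no threading logic.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical system names, for illustration only.
SOURCES = ["sharepoint-online"]                                   # N = 1
DESTINATIONS = ["cloud-service", "sql-database", "message-bus"]  # P = 3


def run_strategy(kind, system):
    """Stand-in for one function invocation; note it has no threading code."""
    return f"{kind}:{system}:done"


def run_all(sources, destinations):
    """Run one invocation per system, N + P in total.

    The thread pool only simulates what the serverless runtime does when it
    starts separate function instances on its own.
    """
    jobs = [("read", s) for s in sources] + [("write", d) for d in destinations]
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        # pool.map returns results in the same order as the submitted jobs.
        return list(pool.map(lambda job: run_strategy(*job), jobs))


results = run_all(SOURCES, DESTINATIONS)
```

In this sketch the concurrency lives entirely in run_all, which plays the role of the platform; run_strategy stays single-threaded, which is the property the serverless model provides.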

Implementation Architecture

The following diagram shows a simplified implementation of the Enzo Data Pipeline architecture using Azure Functions. Note that Azure Storage is also used to provide state management and orchestration.

The physical implementation can be done by creating two separate functions (a read function and a write function), or as a single function that is smart enough to perform both operations. The architecture of Azure Functions ensures proper scalability regardless of the implementation choice.
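As an illustration of the single-function option, the sketch below routes each incoming message to a read or a write handler. The message shape (an "operation" and a "system" field) and the handler names are assumptions made for this example, not the actual Enzo message format.

```python
def read_changes(source):
    """Hypothetical read handler: would pull changes from the source system."""
    return f"read from {source}"


def write_changes(destination):
    """Hypothetical write handler: would push changes to the destination."""
    return f"wrote to {destination}"


# One entry point, two operations.
HANDLERS = {"read": read_changes, "write": write_changes}


def handle(message):
    """Single function that dispatches on the requested operation."""
    operation = message["operation"]
    if operation not in HANDLERS:
        raise ValueError(f"unknown operation: {operation}")
    return HANDLERS[operation](message["system"])
```

With two separate functions, each handler would simply become its own entry point and the dispatch table would disappear.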

The read strategy operates independently from the write strategy, and stores data changes in blobs (Azure Storage). When the read strategy completes, it sends a message in an Azure Queue to wake up the write strategies that have been configured; one message per write strategy. At that point each write strategy executes at roughly the same time, running as an Azure Function.
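The flow described above can be sketched as follows. This is a simplified, in-memory model: a dictionary stands in for Azure Blob Storage and a standard-library queue stands in for an Azure Queue, and the blob naming, message fields and strategy names are all assumptions made for illustration.

```python
import json
import queue

blob_store = {}              # stand-in for Azure Blob Storage
write_queue = queue.Queue()  # stand-in for an Azure Queue

# Hypothetical write strategies configured for this pipeline.
WRITE_STRATEGIES = ["cloud-service", "sql-database", "message-bus"]


def read_strategy(batch_id, changes):
    """Stage the changes once, then enqueue one message per write strategy."""
    blob_name = f"changes/{batch_id}.json"
    blob_store[blob_name] = json.dumps(changes)
    for strategy in WRITE_STRATEGIES:
        write_queue.put(json.dumps({"strategy": strategy, "blob": blob_name}))
    return blob_name


def write_strategy(message):
    """Handle one queue message: load the staged blob, never the source."""
    msg = json.loads(message)
    changes = json.loads(blob_store[msg["blob"]])
    # A real write strategy would send the changes to its destination here.
    return msg["strategy"], len(changes)


read_strategy("batch-001", [{"id": 1, "op": "insert"}, {"id": 2, "op": "update"}])

delivered = []
while not write_queue.empty():
    delivered.append(write_strategy(write_queue.get()))
```

Each write strategy reads from the staged blob rather than from the source system, so adding a destination adds a queue message but no extra load on the source. In this sketch the messages are drained one after another; on Azure Functions each message would typically trigger its own invocation.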

If an error occurs with one of the destination systems, or if an administrator wants to replay a section of the CDC log, the replay logic comes into play. An administrator can re-execute the CDC log at will by specifying a point in time from which the log should be replayed. For certain scenarios, such as the temporary unavailability of a destination system (for example, a database being unavailable for a few minutes), the retry logic will automatically attempt to replay the log without the need for manual intervention.
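A point-in-time replay with automatic retries might look like the sketch below. The log entry shape, the TransientError type and the max_attempts parameter are hypothetical, and a production version would normally wait between attempts; the delay is left out here to keep the example short.

```python
from datetime import datetime


class TransientError(Exception):
    """Hypothetical error: the destination is temporarily unavailable."""


def replay(log, since, write, max_attempts=3):
    """Re-send every entry at or after `since`, retrying transient failures."""
    written = 0
    for entry in log:
        if entry["timestamp"] < since:
            continue  # before the requested point in time
        for attempt in range(1, max_attempts + 1):
            try:
                write(entry)
                written += 1
                break
            except TransientError:
                if attempt == max_attempts:
                    raise  # give up; an administrator can replay later
    return written


CDC_LOG = [
    {"timestamp": datetime(2018, 9, 24, 10, 0), "id": 1},
    {"timestamp": datetime(2018, 9, 24, 11, 0), "id": 2},
    {"timestamp": datetime(2018, 9, 24, 12, 0), "id": 3},
]

received = []
failures = {"remaining": 1}  # the destination fails exactly once


def flaky_write(entry):
    """Simulated destination that is briefly unavailable."""
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise TransientError("destination unavailable")
    received.append(entry["id"])


count = replay(CDC_LOG, datetime(2018, 9, 24, 11, 0), flaky_write)
```

Here the first write attempt fails and is retried automatically, while the entry logged before the chosen point in time is skipped. Because the log is kept in the staging store, replaying it does not require reading from the source system again.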

Summary

This blog post discusses how Azure Functions were used to implement a scalable data pipeline architecture using key design principles discussed in a previous blog post. In addition to using Azure Functions, the solution uses other Azure capabilities such as Azure Storage for its inner workings.

This high-level architecture was implemented with a technology called Enzo Data Pipelines, built on top of Enzo Online. You can learn more and even try Enzo Data Pipelines by reading a lab that guides you through the steps of implementing your own data pipeline: https://bit.ly/2xmmj9S

For more information contact hroggero@enzounified.com.
