Scalable Cloud Data Synchronization
This blog post discusses key architecture considerations when building a scalable cloud data synchronization system (also referred to as a data pipeline). It is based on lessons learned from building Enzo Data Pipelines, a Change Data Capture (CDC) technology that runs on top of Enzo Online (https://portal.enzounified.com).
You can learn more and even try Enzo Data Pipelines by reading a lab that guides you through the steps of implementing your own data pipeline: https://bit.ly/2xmmj9S. For more information contact email@example.com.
The Need for Data Pipelines
A data pipeline is a technology that synchronizes data from a source system to one or more destination systems, so that the data gets copied easily and automatically without user intervention, and more importantly without writing a single line of code.
In many cases, data is stored in external systems such as an online CRM tool, is generated by devices, or changes regularly, as with weather data or Twitter feeds. Regardless of the source, the current data may need to be copied to another system; once the initial data has been copied, ongoing changes also need to be forwarded for processing. I refer to the initial copy of the source data as the “Initial Sync”, and to the ongoing changes as “Change Data Capture”, or CDC for short.
There are many reasons for implementing Initial Sync and/or CDC data pipelines, including:
Exporting data from a source system where it is difficult to analyze/report on
Reacting to changes in the source system to trigger actions in other systems
Monitoring for undesired changes or critical modifications for security reasons
Implementing a “dumb pipe, smart endpoint” architecture for loosely integrated systems
Regardless of the reason, be it systems integration, reporting or event-driven architecture, creating a data pipeline can be hard.
Data Pipeline Architecture Considerations
A data pipeline system needs to have a few key capabilities from a functional standpoint: read strategies, a standardized staging store, decoupled write strategies, and a replay mechanism. For simplicity I am ignoring non-functional objectives, such as security, multi-tenancy and error management. Let’s review the functional architecture requirements in more detail.
Read Strategies
A read strategy abstracts how to read the source system, both for the full Initial Sync and for CDC. Some source systems do not offer an Initial Sync; for example, a stream of flight schedules is a “forward-only” data stream and may not provide past data. Other systems do; for example, the Initial Sync for SharePoint Online would consist of reading a complete List.
The read strategy may also need to account for system-specific behavior. For example, SharePoint Online only allows reading up to 4,000 items at a time, while an Azure Table only allows up to 1,000 items. As a result, a read strategy also needs to implement a paging mechanism for reading the full data set.
The read strategy needs to run at specific intervals, or on demand, so that changes to the source system are forwarded automatically. A simple timer, such as one that fires every hour, may work; in some cases it may be necessary to implement a webhook so that other mechanisms can trigger the pipeline.
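The paging behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `fetch_page` signature, the integer cursor, and the in-memory source are hypothetical stand-ins, not the SharePoint Online or Azure Table APIs.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical page-fetch signature: (cursor, page_size) -> (items, next_cursor).
# A next_cursor of None means the source has no more data.
FetchPage = Callable[[Optional[int], int], Tuple[List[dict], Optional[int]]]


def read_all(fetch_page: FetchPage, page_size: int) -> Iterator[dict]:
    """Yield every item from a source that caps the number of items per call."""
    cursor: Optional[int] = None
    while True:
        items, cursor = fetch_page(cursor, page_size)
        yield from items
        if cursor is None:
            break


def make_in_memory_source(rows: List[dict]) -> FetchPage:
    """Stand-in for a real API: serves slices of a list, using the offset as cursor."""
    def fetch_page(cursor: Optional[int], page_size: int):
        start = cursor or 0
        end = start + page_size
        next_cursor = end if end < len(rows) else None
        return rows[start:end], next_cursor
    return fetch_page
```

The caller only sees a flat stream of items; the page size limit stays inside the read strategy, so a source with a 1,000-item cap and one with a different cap can be consumed the same way.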
Last but not least, each source system has its own authentication mechanism and API/SDK layer; this means that a read strategy needs to be semantically aware of the source system it abstracts.
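One possible way to express a read strategy is as a small interface with one method for the Initial Sync and one for CDC. The sketch below is an assumption made for illustration: the class names, the `modified` field, and the `supports_initial_sync` flag are hypothetical and do not describe how Enzo Data Pipelines is implemented.

```python
from abc import ABC, abstractmethod
from datetime import datetime
from typing import Iterator, List


class ReadStrategy(ABC):
    """Hypothetical abstraction over one source system."""

    supports_initial_sync: bool = True

    @abstractmethod
    def initial_sync(self) -> Iterator[dict]:
        """Return the full current data set, if the source can provide it."""

    @abstractmethod
    def changes_since(self, checkpoint: datetime) -> Iterator[dict]:
        """Return records modified after the given checkpoint (the CDC read)."""


class InMemoryListReader(ReadStrategy):
    """Stand-in for a source that can return its complete contents, like a list."""

    def __init__(self, rows: List[dict]) -> None:
        self._rows = rows

    def initial_sync(self) -> Iterator[dict]:
        return iter(self._rows)

    def changes_since(self, checkpoint: datetime) -> Iterator[dict]:
        return (row for row in self._rows if row["modified"] > checkpoint)


class ForwardOnlyReader(ReadStrategy):
    """Stand-in for a forward-only stream that cannot provide past data."""

    supports_initial_sync = False

    def __init__(self, events: List[dict]) -> None:
        self._events = events

    def initial_sync(self) -> Iterator[dict]:
        return iter(())  # no history is available from this kind of source

    def changes_since(self, checkpoint: datetime) -> Iterator[dict]:
        return (event for event in self._events if event["modified"] > checkpoint)
```

Authentication and SDK calls would live inside each concrete class, which is where the source-specific knowledge belongs.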
Standardized Staging Store
Once the data is read, it needs to be stored in a standardized format so that it can easily be read later. For example, an XML document could be used to store the source data, both the Initial Sync and the CDC data, in a staging environment. While an XML document can carry both the data and the schema (the data types), a JSON document is also a good fit as long as the schema of the source data is stored separately.
An important aspect of the staging store is to use a date/time stamp on the data being read so that it can be later replayed from a specific point in time. As a result, storing staging data in blobs using a timestamp for the name of the blob can help with the replay mechanism discussed later.
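As a rough sketch of timestamp-named staging, the Python below writes each batch to a file whose name is the capture time. It assumes a local directory as a stand-in for blob storage and JSON as the document format; the function names and the timestamp layout are illustrative choices, not part of the original design.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import List

# Fixed-width UTC layout, so sorting names alphabetically sorts them by time.
TIMESTAMP_FORMAT = "%Y%m%dT%H%M%S.%fZ"


def blob_name(captured_at: datetime) -> str:
    """Build a staging blob name from a timezone-aware capture time."""
    return captured_at.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT) + ".json"


def stage_batch(staging_dir: Path, records: List[dict], captured_at: datetime) -> Path:
    """Write one batch of source records to the staging area and return its path."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    path = staging_dir / blob_name(captured_at)
    path.write_text(json.dumps(records), encoding="utf-8")
    return path
```

Because every component of the name has a fixed width, a plain alphabetical listing of the staging area is also a chronological one, which keeps point-in-time lookups simple.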
Decoupled Write Strategies
A write strategy provides a level of abstraction for a destination system; sending data to a database is very different from sending it to an Azure Queue. Certain optimization techniques, such as bulk import, can be used for relational databases.
In addition, since the source data is stored in a staging environment, it is possible to completely decouple the write strategy from the read strategy; in other words, it becomes possible to send the source data to multiple destination systems without affecting the source system. For example, a Twitter feed could be sent to both a database table and an Azure Service Bus. This architecture provides scalability, since any number of destination systems can be added without impacting the source system. As it turns out, this architecture can easily be implemented with Azure Functions, as we will see later.
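The fan-out idea can be illustrated with a minimal Python sketch. The `WriteStrategy` protocol and the two in-memory writers are hypothetical stand-ins for a database table and a queue; they make no real Azure or database calls.

```python
from collections import deque
from typing import Deque, Iterable, List, Protocol


class WriteStrategy(Protocol):
    """Hypothetical abstraction over one destination system."""

    def write(self, records: List[dict]) -> None: ...


class TableWriter:
    """Stand-in for a database table; adds the whole batch at once (bulk-style)."""

    def __init__(self) -> None:
        self.rows: List[dict] = []

    def write(self, records: List[dict]) -> None:
        self.rows.extend(records)


class QueueWriter:
    """Stand-in for a message queue; enqueues one message per record."""

    def __init__(self) -> None:
        self.messages: Deque[dict] = deque()

    def write(self, records: List[dict]) -> None:
        for record in records:
            self.messages.append(record)


def fan_out(staged: List[dict], destinations: Iterable[WriteStrategy]) -> None:
    """Send one staged batch to every destination; the source is never re-read."""
    for destination in destinations:
        destination.write(staged)
```

Adding a destination means adding one more object to the list passed to `fan_out`; nothing on the read side changes.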
Replay Mechanism
Last but not least, a replay mechanism allows you to replay changes from a specific point in time, or from the beginning of time (the Initial Sync). The replay capability allows you to resend the source data, without reading from the source system again (since the data comes from the staging environment). If the data is indexed properly using a timestamp in the staging store, playing back the CDC data becomes relatively easy.
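A replay over timestamp-named staging data might look like the sketch below. It assumes staged batches are JSON files in a local directory, each named with a fixed-width UTC timestamp; the `replay` function and the naming layout are assumptions made for this example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator, Optional

# Assumed naming layout for staged files, e.g. 20240102T030405.000000Z.json
TIMESTAMP_FORMAT = "%Y%m%dT%H%M%S.%fZ"


def replay(staging_dir: Path, since: Optional[datetime] = None) -> Iterator[dict]:
    """Yield staged records in capture order, starting at `since`.

    Passing None replays everything, including the Initial Sync.
    `since` is expected to be a timezone-aware datetime.
    """
    floor = ""
    if since is not None:
        floor = since.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)
    for path in sorted(staging_dir.glob("*.json"), key=lambda p: p.name):
        # File names are fixed-width, so string comparison matches time order.
        if path.stem >= floor:
            yield from json.loads(path.read_text(encoding="utf-8"))
```

The records returned by `replay` can be handed to any write strategy, so a new destination can be brought up to date without touching the source system.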
Data Pipeline Architecture
From an implementation standpoint, each of the considerations above maps to a specific technology component, as shown in the diagram below. This architecture provides both strong decoupling and a blueprint for system scalability.
This blog post provides an overview of data pipelines and discusses the key design objectives of such a system: read and write strategies, a decoupled staging environment, and replay logic for CDC scenarios.
While read and write strategies are mostly concerned with abstracting system APIs, decoupling the write strategy from the read strategy allows the write mechanism to scale independently, without affecting the source system.
This high-level architecture was implemented with a technology called Enzo Data Pipelines, built on top of Enzo Online. You can learn more and even try Enzo Data Pipelines by reading a lab that guides you through the steps of implementing your own data pipeline: https://portal.enzounified.com/Docs/Labs.html#docLab004
For more information contact firstname.lastname@example.org.