Data Distribution - Featured on request

datadoubleconfirm
Apr 25, 2021
12 min read

Updated: May 10, 2021

This is an excerpt from chapter 11 of Data Engineering on Azure, with some edits to keep it self-contained. Notably this article omits some details and setup scripts included in the book to actually implement some of the data sharing patterns described.

---

Also, as part of content partner collaboration, we have a 35% discount code for readers of Data Double Confirm (good for all our products in all formats): nldatadbl21

You can refer to this link: http://mng.bz/w00g.

---

For this article, we’ll be using a very broad definition of data distribution: any consumption of data from our data platform outside of our team. This includes other individuals and teams, and integration with other systems. In many cases, data platforms start with a handful of reports, which can easily be shared via email or a service like Power BI.

This is perfectly fine in the initial stages and no additional engineering is required. Problems arise once the platform grows and more parts of the business take dependencies on it. Figure 1 shows this evolution.

Figure 1: A data platform starts small but over time grows to support more and more reports, websites, supply data to other teams and people. Scaling out becomes an important engineering challenge.

Once we reach a certain scale, we can’t efficiently satisfy all these data needs with a single data fabric: some systems require very low latency (for example serving data to a website), some require high throughput (copying large datasets), while we need to ensure we have enough compute resources to support our internal workloads like data processing, analytics, and machine learning.

At this stage, we need to implement solutions to support downstream consumption without impacting our ongoing workloads. Broadly speaking, there are 2 patterns for consuming data: low-volume/high-frequency and high-volume/low-frequency.

Low-volume/high-frequency data consumption means consuming small amounts of data (usually one or a few records) at very high frequency. This is common for websites.

An example of this is data for a website, for example showing a user their order history. While we may have hundreds of thousands of users, a website just needs to retrieve the order history of one of them at a given point in time (when the user wants to view it). This is a low-volume request, but potentially high-frequency – multiple users can request similar data through the website, so for a high-traffic website, requests could come in very often.

Another example of this pattern is consuming ML model predictions. As users browse our website, we will show them recommendations from our ML model. Again, this is low-volume (predictions for just the current user), but high-frequency (multiple users might be browsing the website at the same time, so many requests can come in a short timespan).

This type of data consumption is best served by a data API, which we’ll discuss in the next section.

The other common type of data consumption is high-volume/low-frequency.

High-volume/low-frequency data consumption means copying large datasets (GBs or TBs of data), at a low frequency (daily, weekly etc.). This is common for downstream data platforms ingesting our data, or downstream systems that want to perform additional batch processing.

If another data team wants to consume our data, they would usually perform this type of load. Similarly, if a team wants to reshape the data before using it, they might want to copy it in bulk first.

For example, as an alternative to our team maintaining an API that integrates with a customer-facing website, the team specializing in running the website might simply copy our data in bulk, then optimize for serving web traffic.

This type of data consumption is best served by sharing data using a storage solution that doesn’t include compute.

Figure 2 shows the two patterns side-by-side.

Figure 2: For low-volume/high-frequency requests, we can share data via a data API our downstream can call. For high-volume/low-frequency requests, we can share data by placing it in storage and letting downstream systems pick it up from there.

There are 2 combinations which we didn’t cover since they are less common: low-volume/low-frequency and high-volume/high-frequency. Low-volume/low-frequency is not a big concern; this usually means occasionally refreshing a report. This can usually be absorbed by our systems at no great cost. High-volume/high-frequency is in the realm of streaming data, usually in the world of IoT sensors or other live signals. In these cases, data would usually be consumed from the event stream directly, which sits upstream of our data platform. There are rare situations in which we would process a large volume of data in real-time and also have to serve it to a downstream system, which requires specialized infrastructure outside the scope of this book. For this type of workloads, I recommend looking into Azure Stream Analytics[1] and Azure Event Hubs[2].

We’ll start by looking at the low-volume/high-frequency scenario and talk about data APIs.

Building a data API

First, let’s define data API. By data API we mean an HTTPS endpoint which clients can call to retrieve data. This can be implemented as an Azure App Service or an Azure Function App. The protocol can be anything – REST, GraphQL, or something else.

Using an API between our consumers and our storage layer brings a set of advantages:

· We can use this middleware to control what data we expose and to whom – In general, even with a solid security model, if we get too granular, it comes extremely hard to maintain (for example having a security group for each table). Someone getting access that way will have access to multiple datasets in the system, which is desirable in some cases, like a data scientist from another team exploring the data and trying to create a new report, but might not be desirable in others. Especially when integrating with other systems, they might start to consume datasets which we weren’t planning to maintain or plan to modify. An API allows us to add another layer of control and abstraction on top of what we expose.

· The API abstracts away the storage – External systems are no longer tightly coupled with our storage layout. We are free to move things around, switch data fabrics, etc.

· We can get great insights on how the data is used – If we simply hand off the data in bulk to another system, there is no way to tell how that system uses our data. With an API, on the other hand, we can keep track of how many records are being requested, who requests them etc. This gives us good insights into how our data is being leveraged.

· We can optimize for low-volume/high-frequency – Since consumers no longer connect directly to our storage layer, we can make things a lot more efficient: we can copy our data to a data fabric better suited for this access pattern and, if needed, throw in a Redis cache.

We have scenarios in which our data is consumed in low-volume/high-frequency – many requests reading small bits of data. A common scenario for this pattern is when serving data to a website or some other service. We can use a data API for these scenarios.

Data APIs allow us to add another layer of control, abstract away storage, add caching if needed, and get insights on who is consuming our data. All this goodness comes at the cost of us having to maintain a web service.

Cosmos DB[3] is a document database service optimize to serve as a backend for an API. Some of its key features include guaranteed uptime and low-latency for queries and turn-key geo-replication – with a simple configuration change, data gets replicated in different regions to support APIs deployed around the globe.

A Cosmos DB account has a set of databases. A database has a set of containers. Containers store documents. Cosmos DB can use multiple query languages to access data, including a subset of SQL, MongoDB and Cassandra APIs, Apache Gremlin (a graph query language) and Azure Table API.

We can use Azure Data Factory to transfer data from our other data fabrics to Cosmos DB and serve it from there. We can use the Cosmos DB SDK to query data from code. The SDK is available for multiple languages, including Python, C#, Java and JavaScript.

[1] https://azure.microsoft.com/en-us/services/stream-analytics/ [2] https://azure.microsoft.com/en-us/services/event-hubs/

[3] https://azure.microsoft.com/en-us/free/cosmos-db/

Sharing data for bulk copy

In the previous section we looked at Cosmos DB, a storage service optimized for serving individual documents with minimum latency. This works great when our data ends up backing a website or feeds into some other service that consumes it item by item.

There is another set of scenarios in which a downstream team or service wants to consume whole datasets from us. This could be another data science team that wants to bring our data to their platform, or it could be a team that wants to do some additional processing and reshape our data before using it for their scenario.

While we can, of course, create an API to serve data in bulk, this is not optimal once we reach a certain scale. In these cases, we don’t need to optimize for low latency, rather we need high throughput. Another key factor, which we alluded at before, is we don’t want to share compute resources with the downstream team. Let’s expand on that.

Separating compute resources

Network, storage, and compute are a common way to categorize cloud service based on their function. Storage is all about data at rest – maintaining data in various formats. Compute is all about processing data and performing calculations.

Some services we look at throughout the book are purely storage, like Azure Data Lake Storage[1]. Some are purely compute, for example Azure Databricks[2] and Azure Machine Learning[3] can connect to storage services but don’t store data themselves, rather they provide capabilities to process data. Others have a combination of storage and compute, like most database engines. Azure SQL, Cosmos DB, and Azure Data Explorer[4] are examples of this: they both store data within the service and provide processing capabilities (we can run queries, joins etc. over the stored data).

From the perspective of sharing data with other teams, purely storage solutions are the easiest: if we have a big dataset in Azure Data Lake Storage, we can grant permissions to the downstream principals to read the data and we’re done.

Things get a bit more complicated for services that include compute. Let’s take Azure Data Explorer as an example. Azure Data Explorer is deployed on a set of virtual machines, the cluster’s SKU. We specify the size of VM we want and the number of nodes. We don’t have to deal ourselves with VM management, the Azure Data Explorer handles this for us, but notice there is a set of compute resources we are dealing with. If we simply grant access to other teams to our cluster, when their automation tries to retrieve a large data set, this can impact other queries running on the cluster. There is a limit on how much CPU we have available and how much bandwidth to move data. An expensive query can eat up enough resources on the cluster to make everyone else’s queries run slow or timeout. That’s why we don’t want to share data directly from this type of services.

If our data service includes both storage and compute, if we need to share data, we must ensure not to also share compute.

We have two ways to achieve this. One is to simply copy the data to a better-suited storage. For example, we can copy the data from Azure Data Explorer to a Data Lake. Of course, this copy can also impact other processing running on the cluster, but now this is fully in our team’s control. We can ensure this copy happens at a time where the cluster isn’t under heavy load, and we can ensure we don’t do it more often than necessary.

Another way to ensure we don’t share compute resources is to use a replica. Most database engines support this in one shape or another. A replica is a copy of the data but copying and ensuring consistency is handled by the database engine itself. This is highly optimized since the database engine knows how to best create this. Instead of us having to create an Azure Data Factory pipeline, we can simply configure our database to provision a replica.

Azure Data Explorer allows a cluster to follow databases from other clusters. A follower database is a read-only replica of the leader database. Data gets replicated automatically, with low latency. Figure 3 shows a cluster following a database from another cluster.

Figure 3: An Azure Data Explorer cluster contains a set of databases, including a leader database. Another cluster contains another set of databases, including a follower of the leader database, which is a read-only replica. Replication is handled automatically.

When we are sharing data for bulk copy, an API is not the best option: we want to serve it from a service optimized around reading large amounts of data. While we could share our data platform storage directly, this is not always optimal. It works for storage services that don’t involve compute, like Azure Blog Storage and Azure Data Lake Storage, where we can grant permissions to other teams to read the data.

If we do the same thing for storage solutions that include compute, like Azure SQL or Azure Data Explorer, we run the risk of external queries (queries outside of our control) interfering with the workloads running on our platform. To avoid this, we need to offload external queries to replicas. For example, Azure Data Explorer supports follower databases. These are read-only replicas of a database queried on another cluster’s compute.

Sharing data includes granting permissions, provisioning resources for the receiver etc. Azure Data Share[5] is an Azure service that specializes in sharing data regardless of data fabric. An Azure Data Share account can send and receive shares, both in-place or a snapshot copy (depending on the data fabric). Shares consist of one or more datasets. Azure Data Share offers a great abstraction over multiple data fabrics and is the recommended way to share data for bulk copy scenarios.

[1] https://azure.microsoft.com/en-us/services/storage/data-lake-storage/ [2] https://azure.microsoft.com/en-us/services/databricks/ [3] https://azure.microsoft.com/en-us/services/machine-learning/ [4] https://azure.microsoft.com/en-us/services/data-explorer/

[5] https://azure.microsoft.com/en-us/services/data-share/

Data sharing best practices

Before wrapping up, let’s cover a few best practices to keep in mind beyond the low-volume/high-frequency and high-volume/low-frequency patterns we covered through this article.

There is one important tradeoff to keep in mind which we haven’t discussed so far: network costs. Besides storage and compute, network is another important piece of the cloud infrastructure, and is not free. Moving large volumes of data through our platform incurs costs in terms of network bandwidth. Keep this in mind when creating a new ETL pipeline, as an additional data point to evaluate your architecture. It is great to have a dataset in multiple data fabrics, each optimized for a particular workload, but copying that dataset around is not free. Don’t think of this as going against what we just covered in this article, rather that this is an opposing force which creates tension in our system and something we shouldn’t completely ignore.

Don’t ignore the cost of copying data. This creates tension with replicating a dataset across different data fabrics to optimize for different workloads.

Another best practice to keep in mind is to not provide pass-through data. What we mean by this is our data platform will ingest datasets from upstream. If a dataset is available upstream, we shouldn’t share it ourselves with a downstream team, rather point them to the source. If we don’t enhance the data in any way, our data platform simply becomes an extra hop, potential source of failure, and introduces latency. We end up supporting an additional scenario with no real added value. For these situations, when there is a data request, redirect it to the upstream team that provides the same dataset to our platform.

Don’t share pass-through data: If a dataset is available upstream and we ingest it into our platform, redirect requests for the dataset upstream to the source.

Finally, we have scenarios where a downstream team doesn’t want to yet commit to consume data from our platform, but they would still like access to it. For example, they want to create a prototype before committing to supporting a production ETL pipeline, or they want to simply create a couple of Power BI reports, for which it is not worth copying large datasets around.

For these situations, we can still isolate the compute to protect the core workloads of our data platform: We can create a replica just as if we were sharing the data with another team, but we can manage the replica and grant access to it to multiple other teams. Figure 4 shows the setup for Azure Data Explorer.

Figure 4: We have a main Azure Data Explorer cluster for our platform’s workloads. We maintain a low-end cluster for data exploration. This protects our workloads from external queries. Other teams can explore our datasets in the low-end cluster for non-production scenarios. For production scenarios, other teams bring their own cluster to which we can attach databases using Azure Data Share.

In this case, we maintain the cluster dedicated to exploration, but we can scope this to a low-end, cheap SKU, since it isn’t meant to support business-critical processes. If a team connecting to it realizes they need more compute or support a production scenario, we can transition them off this replica and to a data share.

A low-end SKU replica for exploration protects our compute workloads and provides a stepping stone towards a data share scenario.

These best practices should help optimize the cost of distributing data and avoid unnecessary data movement.

Summary

· Distributing data is a data engineering challenge for a big data platform.

· We need to ensure the compute workloads of our data platform aren’t impacted by downstream data copy.

· For low-volume/high-frequency data requests, we can use a data API to share data.

· Cosmos DB is a document database storage solution optimized to serve as an API backend.

· For bulk data copy, we can either share data from storage accounts (no compute) or provision database replicas.

· Azure Data Share is an Azure service specializing in sharing data, using common concepts (share, dataset, invitation) regardless of the data fabric.

· Always keep in mind the cost of copying data around and avoid pass-through sharing (data that is also available upstream).

· A low-end replica is a good way to enable other teams to explore the datasets available in your data platform without impacting other workloads and with a small additional cost.

If you want to learn more about the book, check it out on Manning’s browser-based liveBook platform here.