Data integration in Azure has been made simple through products like Azure Data Factory (ADF) and Synapse Pipelines, but when you’re working with a secure network topology, things can quickly become complicated! In this series we’ll look at how you can use different ADF and Synapse integration runtimes to enable secure data movement for your data platform.
In Part 1 we will cover the three most common architecture patterns:
- Azure Integration Runtime with Managed Virtual Network
- Self Hosted Integration Runtime
- Hybrid of (1) and (2)
In Part 2 we will discuss a much more recent architecture pattern that doesn’t seem to be that widely known. There is a Microsoft Docs article about it, but other than that I couldn’t find anything else out there, so I’m offering up my own explanation of how it works!
This post is aimed at people who have some basic experience with either ADF or Synapse Pipelines and a basic understanding of Azure network components, but who don’t necessarily understand how they all come together.
Why am I lumping ADF and Synapse Pipelines together?
Synapse Pipelines is exactly the same product as ADF - just repackaged, so everything I’m going to talk about in this post will be relevant to both ADF and Synapse Pipelines. I will be using ADF in all of my examples, but they all apply equally to Synapse Pipelines if that’s your integration tool of choice.
Different types of integration runtimes
There are 3 types of integration runtimes:
- Azure IR
- Self-hosted IR (SHIR)
- Azure-SSIS IR
Let’s cover what they do at a high level.
The Azure IR is the default IR that you get with ADF or Synapse Pipelines. The key features are:
- It is a managed resource, meaning Azure handles all the compute infrastructure for you;
- It’s serverless compute;
- It has elastic scaling so you don’t need to worry about specifying the size of the IR;
- There is an option to enable a Managed VNet for the IR, and use Managed Private Endpoints to allow your IR to talk to your Azure services inside private networks.
A self-hosted IR (SHIR) is installed on your own machine. The key features are:
- It can be installed on an on-prem machine or a VM;
- There are high availability options;
- You are responsible for the compute infrastructure;
- You are responsible for the networking of the compute resource.
The Azure-SSIS IR is simply used for executing SSIS packages. Since we’re not focussing on SSIS, I won’t be covering this IR in this post.
I’m going to cover 3 secure architecture patterns using IRs (and a fourth one in part 2). We’ll discuss the pros and cons of each approach, and at the end I’ll explain the decision-making process behind selecting the right IR for your scenario. If you want to skip ahead to the end you can do so here.
Scenario 1 - Azure IR with Managed VNet
Consider a simple data platform in Azure consisting of a SQL Server, a Data Lake, and an Azure Data Factory. If we are wanting to secure this platform, we would likely have a VNet containing a subnet which holds the Private Endpoints for our resources. These Private Endpoints provide us with a secure way of accessing our data. This architecture would look something like this:
In this scenario, we are going to use an Azure IR with the Managed VNet enabled. Since these are managed resources, they will sit outside of our platform resource group and be managed by Azure. We will also need to create two Managed Private Endpoints (one for the SQL Server and one for the Data Lake) so that our Azure IR can securely access our data. This enhances our architecture to something like this:
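To make the two Managed Private Endpoints a bit more concrete, here is a minimal Python sketch that builds one definition per target, in the shape I understand the ADF managed private endpoint resource to use (a `privateLinkResourceId` plus a `groupId`). The subscription ID, resource group and resource names are hypothetical placeholders, and the group IDs `sqlServer` and `dfs` are my assumption for an Azure SQL logical server and a Data Lake (ADLS Gen2) account, so check them against your own resources before relying on them.

```python
import json

# Hypothetical identifiers, for illustration only.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "rg-data-platform"


def resource_id(provider: str, resource_type: str, name: str) -> str:
    """Build an ARM resource ID for a resource in the platform resource group."""
    return (
        f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"
        f"/providers/{provider}/{resource_type}/{name}"
    )


def managed_private_endpoint(name: str, target_id: str, group_id: str) -> dict:
    """Return a managed private endpoint definition pointing at one target."""
    return {
        "name": name,
        "properties": {
            "privateLinkResourceId": target_id,
            "groupId": group_id,
        },
    }


endpoints = [
    managed_private_endpoint(
        "mpe-sql",
        resource_id("Microsoft.Sql", "servers", "sql-platform"),
        "sqlServer",  # assumed sub-resource for an Azure SQL logical server
    ),
    managed_private_endpoint(
        "mpe-datalake",
        resource_id("Microsoft.Storage", "storageAccounts", "stplatformlake"),
        "dfs",  # assumed sub-resource for the Data Lake (ADLS Gen2) endpoint
    ),
]

# Print the two definitions as JSON so they are easy to inspect.
print(json.dumps(endpoints, indent=2))
```

Bear in mind that a Managed Private Endpoint typically sits in a pending state until the connection is approved on the target resource, so creating the definition is only half of the job.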
Scenario 1 Pros
- Since the VNet is managed, you are not responsible for configuring network connectivity for ADF to access your resources. Yes, you need to create the Managed Private Endpoints, but all the network configuration and funky DNS stuff is handled by Azure.
- You don’t need to maintain any firewall rules to ensure your IR can access your resources.
- There is a nice UI for creating and monitoring your Managed Private Endpoints.
- You don’t need to configure and maintain the runtime server - Azure handles that! Plus you get the extra benefit of auto-scaling.
Scenario 1 Cons
- In this scenario you don’t have any control over your compute resource or the address space of the network it’s deployed into. Now I would argue this is a good thing - it’s less responsibility and hassle for you, however, some stakeholders may not understand that less control != less secure, so you might get some push back on this. Something to be aware of.
- Your IR compute resource will be deployed in the same region as your ADF resource, so this could be an issue if you have certain governance and compliance requirements about where you’re processing your data.
- As you can see from our diagram, we are still using (unmanaged) Private Endpoints within this architecture. So although this scenario is really nice because you can use Managed Private Endpoints for IR connectivity, you will likely still need to know how to create, configure and maintain Private Endpoints if you’re working within a secure network topology.
- Previously, if you had wanted to use this architecture to connect to on-prem data sources, I would have said it’s not possible, which made it the biggest con of this approach. However, there is now a way of extending this architecture to connect to on-prem, but it’s not pretty and it’s significantly more complex (we’ll cover this in Part 2). Therefore I am still leaving it as a con!
Scenario 1 Summary
In summary, the Azure IR with Managed VNet is good for most architectures. It’s definitely the easiest option to set up and maintain, so I would recommend this option if it satisfies all of your requirements. However, if you require on-prem connectivity, this setup no longer works - or at least it requires some additions.
Scenario 2 - SHIR
In this scenario we will be using the same base data platform architecture, but we’ll be using a SHIR to connect to an on-prem data source. Our architecture now looks like the following:
Let’s explore the extra components we’ve added to the diagram.
Firstly, we’ve installed a SHIR on an on-prem server inside of our corporate network. In order for this server to be able to talk to our Azure resources, we need some connectivity in place between our on-prem network and our cloud environment. In this example I’m using an ExpressRoute but it could be a VPN.
Now the ExpressRoute is a tunnel allowing traffic from on-prem to travel to the cloud. It’s not likely that the ExpressRoute will connect directly into our data platform; it’s more likely that you will have a hub-and-spoke architecture, where all traffic entering the cloud does so via the hub network and is controlled by a firewall, which then filters the traffic out to the spokes. Think of the ExpressRoute as a motorway connecting two major cities; it’s not going to take you directly to your house but it will get you close.
In this example I’ve added a hub network and connected this to my spoke network via VNet peering. This completes the route from the on-prem network to our Platform VNet where our Private Endpoints are, i.e. our SHIR server can now access our resources.
Speaking of Private Endpoints, you may have noticed that we have added an extra one. ADF offers a specific Private Endpoint which can be used for secure connectivity to and from a SHIR. I’ve highlighted this route in the image below.
The icon on the left-hand side of the diagram is for Azure Relay; I won’t go into detail here, but ADF uses Azure Relay for command and control of the SHIR. Since this is a public service, your SHIR server requires outbound internet access on port 443 to Azure Relay in order to complete this architecture. If your SHIR server sits behind a web proxy and you’re having difficulty allowing public traffic out to Azure Relay whilst also sending private traffic over the VPN to your private endpoints, check out my post on how to get that working!
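If you want a quick way to confirm that outbound traffic on port 443 can leave the SHIR server, a small TCP check like the sketch below can help. It only uses the Python standard library, and the function name is my own. Note that a successful TCP connection only proves the port is reachable; it doesn’t prove that a web proxy will allow the full TLS session through.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

You would call it as `can_connect("<your-relay-hostname>")`, where the hostname is a placeholder for whichever Azure Relay endpoints your own SHIR is configured to use.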
Scenario 2 Pros
- In this setup, you have full control over your compute infrastructure, which could be required for regulation and compliance reasons as discussed previously.
- This is a very tried and tested method, so there’s loads of information online if you need support.
- The biggest pro of this architecture is that it offers fairly simple connectivity to on-prem data sources. Sure, it’s a bit more complex than Scenario 1, but we need those extra elements to facilitate that on-prem connectivity.
Scenario 2 Cons
- This architecture requires either: pre-existing network infrastructure e.g. ExpressRoute, or a dedicated network team that can put that infrastructure in place, or lots of your own networking knowledge!
- There will also likely be lots of firewall rules and network config to maintain. For example, there will likely be a firewall in the hub network that needs new rules, and there could be route tables that need updating. You might also have an on-prem firewall that needs new rules.
- If you have high performance requirements, you could require a pretty powerful server for your SHIR, which could drive up cost quickly.
- You are responsible for a lot more things in this scenario compared to Scenario 1. You are responsible for:
  - Network connectivity to on-prem data sources from the SHIR server,
  - Network security of the SHIR server,
  - Network connectivity to Azure services (likely via private endpoint),
  - Provisioning and maintaining the compute resource.
Scenario 2 Summary
In summary, if you require on-prem connectivity then I would recommend this architecture. The architecture pattern covered in Part 2 also facilitates on-prem connectivity, however I would argue that this solution is the easier of the two to both set up and maintain.
Scenario 3 - Hybrid
This scenario has no new components in it; it’s simply a mixture of Scenarios 1 and 2.
On the left-hand side you can see the Azure IR with Managed VNet architecture, and at the bottom you can see the SHIR architecture. Let’s explore when this scenario might be a good option.
You would only consider using this approach if you need on-prem connectivity. If you don’t need on-prem connectivity, then you don’t need to go near the SHIR. Obviously with the hybrid approach you gain all of the pros and cons of each individual approach that we’ve already discussed, since you’re implementing both architectures. However by using both, you can mitigate some of the cons of one approach by using the other one in certain scenarios, and vice versa.
Reasons to use the Hybrid approach
- Save on Cost: When connecting to on-prem data sources, you really only need the SHIR to do the movement of data between on-prem and the cloud, say, into a landing zone. Once your data is in the cloud, using the SHIR to move data around could be overkill. Using the Azure IR with Managed VNet to move your data around in the cloud could allow you to size down your SHIR server and potentially save on cost.
- Concurrency limits: A SHIR has concurrency limitations based on its size. The only way to increase this is to add more RAM and CPU to the server, or deploy another server and utilise the High Availability options. However, using the Azure IR you can take full advantage of the elastic scaling, so it can be a much better option for high-concurrency jobs.
Scenario 3 Summary
Overall I would say the Hybrid architecture is good for when you need on-prem connectivity but also have performance / cost / concurrency requirements.
Conclusion: How to choose the best architecture for your scenario
We have touched on when it’s best to use each of the different architectures we’ve discussed, but here is a flow diagram to summarise the decision-making process.
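As a rough sketch, the recommendations from the three scenario summaries can also be written as a small Python function. The function and parameter names are my own, and the logic is only a condensed reading of those summaries, so treat it as illustrative rather than definitive.

```python
def choose_architecture(
    needs_on_prem: bool,
    has_perf_cost_concurrency_needs: bool = False,
) -> str:
    """Pick one of the three patterns from this post (illustrative only)."""
    if not needs_on_prem:
        # Scenario 1: the easiest option when everything lives in Azure.
        return "Azure IR with Managed VNet"
    if has_perf_cost_concurrency_needs:
        # Scenario 3: SHIR for on-prem to cloud, Azure IR for movement in the cloud.
        return "Hybrid"
    # Scenario 2: on-prem connectivity without the extra requirements.
    return "Self-hosted IR"


print(choose_architecture(needs_on_prem=False))
print(choose_architecture(needs_on_prem=True))
print(choose_architecture(needs_on_prem=True, has_perf_cost_concurrency_needs=True))
```

This deliberately ignores the compliance point from Scenario 1: if regulation means you must have full control over the compute infrastructure, that would also push you towards a SHIR.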
In Part 2 of this blog series I will discuss another architecture for secure data integration using ADF/Synapse Integration Runtimes, and update the decision flow chart accordingly! This one is a bit more complex, and as an architecture pattern it’s definitely less widely used; however, I find it an interesting architecture to talk through nonetheless.