I had a curious issue this week that I could not explain for some time. I eventually got to the bottom of it with some assistance from Azure Support and thought I would share some details in case anyone else comes across the same issue.
The problem:
I had a couple of Virtual Machines in an availability set, both newly deployed Windows Server images. Normally I would use a basic SKU public Load Balancer to NAT a custom RDP port to one or both Virtual Machines and then lock this down with a Network Security Group to restrict access by source address. On this occasion, however, I had deployed and associated a public IP address directly to the NIC on one of the Virtual Machines in order to get them both configured. Everything worked fine on that Virtual Machine, but when I made an internal RDP connection to the second Virtual Machine in the availability set to configure it, I noticed that it could not route out to the Internet.
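For anyone unfamiliar with that usual pattern, a rough Azure PowerShell sketch of it looks something like the below. All of the resource names and the custom front-end port are just placeholder examples, not the names from my environment:

```powershell
# Placeholder names for illustration only: rg-demo, lb-demo, nsg-demo.
# NAT a custom public port through a basic SKU load balancer to RDP (3389) on a VM.
$lb = Get-AzLoadBalancer -Name 'lb-demo' -ResourceGroupName 'rg-demo'

$lb | Add-AzLoadBalancerInboundNatRuleConfig -Name 'rdp-vm1' `
        -FrontendIpConfiguration $lb.FrontendIpConfigurations[0] `
        -Protocol Tcp -FrontendPort 50001 -BackendPort 3389 |
     Set-AzLoadBalancer
# (The NAT rule is then associated with the VM's NIC IP configuration.)

# Lock RDP down so it is only reachable from a known source address range.
$nsg = Get-AzNetworkSecurityGroup -Name 'nsg-demo' -ResourceGroupName 'rg-demo'
$nsg | Add-AzNetworkSecurityRuleConfig -Name 'allow-rdp-from-office' -Priority 100 `
        -Direction Inbound -Access Allow -Protocol Tcp `
        -SourceAddressPrefix '203.0.113.0/24' -SourcePortRange '*' `
        -DestinationAddressPrefix '*' -DestinationPortRange 3389 |
     Set-AzNetworkSecurityGroup
```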
I initially thought this was just an anomaly, as both Virtual Machines shared the same Network Security Group (I always apply this at the subnet level where possible). The Network Watcher tools confirmed the same thing: the second Virtual Machine could not route out to the Internet. I tried several things, like deploying a new NIC and even deploying a new Virtual Machine, all with the same result. At this point I was totally baffled and even deployed a completely new environment (new VNET, new availability set, new pair of Virtual Machines) with the same result. The only workaround was to put a public IP address on the second Virtual Machine as well, but this was something I wanted to avoid; it is always better to use as few public IP addresses as possible.
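For reference, the Network Watcher check I was leaning on is the connectivity check, which can also be driven from Azure PowerShell along these lines (placeholder names again, and the Virtual Machine needs the Network Watcher agent extension installed for this to work):

```powershell
# Placeholder names and location for illustration only.
$nw = Get-AzNetworkWatcher -Location 'uksouth'
$vm = Get-AzVM -ResourceGroupName 'rg-demo' -Name 'vm2'

# Test outbound connectivity from the VM to an Internet endpoint on port 443.
Test-AzNetworkWatcherConnectivity -NetworkWatcher $nw `
    -SourceId $vm.Id `
    -DestinationAddress 'www.bing.com' -DestinationPort 443
```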
I know a little about how outbound routing works in Azure, but whilst troubleshooting I came across the very informative article below, which helped me understand it a lot better. My situation is scenario 3 in the article.
https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections
If a Virtual Machine does not have a public IP address assigned, either directly on the NIC or via a frontend such as an Azure Load Balancer, then Azure dynamically maps its private IP address to a public IP address using SNAT so that it can route out to the Internet. If you are using an availability set, all of the Virtual Machines in the availability set behave as a group in terms of outbound connectivity.
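A quick way to see which address your outbound traffic is actually using is to ask an IP echo service from inside the Virtual Machine and compare it against the public IP addresses you have actually deployed. The snippet below is just an illustrative example (api.ipify.org is one of many such echo services, and the resource group name is a placeholder):

```powershell
# Run inside the Virtual Machine: shows the public address that outbound
# traffic is being SNATed to. api.ipify.org is just one example of a public
# "what is my IP" echo service.
Invoke-RestMethod -Uri 'https://api.ipify.org'

# Compare against the public IP addresses actually assigned in the resource
# group (run from a management workstation with the Az module installed).
Get-AzPublicIpAddress -ResourceGroupName 'rg-demo' |
    Select-Object Name, IpAddress, @{n='Sku';e={$_.Sku.Name}}
```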
The solution:
I mentioned Azure Support helped me out here, so credit where it is due: after consulting with a colleague, my support engineer spotted something they had seen recently, and the article posted above confirms it. The issue was that I had a standard SKU public IP address associated with the single Virtual Machine in the availability set. Normally I would use the basic SKU when assigning public IP addresses unless otherwise required, and in fact I hadn't even noticed* it was a standard SKU until it was pointed out to me.
*I believe the portal has recently changed its default setting to the standard SKU, and I simply hadn't noticed the SKU when deploying.
Once I swapped out the standard SKU public IP address for a basic SKU one, things worked normally and the second Virtual Machine could route out to the Internet without a public IP address assigned to it.
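If you want to make the same swap without touching the portal, a rough Azure PowerShell sketch looks like the below. The names are placeholders, and note that a basic SKU public IP can use dynamic allocation, whereas a standard SKU public IP is always static:

```powershell
# Placeholder names for illustration only.
$rg  = 'rg-demo'
$nic = Get-AzNetworkInterface -Name 'vm1-nic' -ResourceGroupName $rg

# Create a replacement basic SKU public IP address.
$pip = New-AzPublicIpAddress -Name 'vm1-pip-basic' -ResourceGroupName $rg `
        -Location $nic.Location -Sku Basic -AllocationMethod Dynamic

# Point the NIC's IP configuration at the new basic SKU address
# (this detaches the standard SKU public IP, which can then be deleted).
$nic.IpConfigurations[0].PublicIpAddress = $pip
$nic | Set-AzNetworkInterface
```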
Conclusions:
Standard SKU public IP addresses and Load Balancers behave differently to the basic SKU. As the Virtual Machines in your availability set behave as a group, they cannot share a standard SKU public IP address for outbound connectivity unless you frontend them with a standard SKU Load Balancer and define your outbound rules explicitly.
You can of course just use a basic SKU Load Balancer with a basic SKU public IP address where this is suitable.
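If you do need the standard SKU, the explicit outbound configuration mentioned above is an outbound rule on a standard Load Balancer. As a rough Azure PowerShell sketch, assuming an existing standard SKU Load Balancer that already has a frontend IP configuration and a backend pool containing the availability set Virtual Machines (names are placeholders):

```powershell
# Placeholder names for illustration only; requires an existing standard SKU
# load balancer with a frontend IP configuration and a backend address pool
# containing the availability set VMs.
$lb = Get-AzLoadBalancer -Name 'lb-standard' -ResourceGroupName 'rg-demo'

$lb | Add-AzLoadBalancerOutboundRuleConfig -Name 'outbound-all' `
        -FrontendIpConfiguration $lb.FrontendIpConfigurations[0] `
        -BackendAddressPool $lb.BackendAddressPools[0] `
        -Protocol All -IdleTimeoutInMinutes 4 |
     Set-AzLoadBalancer
```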
Another conclusion from this is to be wary of performing production deployments through the Azure Portal: things change a lot, and the default options may well not be what you actually require. Deploying via templates, PowerShell, DevOps pipelines and so on is definitely the better way to build and maintain your Azure environments.