Infrastructure-as-Code (IaC) is great. It allows teams to deploy infrastructure quickly in a consistent and repeatable manner and when coupled with a proper CI/CD process (linting, SAST (Static Application Security Testing), 4-eyes reviews, multi-env, …) it creates a powerful security framework for platform and development teams.
However, as great as IaC is, it does not prevent users from making mistakes: it is easy to expose a service by mistake and thus increase your attack surface, or to deploy it in a way that conflicts with your compliance requirements.
Luckily, cloud providers offer guardrail features to catch those mistakes and to enforce your compliance requirements.
Azure provides this through the Azure Policy feature; for Google it is called Organization Policies and for AWS (Amazon Web Services) it is Service Control Policies. The implementation methods and features differ between cloud providers, but the idea is the same: provide central control over the maximum available permissions for the accounts in your organization.
In this blog post I’ll cover how you can manage these guardrails as-code and how you can code unit-tests to validate them. We will focus on the Azure platform, but it could be easily reproduced on the others.
(All the code in this blog post is available in the following GitHub repository.)
You need to start somewhere, and blank page syndrome can be hard to overcome… My recommendation is to start with the low-hanging fruit: things that you never want to see happening in your production environment, no matter what.
Ideally, you should also pick things that will have little to no impact on your platform and development teams. It’s always frustrating to roll back or to make people angry at the beginning of a new project. Also, keep in mind that you will need the support of these teams to progress on your Cloud-native security journey, so it is best to have their buy-in from day one.
We’ll start with two simple use-cases:
- We’ll forbid the deployment of resources in certain regions.
- We’ll forbid certain source-addresses / destination ports patterns when creating Network Security Groups (NSG).
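To make the second use-case concrete, here is a minimal Go sketch of the kind of pattern we want to reject. The `nsgRule` type, the open sources and the sensitive ports are assumptions picked for illustration; the actual patterns are whatever you put in your policy.

```go
package main

import "fmt"

// nsgRule is a hypothetical, simplified view of an NSG security rule.
type nsgRule struct {
	Direction       string // "Inbound" or "Outbound"
	Access          string // "Allow" or "Deny"
	SourceAddress   string // e.g. "*", "Internet", "10.0.0.0/8"
	DestinationPort string // e.g. "22", "443", "*"
}

// Assumed examples of forbidden patterns, not taken from the real policy.
var (
	openSources    = map[string]bool{"*": true, "0.0.0.0/0": true, "Internet": true}
	sensitivePorts = map[string]bool{"22": true, "3389": true, "*": true}
)

// isForbidden reports whether a rule allows inbound traffic from an open
// source to a sensitive destination port.
func isForbidden(r nsgRule) bool {
	return r.Direction == "Inbound" && r.Access == "Allow" &&
		openSources[r.SourceAddress] && sensitivePorts[r.DestinationPort]
}

func main() {
	rules := []nsgRule{
		{"Inbound", "Allow", "*", "22"},
		{"Inbound", "Allow", "10.0.0.0/8", "22"},
		{"Inbound", "Allow", "Internet", "443"},
	}
	for _, r := range rules {
		fmt.Printf("%+v forbidden=%v\n", r, isForbidden(r))
	}
}
```

This is only a local model to reason about the rule; the enforcement itself stays in Azure Policy.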
Now we need to define where in the accounts hierarchy we want to apply these policies. Azure recommends organizing Subscriptions under Management Groups. In this blog post that’s where we’ll assign them, but they could be just as readily assigned at the Subscriptions level or at the Resource Groups level as well. Assigning policies at the top of the tree means that all the objects downwards will inherit them.
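The inheritance can be pictured with a small Go sketch. The `scope` type and all the names are hypothetical; it only models the idea that a scope is subject to its own policies plus those of every ancestor.

```go
package main

import "fmt"

// scope is a hypothetical node in the hierarchy: a management group,
// a subscription or a resource group.
type scope struct {
	Name     string
	Parent   *scope
	Policies []string // policies assigned directly at this scope
}

// effectivePolicies walks up the tree and collects every policy assigned
// at this scope or at any ancestor, root first.
func effectivePolicies(s *scope) []string {
	if s == nil {
		return nil
	}
	return append(effectivePolicies(s.Parent), s.Policies...)
}

func main() {
	root := &scope{Name: "mg-root", Policies: []string{"denied-regions", "denied-nsg-rules"}}
	sub := &scope{Name: "sub-prod", Parent: root}
	rg := &scope{Name: "rg-app", Parent: sub, Policies: []string{"rg-specific"}}

	// The resource group inherits both policies assigned on the management group.
	fmt.Println(effectivePolicies(rg))
}
```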
Policies can be defined inside the Azure Portal Policy page under “Authoring / Definition” but we’ll do it as-code using Terraform today.
We’ll focus on the policy for forbidden regions here but the code for both is available here.
- As you can see this policy takes parameters (“deniedRegions” array) as input so it’s in fact a template.
- In the policies definition code we have set the “effect” to “deny” which means that resources meeting these specific conditions will be denied at creation-time.
- A smart way to roll out policies is to set them to “audit” mode first to identify the potential impact, and to switch them to “deny” mode afterwards to enforce them.
Now that the policies are created, we need to assign them to our Management Group (“Authoring / Assignments” in the Azure portal). Without an assignment, a policy has no effect.
That’s what we’ll do with the following code (source available here):
- The “deniedRegions” key inside the parameters block is where we set the value that is passed to the policy definition.
- Another useful feature is the ability to define a non-compliance message inside the policy assignment resource. This lets us return a meaningful error message to users when they deploy non-compliant resources, and it gives us the flexibility to define different messages for the different assignments we may create (per subscription, team or resource group).
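To show how the parameter value, the effect and the non-compliance message fit together, here is a simplified local model in Go. The `assignment` type, the region names and the message are assumptions; the real evaluation is done by Azure, not by this code.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// assignment is a hypothetical local model of a policy assignment.
type assignment struct {
	DeniedRegions        []string
	Effect               string // "audit" or "deny"
	NonComplianceMessage string
}

// evaluate reports whether the location is non-compliant. With the "deny"
// effect it also returns an error carrying the non-compliance message;
// with "audit" the finding is reported but nothing is blocked.
// The case-insensitive comparison is an assumption of this model.
func (a assignment) evaluate(location string) (nonCompliant bool, err error) {
	for _, r := range a.DeniedRegions {
		if strings.EqualFold(r, location) {
			if a.Effect == "deny" {
				return true, errors.New(a.NonComplianceMessage)
			}
			return true, nil
		}
	}
	return false, nil
}

func main() {
	a := assignment{
		DeniedRegions:        []string{"eastus"},
		Effect:               "deny",
		NonComplianceMessage: "Deployments in this region are not allowed by the platform team.",
	}
	for _, loc := range []string{"westeurope", "eastus"} {
		nonCompliant, err := a.evaluate(loc)
		fmt.Printf("%s: nonCompliant=%v err=%v\n", loc, nonCompliant, err)
	}
}
```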
Testing the policies
Now that we have created and assigned our policies, we can do a quick test from the Azure portal by trying to create a new NSG rule that violates the policy or by deploying a resource in a forbidden region.
We can see that our policies prevented the deployment of the resources, and that it did so during the evaluation process of the resource creation. This is quite powerful because it will warn users even before deploying the offending resources, instead of crashing with an error message at deployment-time.
The same happens if we try to deploy the resource as-code: we find the non-compliance error message inside the response body.
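When consuming that response from code, a small helper can tell a policy denial apart from other failures. The sketch below assumes the usual Azure Resource Manager error envelope (`error.code` and `error.message`) and the `RequestDisallowedByPolicy` code; the sample body is made up for illustration and was not captured from a real deployment.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorBody mirrors the part of the error envelope used here.
type errorBody struct {
	Error struct {
		Code    string `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// deniedByPolicy reports whether body describes a policy denial and
// returns the message found in it.
func deniedByPolicy(body []byte) (bool, string, error) {
	var eb errorBody
	if err := json.Unmarshal(body, &eb); err != nil {
		return false, "", err
	}
	return eb.Error.Code == "RequestDisallowedByPolicy", eb.Error.Message, nil
}

func main() {
	// Illustrative body only.
	sample := []byte(`{"error":{"code":"RequestDisallowedByPolicy","message":"Deployments in this region are not allowed."}}`)

	denied, msg, err := deniedByPolicy(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("denied=%v message=%q\n", denied, msg)
}
```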
A good policy framework must also support exemptions for certain rare use cases. Azure Policy provides that out of the box. We won’t cover exemptions in this blog post, but they follow the same logic as policy assignments: you define a scope, a reason and optionally an expiration date for the exemption, and that’s it.
For more details check the official documentation from Azure.
Unit-testing with Golang
I could have stopped here, but I wanted to go a bit further than that. These policies are working as expected, but I still had the following questions open in my mind. How to make sure that:
- The policy really blocks what it’s supposed to?
- The policy behaves properly over time? For example, if some wrongly scoped exemptions are created?
- The policy does not block something it is not supposed to?
- The expected policy prevents the deployment of a non-compliant resource when there are overlapping policies?
Some of the questions above could be answered by the principle of least privilege, locking in the policies at the highest org-level and granting access only to certain super-admins. With infrastructure as-code, however, it is usually a service account that is used to push the changes on the platform and even with four-eye reviews, a mistake could happen.
To answer all these questions, I decided to implement unit tests. Initially I thought about doing them using Terraform code, but it was a bit cumbersome to catch and parse error messages and would mean that I had to wrap some bash or Golang around the Terraform code to achieve what I wanted to do.
Instead, I decided to implement the resource creation process using Golang code (leveraging the Azure SDK for Golang) and simply define unit tests using the standard testing package.
Here’s an extract of the testing scenarios for the forbidden regions (full code here).
And here’s an extract of the logic to iterate over all the tests (full code here):
Here’s an extract of the testing scenarios for the forbidden NSGs (full code here).
And here is what it looks like when executed (here the NSG tests):
We can see that all the tests successfully passed as expected.
In this blog post we saw how to improve our Cloud-native security by implementing guardrails as-code leveraging Azure Policies.
We also saw how to make sure that these policies remain consistent over time by coding some unit tests in Golang.
It goes without saying that both the Terraform code and the unit tests should run as steps in a Continuous-Integration/Continuous-Deployment (CI/CD) pipeline.
All the examples above are available in the following GitHub repository.
Finally, if you have questions about cloud-native security or security in general, feel free to reach out to us!