An unambiguous best practice for systems is to codify your infrastructure in a repeatable text format, rather than clicks in a UI. The common name for this is Infrastructure as Code. By far, the two most popular ways to do this, especially in AWS, are Terraform and CloudFormation.
I used Terraform for about 3 years at a startup before working for Twitch (AKA Amazon Jr), where I used Terraform very heavily before the company pushed hard to switch to all things Amazon, including CloudFormation. I tried very hard to develop best practices around both and used each for very complicated operational setups that spanned organizations. After thinking hard about the switch from Terraform to CloudFormation, I feel strongly that Terraform is probably the right choice for your company.
Terraform the ugly
Beta software
Terraform is not 1.0 yet, and that is a very legitimate reason not to use it. It has changed a lot since I started using it, and it was very common for terraform apply to stop working after a few years and a few Terraform upgrades. I feel like “now it’s different”, but … doesn’t everyone say that? I’ve agreed with all of Terraform’s backwards-incompatible changes and feel the syntax and resource repository abstractions have gotten to a good place. I really do think it’s different this time, but … :-0
On the other hand, AWS has done a really good job maintaining backwards compatibility. This is probably because their services often get dogfooded quite a bit internally before being renamed and made public. “Good job” is probably an understatement. It is ridiculously difficult to maintain backwards-compatible APIs for a system as diverse and complicated as AWS. Anyone who has maintained a widely used public API has to respect how difficult a job that is over so many years. I’ve never had a situation where CloudFormation behavior changed years later.
Foot … meet gun
As far as I know, it is impossible to delete another CloudFormation stack’s resource from your own CloudFormation stack. This is almost true for Terraform. Terraform allows you to import existing resources into your own stack. This is actually a really awesome feature, but with great power comes great responsibility. Once the resource is in your own stack, it’s possible to modify or delete it while working on your stack. This isn’t a hypothetical problem, either. At Twitch, the site actually did have an issue once where someone, working in good faith, imported someone else’s AWS security group into their own Terraform stack by accident. A few commands later, and the security group (and all inbound traffic) were gone.
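As a rough illustration of both the import and one possible guard rail, here is a minimal sketch. It assumes Terraform 1.5 or newer, which added import blocks (older versions use the terraform import command instead). The security group ID, resource name, and variable are hypothetical placeholders. Note that prevent_destroy only blocks plans that would destroy the resource while this block is still in the configuration; it does not stop in-place modifications.

```hcl
variable "vpc_id" {
  type        = string
  description = "VPC that contains the existing security group (hypothetical input)."
}

# Adopt an existing security group into this stack.
# The ID is a placeholder; confirm who owns the real resource before importing.
import {
  to = aws_security_group.shared_ingress
  id = "sg-0123456789abcdef0"
}

resource "aws_security_group" "shared_ingress" {
  name        = "shared-ingress"
  description = "Inbound rules shared by several services"
  vpc_id      = var.vpc_id

  lifecycle {
    # A plan that would destroy this resource fails with an error instead.
    prevent_destroy = true
  }
}
```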
Terraform the Great
Recovery from incomplete states
Sometimes CloudFormation is unable to fully transition from one state to another. When it cannot continue, it will try to revert to the previous state. Unfortunately, this isn’t always possible. When it’s not, fixing it can be a bit scary, as you’re not totally sure CloudFormation will be happy with whatever hacks you use to resolve it. CloudFormation is also bad at detecting when it is impossible to transition back to an old state and, by default, will hang for hours waiting for something that can never happen.
Terraform tends to recover from incomplete state transitions more gracefully and gives you the advanced tools you need to fix your state to what you expect.
More clearly document state changes
Yes load balancer, you’re changing. But how?
— worried engineer about to click “accept”
Sometimes I would need to do things to a load balancer in a CloudFormation stack, like add a port number or change a security group. CloudFormation gives very little information about what exactly is changing. Worried, I would check my YAML file 10 times to make sure I didn’t delete or add the wrong thing.
Terraform is much more transparent about what is changing. Sometimes it is a bit too transparent: i.e. overwhelming. Luckily, the latest version of Terraform has included better diff output to see exactly what’s changing.
Flexibility
Write all software assuming you’ve gotten everything wrong.
Unambiguously, the most important long term trait of good software is the ability to adapt to change. Write all software assuming you’ve gotten everything wrong. A common problem for me was that I would start with a “simple” service and decide to put everything inside a single CloudFormation or Terraform stack. Of course, months later I would realize I got it wrong and this service isn’t simple at all! I now need to somehow split the previously large stack into smaller parts. With CloudFormation, this is impossible without recreating my existing stack, which I’m absolutely not doing to my databases. With Terraform, I was able to do this surgery and break the stack down into smaller, easier to understand parts over time.
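One way this kind of surgery can look today is sketched below. It assumes Terraform 1.1 or newer, which added moved blocks; on older versions the same effect needs the terraform state mv command, and splitting a stack into entirely separate state files still requires state commands. The resource address and module path are hypothetical.

```hcl
# Before the refactor, the database was declared at the top level of the stack
# as aws_db_instance.main. The moved block tells Terraform that the same
# real-world database is now tracked at a new address inside a module, so the
# next plan shows a move rather than a destroy followed by a create.
moved {
  from = aws_db_instance.main
  to   = module.database.aws_db_instance.main
}

module "database" {
  # Hypothetical local module that now contains the aws_db_instance.main block.
  source = "./modules/database"
}
```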
Modules in git
Terraform code is much easier to share between multiple stacks than CloudFormation. With Terraform, I can place my code in a git repository and use semantic versioning to reference it. Anyone with access to that git repository can reuse this common code. CloudFormation’s equivalent is putting the file inside S3, which gives up all the familiar benefits that make us store code in git rather than S3 in the first place.
As the organization grew, the ability to share common stacks became critical. This is such an easy and natural process in Terraform, while CloudFormation makes you jump through hoops to get something similar working.
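For readers who have not seen it, referencing a shared module from git looks roughly like the sketch below. The repository URL, subdirectory, tag, and input name are all hypothetical; only the git:: source syntax and the ref query parameter are Terraform's own.

```hcl
module "service_alarms" {
  # "//alarms" selects a subdirectory of the repository.
  # "?ref=v1.4.0" pins the module to a git tag, so upgrades are explicit.
  source = "git::https://git.example.com/infra/terraform-modules.git//alarms?ref=v1.4.0"

  # Hypothetical input defined by the shared module.
  service_name = "checkout"
}
```

Moving to a newer release is then a one-line change to the ref, reviewed like any other code change.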
Operations as code
We’ll just script it
— Engineer 3 years away from reinventing Terraform
There’s so much more than the code in your Go or Java program that goes into Software Engineering.
There’s also the infrastructure that it runs on.
But how does it get there? How do you monitor it? Where does your code live? Are there access permissions for engineers?
Being a software engineer is so much more than writing code
You probably use some service provider besides AWS. Maybe it’s SignalFx or PagerDuty or GitHub. Maybe you have an internal Jenkins server for CI/CD or internal Grafana dashboards for monitoring. Every reason people choose infrastructure as code is just as valid for everything else that goes into software.
When I was working at Twitch, we would spin up services inside Amazon’s mixed internal and AWS systems. There was a lot of operational overhead in quickly creating and maintaining multiple microservices. Conversations would go something like this.
- Me: Geez, that’s a lot of steps to spin up a microservice. I have to use this thing to make an AWS account (we were moving to 2 AWS accounts per microservice), and this thing to set up alerts, and this thing to set up my code repository, and this thing to set up e-mail lists, and this …
- Lead: We’ll just script it.
- Me: Ok, but I’m sure that script itself will change. You’ll want some way to make sure all these internal Amazon systems have a state that is up-to-date.
- Lead: Sounds good. We’ll make a script for that.
- Me: Great! That script will probably have parameters that need to be passed in.
- Lead: Of course, the script will take parameters.
- Me: This setup may change in backwards incompatible ways. We may want semantic versioning somehow.
- Lead: Great idea!
- Me: People may modify these tools by hand inside the UI. We’ll want some way to audit and correct that.
- … 3 years later
- Lead: And now we have Terraform.
The moral of this story is that even if you’re literally Amazon, you still have services outside AWS with state that could use a configuration-style language to keep that state in sync.
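To make the idea concrete, here is a minimal sketch of one stack managing things outside AWS. It assumes the community GitHub and PagerDuty providers with credentials supplied through their usual environment variables; the repository and team names are hypothetical.

```hcl
terraform {
  required_providers {
    github = {
      source = "integrations/github"
    }
    pagerduty = {
      source = "PagerDuty/pagerduty"
    }
  }
}

# The code repository for a new microservice (hypothetical name).
resource "github_repository" "service" {
  name        = "checkout-service"
  description = "Source for the checkout microservice"
  visibility  = "private"
}

# The on-call team that owns the service (hypothetical name).
resource "pagerduty_team" "service" {
  name = "checkout-oncall"
}
```

Hand edits made in either UI then show up as a diff on the next plan, which is the audit-and-correct step from the conversation above.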
CloudFormation Lambda vs Terraform git modules
The CloudFormation solution to custom logic is Lambda. You can use Lambda functions to create a macro or a custom resource. This approach presents extra complications that don’t exist in Terraform’s approach of semantically versioned git modules. The most immediate problem for me was managing permissions to all these custom Lambdas across dozens of AWS accounts. The second was the chicken-and-egg problem that Lambda code presents: the Lambda itself is also infrastructure and code that needs to be monitored and updated. The final nail in the coffin was the difficulty of semantically versioning changes to the Lambda’s code and ensuring that a stack’s actions wouldn’t change between runs without direct involvement.
I remember once I wanted to create a canary deploy for my Elastic Beanstalk environment behind a classic load balancer. The easiest way to do this was a second EB deployment next to my production deployment, with an extra step of associating the canary deployment’s auto scaling group with the production deployment’s LB. Since Terraform exposes Beanstalk’s ASG as an output, it’s an extra 4 lines of code in Terraform to do this. When I asked for a comparable solution in CloudFormation, I was pointed to an entire git repository with a deployment pipeline and everything else: all just to make a Lambda that could eventually do these 4 lines of Terraform.
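The original four lines are not shown, but a plausible reconstruction looks like the sketch below. It assumes two aws_elastic_beanstalk_environment resources named canary and production are declared elsewhere in the same stack (both names are hypothetical), and it relies on the autoscaling_groups and load_balancers attributes as exported by the AWS provider; check them against the provider version you use.

```hcl
# Attach the canary environment's auto scaling group to the production
# environment's classic load balancer, so canary instances receive a share
# of production traffic.
resource "aws_autoscaling_attachment" "canary" {
  autoscaling_group_name = aws_elastic_beanstalk_environment.canary.autoscaling_groups[0]
  elb                    = aws_elastic_beanstalk_environment.production.load_balancers[0]
}
```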
Better drift detection
Verify reality matches expectation
Drift detection is a very powerful feature of operations as code because it verifies that reality matches expectation. You can do this with either CloudFormation or Terraform. In my experience, CloudFormation’s drift detection gave too many false positives as the operational stack grew.
With Terraform, you have much more advanced lifecycle hooks that make drift detection practical. For example, you can set ignore_changes on an ECS service’s task definition if you want to ignore changes to that particular attribute without ignoring changes to your ECS deployment as a whole.
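A minimal sketch of that pattern follows, assuming the task definition is updated by a deploy pipeline outside Terraform. The service name and both variables are hypothetical.

```hcl
variable "cluster_arn" {
  type        = string
  description = "ECS cluster that runs the service (hypothetical input)."
}

variable "initial_task_definition_arn" {
  type        = string
  description = "Task definition used when the service is first created."
}

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = var.cluster_arn
  task_definition = var.initial_task_definition_arn
  desired_count   = 2

  lifecycle {
    # Deploys outside Terraform change the task definition; do not report
    # that as drift. Every other attribute is still compared on each plan.
    ignore_changes = [task_definition]
  }
}
```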
CDK and the future of CloudFormation
CloudFormation is very difficult to manage at a large, cross-infrastructure scale. A lot of this difficulty is implicitly acknowledged by the strong need for things like aws-cdk, a framework to define cloud infrastructure in code and provision it through AWS CloudFormation. It will be very interesting to see where aws-cdk goes in the future, but it will be difficult for it to compete with the other advantages Terraform has without major improvements to CloudFormation itself.
Cures to Terraform frustrations
It’s “Infrastructure as CODE” not “Infrastructure as text”
My first impressions of Terraform were pretty bad. I think this came from misunderstanding how to approach Terraform. Most engineers who start working with Terraform see it, unintentionally, as a text format that they need to massage until they eventually get the infrastructure they want. DO NOT DO THAT.
Universal truths of good software engineering translate to Terraform
I’ve seen many practices that are universally accepted for good code just ignored in Terraform. You’ve spent years learning to be a good programmer. Don’t throw all that away just because you’re using Terraform. Universal truths of good software engineering translate to Terraform.
Would you not document code?
I’ve seen huge Terraform stacks lacking all documentation. Would you write pages of code without any documentation at all? Add documentation explaining your Terraform code (special attention to the word code), why that section is important, and what you’re trying to do.
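As a small illustration, documentation in Terraform can be as simple as comments that explain the why plus a description on every variable and output. The names and the threshold below are hypothetical.

```hcl
# Consumers occasionally stall during deploys. This threshold is meant to be
# high enough to ride out a normal deploy and low enough to page on-call
# before messages start to expire.
variable "max_queue_depth" {
  type        = number
  description = "Messages allowed to wait in the queue before the alarm fires."
  default     = 1000
}

output "alarm_threshold" {
  description = "Queue depth at which the on-call alarm fires."
  value       = var.max_queue_depth
}
```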
Would you deploy a service that was one huge main() function?
I’ve seen very complex Terraform stacks represented as a single module. Why do we not deploy software like that? Why do we break large functions into smaller functions? The same reasoning applies to Terraform. If your module is too large, that’s a sign you should break it into smaller modules.
Would your company not use libraries?
When people wanted to spin up a new project using Terraform, I saw engineers copy/paste huge chunks from another project into their own and hack on it until it worked. Would you do that at your company for “real” code? There’s a reason we use libraries. Of course, not everything needs to be a library. But simply never using shared libraries?
Would you not use PEP8 or gofmt?
Most languages have a standard, accepted formatting scheme. Python has PEP8. Go has gofmt. Terraform has one too: terraform fmt. Use it!
Could you use React without knowing JavaScript?
Terraform modules can abstract some of the complexity of infrastructure, but they don’t wash away your need to understand the infrastructure you’re creating. If you expect to use Terraform correctly without understanding the resources it creates, you are doomed to be unable to adapt your Terraform as time goes on.
Do you code with singletons or dependency injection?
Dependency injection is an accepted best practice for software engineering over singletons. How does this translate to Terraform? I’ve seen Terraform modules that rely on remote state. Instead of writing modules that pull from remote state, write a module that accepts parameters. Then, pass those parameters into the module.
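A minimal sketch of the parameter-passing style is below. The module path, variable names, and IDs are hypothetical, and the two file names in the comments indicate where each piece would live.

```hcl
# modules/service/variables.tf (hypothetical module)
# The module declares what it needs instead of looking it up in remote state.
variable "vpc_id" {
  type        = string
  description = "VPC the service runs in."
}

variable "subnet_ids" {
  type        = list(string)
  description = "Subnets the service's instances are placed in."
}

# main.tf in the calling stack
# The caller decides where the values come from: literals, its own resources,
# or a remote state lookup made once at the top level.
module "service" {
  source = "./modules/service"

  vpc_id     = "vpc-0123456789abcdef0"
  subnet_ids = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]
}
```

Because the module never reaches for global state, the same code can be reused in a test account or an example stack by passing different values.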
Do you write libraries that do 10 things OK or 1 thing really well?
Software libraries work best when they have a limited focus and do that thing very well. Rather than write large Terraform modules that try to do everything, break them into parts that each do one thing well. Then, combine these parts into what you want.
How do you make backwards incompatible changes to libraries?
A shared Terraform module needs a way to communicate backwards incompatible changes to consumers just like regular libraries do. And just like it’s annoying when libraries change in incompatible ways, it’s also annoying when Terraform modules change in incompatible ways. I strongly recommend using git tags and semver when consuming Terraform modules.
Does your production service run on your laptop or in a datacenter?
HashiCorp has tools like Terraform Cloud for running your Terraform. These centralized services make it easier to manage, audit, and approve Terraform changes.
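As a rough sketch, pointing a stack at Terraform Cloud can be a few lines of configuration. This assumes Terraform 1.1 or newer, which added the cloud block; the organization and workspace names are placeholders.

```hcl
terraform {
  cloud {
    # Placeholder organization and workspace names.
    organization = "example-org"

    workspaces {
      name = "checkout-production"
    }
  }
}
```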
Would you not write tests?
Engineers accept that code should have tests but often ignore tests for Terraform. This can be a bit tricky for infrastructure. I recommend “testing” or “example” stacks with your modules that you can verify deploy correctly during CI/CD.
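An example stack can be very small. The sketch below assumes a hypothetical layout where the module lives at the root of its repository and takes a service_name input; a CI job would then run terraform init, apply, and destroy inside the example directory.

```hcl
# examples/basic/main.tf (hypothetical layout inside the module's repository)
module "under_test" {
  # Use the module from the working tree, so CI tests the commit being reviewed.
  source = "../../"

  # Hypothetical input defined by the module.
  service_name = "ci-example"
}
```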
Terraform and microservices
Microservice companies live and die on how fast they can spin up, update, and destroy new microservice operational stacks.
The most common long term frustration I see with microservice architectures is the operational side: not the code side. Thinking of Terraform as a way to automate just the infrastructure side of a microservice architecture limits the true advantages of the system. It’s now everything as code.