An obsessive commitment to automation

I have an unapologetic and relentless obsession with automation. It drives the vast majority of my decision making, in both personal and work contexts, and it shapes the culture of the teams I build.

This is a story of how I identified three manual pain points on a project and completely automated them away, eradicating both human error and human laziness. I’ve often found that the laziest engineers make the best engineers: they’ll find the simplest solution and, if a task is repetitive, automate it so they only have to do it once.

We aim to automate every (sensible) aspect of our systems, so engineers don’t have to read docs to remember how to get a service to production - the pipeline/code just makes it happen.

Automation Problem 1: Not all repos have the same settings #

Once you grow past a handful of repositories, it becomes problematic to keep all the settings the same, and doubly so if you decide to change a setting! We’ve settled on many settings in GitHub on my current project; here’s a quick rundown:

  1. Repo Webhooks
    • We add a specific webhook to every repository to maintain labelling standards (semver labels, and a conforming description)
  2. Repo Labels
    • We add labels to every repo, managing the colours so they’re always consistent (this was added before GitHub supported Organisation Labels, and we’ve kept it so we can exclude one or two repos)
  3. Team permissions
    • All repositories carry the same team permissions in our Org: ReadOnly for read access, Dev for write access, and AllReposAdmin for admins.
  4. Repo Settings
    • We disable issues, projects, and wiki (as we use JIRA for this)
    • We only allow Squash Merge
  5. Branch Protection
    • We use feature branches, with reviewed PRs to merge to master - meaning we need to protect master and enforce reviews.
  6. Signing Protection
    • We enforce GPG signing on every commit, which prevents a PR from being merged if it doesn’t comply

Phew! That’s a lot of settings for every developer to get right when they set up a repository from scratch. Enter the repo-conformity tool. I wrote this tool to automate away all the manual clicking in GitHub’s UI.

At the time we had 60 or so repos, all with slightly different settings that in theory should have been identical. We now have over 200, all with identical settings. It’s now open source too: https://github.com/florx/repo-conformity-enforcer
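The tool itself is linked above; purely to illustrate the idea, here’s a minimal sketch that loops over every repo in an org via the GitHub REST API and patches the repo-settings portion of the list above back into line. The org name is a placeholder, and it assumes a token in a GITHUB_TOKEN environment variable - it’s a sketch of the approach, not the actual implementation.

```python
# Minimal sketch of the repo-conformity idea (not the real tool): walk every
# repository in the organisation and PATCH the settings we want kept identical.
import os

import requests

ORG = "example-org"  # placeholder organisation name
API = "https://api.github.com"
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

# The repo settings from the list above that every repository should share.
DESIRED_SETTINGS = {
    "has_issues": False,
    "has_projects": False,
    "has_wiki": False,
    "allow_squash_merge": True,
    "allow_merge_commit": False,
    "allow_rebase_merge": False,
}


def list_repos(org):
    """Yield every repository in the organisation, following pagination."""
    url = f"{API}/orgs/{org}/repos"
    while url:
        resp = requests.get(url, headers=HEADERS, params={"per_page": 100})
        resp.raise_for_status()
        yield from resp.json()
        url = resp.links.get("next", {}).get("url")


def enforce(repo):
    """PATCH the repository so its settings match DESIRED_SETTINGS."""
    drift = {k: v for k, v in DESIRED_SETTINGS.items() if repo.get(k) != v}
    if drift:
        print(f"fixing {repo['full_name']}: {sorted(drift)}")
        requests.patch(
            f"{API}/repos/{repo['full_name']}", headers=HEADERS, json=drift
        ).raise_for_status()


if __name__ == "__main__":
    for repo in list_repos(ORG):
        enforce(repo)
```

Labels, webhooks, team permissions and branch protection follow the same pattern - each has its own API endpoint, so enforcing them is just more of the same loop.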

Automation Problem 2: CircleCI pipeline drift #

Much like the misaligned GitHub settings above (which may seem like a trivial problem, until someone accidentally commits and pushes to master…), our CircleCI pipelines were all slightly different.

This was surprising to me, given all of our services are written in the same language, built the same way, and deployed to every environment the same way. Once a PR is merged to master, the pipeline sorts out getting that code into production, with only a 1-click-promote step after our preproduction environment.

Engineers were simply copy/pasting the circleci.yml file, using find/replace to make it work with their shiny new service, then customising it to make the build work… If this doesn’t sound scalable or maintainable, that’s because it wasn’t.

Enter circleci-templater. This tool takes an inventory (our list of repositories and what type each one is) and generates an appropriate circleci.yml file based on that info. It also creates a new branch, checks in the change and pushes it - ready for review!

This means every one of our pipelines across our 60+ deployable service repos is identical and operates exactly the same. It also makes it super easy to roll out a new step: we can add new security/vulnerability scans or automated SAST/DAST with only a few lines of change in the template, and the tool will create PRs for every repo that needs changing.
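The templater itself is tightly coupled to our setup (more on that below), but the core idea is straightforward to sketch. Assuming a simple inventory format and hypothetical template names, it boils down to something like this:

```python
# Rough sketch of the circleci-templater idea: read an inventory of repos,
# render a shared Jinja2 template per repo type, and write out a CircleCI
# config ready to be committed on a branch. File names and inventory format
# are hypothetical.
from pathlib import Path

import yaml                  # pip install pyyaml
from jinja2 import Template  # pip install jinja2

# Hypothetical inventory: each repo and what type of service it contains.
INVENTORY = yaml.safe_load("""
repos:
  - name: service-a
    type: api
  - name: service-b
    type: worker
""")


def render_config(repo, template_dir=Path("templates")):
    """Render the shared template for this repo's type into a CircleCI config."""
    template = Template((template_dir / f"{repo['type']}.yml.j2").read_text())
    return template.render(repo_name=repo["name"])


if __name__ == "__main__":
    for repo in INVENTORY["repos"]:
        out = Path("out") / repo["name"] / ".circleci" / "config.yml"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(render_config(repo))
        print(f"rendered {out}")
        # The real tool goes further: it creates a branch, commits the rendered
        # config and pushes it, ready for review.
```

The important part is that the template is the single source of truth: change it once, re-run the templater, and every repo gets a PR with the updated pipeline.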

This one isn’t ready to be open sourced yet as it’s tightly coupled to our specific environment - but I’m working on it.

Automation Problem 3: Forgetting to promote to production #

Remember that “1-click-promote to PROD” step from earlier? Well, that also causes problems: we’ve introduced a manual step! Humans are terrible at both remembering manual steps and executing them. Machines, on the other hand, never forget.

Over time we started to get deployment drift between staging and production, with no visibility into which services had been promoted and which hadn’t. The CircleCI master workflow would tell us (it would be On Hold), but going through every service one by one was monotonous, especially with 60 of them.

Enter prod-release-checker. This little tool compares the prod tag against master to ensure the latest changes have been released. If they haven’t, it reports twice a day how many releases behind a given service is.

The output ends up looking like this: service name, number of releases behind, and the last person to touch the repository. We don’t do blame, but this is an excellent way to keep people accountable for releasing their own changes to PROD. After all, it’s not done until a customer can use it!

service-a is behind by 1 releases (florx)
service-b is behind by 1 releases (some-other-user)
service-c is behind by 3 releases (dependabot)

(Psst, this one is OSS too! https://github.com/florx/github-check-prod-releases - or read more about it https://blog.florxlabs.com/going-fast-and-keeping-track-of-releases)
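Under the hood it’s a straightforward comparison. Here’s a simplified sketch of the idea (not the actual implementation, which is linked above): ask the GitHub compare API how far master is ahead of the prod tag for each service. The real tool counts releases rather than raw commits, and the org, repo list and tag name here are placeholders.

```python
# Simplified sketch of the prod-release-checker idea: compare the prod tag with
# master and report how far behind each service is. The real tool counts
# releases; this sketch counts commits. Org, repos and tag name are placeholders.
import os

import requests

ORG = "example-org"                 # placeholder organisation
REPOS = ["service-a", "service-b"]  # placeholder list of deployable services
PROD_TAG = "prod"                   # placeholder name of the production tag
API = "https://api.github.com"
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}


def check(repo):
    """Return (commits master is ahead of the prod tag, latest commit author)."""
    resp = requests.get(
        f"{API}/repos/{ORG}/{repo}/compare/{PROD_TAG}...master", headers=HEADERS
    )
    resp.raise_for_status()
    data = resp.json()
    last_author = "unknown"
    if data["commits"]:
        author = data["commits"][-1].get("author") or {}
        last_author = author.get("login", "unknown")
    return data["ahead_by"], last_author


if __name__ == "__main__":
    for repo in REPOS:
        ahead, author = check(repo)
        if ahead:
            print(f"{repo} is behind by {ahead} commits ({author})")
```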

Wrap up #

Automation, and building tools to ensure consistency across your different services and infrastructure, allows you and your team to move much faster. If you know everything works in an identical way, you can quickly dismiss many debugging steps when something goes wrong (like a failing CI build).

It also allows the dev team to experiment and introduce new features really quickly. We added Docker vulnerability scanning to our pipelines within an hour, because we could roll out a change to every pipeline in only a few clicks.

Let me know what you’ve automated! - @florx
