Shut DEV down… at night?

Thinking about shutting down your non-production environments at night, but wondering about the benefits?

Before we started shutting down at night, we were spending a small fortune on hosting environments that simply weren’t used during nights and weekends. We also didn’t have a good recovery plan if our infrastructure failed - so we decided to combine the two problems and solve them with one approach.

I’ll talk about what it means, the benefits, and how to get started on your own adventure of shutting down non-prod at night.

[Image: a finger about to press the power button on a MacBook Pro]

What do I mean by nightly shutdowns? #

By nightly shutdowns, I mean terminating all of your non-production cloud servers at a specified time (e.g. 9pm) and then starting new instances again in the morning (7am). You can also leave these environments “off” over the weekend, by simply not starting any instances on Saturday and Sunday mornings.
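As a concrete illustration, here’s a minimal sketch using AWS Auto Scaling scheduled actions (the group name, sizes, and times are illustrative; cron expressions are evaluated in UTC). Scaling to zero every night terminates the instances, and scheduling the start only on weekday mornings keeps the weekend “off” for free:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# 9pm every day: scale to zero, terminating all instances in the group.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="dev-environment",
    ScheduledActionName="nightly-shutdown",
    Recurrence="0 21 * * *",
    MinSize=0, MaxSize=0, DesiredCapacity=0,
)

# 7am Monday-Friday: launch fresh instances. No weekend action is
# scheduled, so the environment simply stays off until Monday.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="dev-environment",
    ScheduledActionName="morning-startup",
    Recurrence="0 7 * * 1-5",
    MinSize=2, MaxSize=2, DesiredCapacity=2,
)
```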

Are you crazy? That’s an insane amount of work per night! #

Who’s got time to manually turn everything off as we leave, only to turn it all back on again as we arrive in the morning? Also, something could go wrong - then we’d be left without a development or test environment, setting the teams back by several days!

Having nightly shutdowns as a goal will help you and your team achieve a few things that I believe are fundamental to running successfully in a cloud environment.

Let’s dissect the quote above, “who’s got time”: nobody should be doing this manually; it should be completely automated. A machine tears it down, and a machine brings it back up. Every day. Exactly the same. This level of automation will enable you to roll out changes to the underlying infrastructure quickly and reliably, knowing that those changes will be applied automatically in the morning.
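If you manage instances with something like EC2 launch templates, that morning rollout can be as simple as publishing a new default template version during the day; the next automated startup then launches from it. A minimal sketch, with an illustrative name and an illustrative change:

```python
import boto3

ec2 = boto3.client("ec2")

# Publish the infrastructure change as a new launch template version.
response = ec2.create_launch_template_version(
    LaunchTemplateName="dev-environment",             # illustrative name
    SourceVersion="$Latest",
    LaunchTemplateData={"InstanceType": "t3.large"},  # the change to roll out
)
new_version = str(response["LaunchTemplateVersion"]["VersionNumber"])

# Make it the default: tomorrow's automated startup launches from it,
# with no manual rollout step.
ec2.modify_launch_template(
    LaunchTemplateName="dev-environment",
    DefaultVersion=new_version,
)
```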

[Image: a yellow digger tearing down a wall, debris flying, against a blue cloudy sky]

“Something could go wrong” - we hope not, but wouldn’t you rather find out that your deployment/automation scripts were broken in your development/testing environments than in production during a disaster recovery scenario? The reality of running in a cloud environment is that all instances are at risk of being removed at any time by the cloud provider, or of crashing due to an underlying hardware failure. It’s important to design for this level of fault tolerance.

Benefits of nightly shutdowns #

There are three main benefits:

  1. Cost saving
  2. Never having to patch instances
  3. Testing fault tolerance

Cost saving #

This is the easiest to quantify, so here’s a graph of our instance costs broken down per hour. You can clearly see the weekday spikes, and the night and weekend gaps where we keep our two environments offline.

[Graph: hourly instance costs, with weekday spikes and flat low spots overnight and at weekends; two colours split the costs by environment]

There are roughly 730 hours in a month, but only 160 working hours (based on an 8-hour day over 20 working days). Even allowing for a bit of flex in the working times (e.g. 7am - 9pm), that’s 280 hours - under 40% of the total monthly hours, and therefore potentially only 40% of the cost!
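Here’s that back-of-the-envelope maths as a quick sketch:

```python
hours_per_month = 730          # ~ (365 days * 24 hours) / 12 months
working_days = 20
hours_per_day = 14             # flexible day, 7am - 9pm

running_hours = working_days * hours_per_day   # 280
print(f"{running_hours}/{hours_per_month} hours "
      f"= {running_hours / hours_per_month:.0%} of the always-on cost")
# -> 280/730 hours = 38% of the always-on cost
```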

Never having to patch instances #

One of my philosophies in life is never to patch instances, so we don’t. We terminate and launch new instances every day on non-production (and roll weekly on production). This also means we always get the latest image that’s available (ready with the latest updates and security fixes).

Patching a running instance in place can be risky and time consuming; if you’re updating the kernel, sometimes this will require a full restart. If you’re restarting anyway, why not get a brand new, fresh instance?
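For the “latest image” part, here’s a sketch of looking up the newest Amazon Linux 2 AMI with boto3 (the name filter is illustrative - swap in whichever base image you use), so that tomorrow’s instances come up already patched:

```python
import boto3

ec2 = boto3.client("ec2")

images = ec2.describe_images(
    Owners=["amazon"],
    Filters=[
        {"Name": "name", "Values": ["amzn2-ami-hvm-*-x86_64-gp2"]},
        {"Name": "state", "Values": ["available"]},
    ],
)["Images"]

# CreationDate is an ISO-8601 string, so the lexicographic max is the newest.
latest = max(images, key=lambda image: image["CreationDate"])
print(latest["ImageId"])  # feed this into the launch template for the morning
```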

[Image: a yellow neon “Fresh” sign in a window]

As we also run stateless services, and all the logs are streamed off each server to a centralised service, we don’t need to worry about log rotation or disks filling up. We can terminate the disks at the same time as our instances, and get new ones.

Testing fault tolerance #

Ever heard of Chaos Kong? He’s part of the Simian Army, and is the tool that kills an entire region… whilst we’re not completely simulating that with a nightly shutdown, we are going a significant way towards testing all our startup scripts and plans.

Every day we get to check and test (automatically) that all of our services can survive being completely offline for a number of hours, and then resume working normally. These runs also confirm that everything comes back without any human intervention once the servers start up. We can also take measurements to see whether we’re slowing down or speeding up the boot process over time.
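One way to take those measurements is to publish a boot-duration metric at the end of the startup scripts - a sketch, where the namespace and metric name are assumptions:

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

boot_started = time.monotonic()
# ... startup scripts run here: pull config, start services, pass health checks ...
boot_seconds = time.monotonic() - boot_started

# Publish the duration so the trend is visible over weeks of daily startups.
cloudwatch.put_metric_data(
    Namespace="NightlyShutdown",        # illustrative namespace
    MetricData=[{
        "MetricName": "BootDurationSeconds",
        "Value": boot_seconds,
        "Unit": "Seconds",
    }],
)
```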

I can now sleep at night, knowing that if we need to invoke our disaster recovery procedures, all our startup scripts will work as expected - as they’ve been tested in a real environment every day.

Thinking of getting started? #

[Image: a red and pink neon “Game On” sign, boxed in a neon square, against a black background]

Let me know how you get on! @florx

