tag:blog.florxlabs.com,2014:/sitemapJake Hall2023-12-31T07:49:47-08:00Jake Hallhttps://blog.florxlabs.comSvbtle.comtag:blog.florxlabs.com,2014:Post/how-to-have-multiple-ssh-keys-whilst-using-the-1password-agent2023-12-31T07:49:47-08:002023-12-31T07:49:47-08:00How to have multiple SSH keys whilst using the 1Password Agent<p>I have a bit of a niche use case when using my laptop; I want to be able to access repositories on GitHub.com via SSH using different user accounts.</p>
<p>This requires multiple SSH keys to switch between to access different repositories. The “easy” (but less secure) way is to have both sets of private SSH keys on disk and use SSH config to select the correct identity file.</p>
<blockquote class="short">
<p>Quick note on the examples: they assume you’re using macOS, but they may also apply to Linux/WSL - I haven’t tested it!</p>
</blockquote><h1 id="the-39easy39-but-insecure-way_1">The ‘easy’ but insecure way <a class="head_anchor" href="#the-39easy39-but-insecure-way_1">#</a>
</h1>
<p>Add the following to your <code class="prettyprint">~/.ssh/config</code> file:</p>
<pre><code class="prettyprint">Host github-as-user1.com
  HostName github.com
  User git
  IdentityFile ~/.ssh/user1

Host github-as-user2.com
  HostName github.com
  User git
  IdentityFile ~/.ssh/user2
</code></pre>
<p>Then to test and use in a real repo:</p>
<pre><code class="prettyprint">ssh git@github-as-user1.com
git remote add origin git@github-as-user1.com:user1/private-repo.git
</code></pre>
<p>You should see the following successful result:</p>
<blockquote class="short">
<p>Hi user1! You’ve successfully authenticated, but GitHub does not provide shell access.<br>
Connection to github.com closed.</p>
</blockquote><h1 id="the-secure-way_1">The secure way <a class="head_anchor" href="#the-secure-way_1">#</a>
</h1>
<p>I don’t want my private SSH keys stored on disk without a password, and a passphrase is tedious to type every time. Yes, I could have my SSH agent remember my passphrase - but then I have a data portability issue: my private keys exist on only one machine, and if I lose it I have to create them all over again.</p>
<p>Enter 1Password and the <a href="https://developer.1password.com/docs/ssh/agent/">1Password SSH Agent</a>. There are good instructions from <a href="https://developer.1password.com/docs/ssh/agent/advanced#use-multiple-github-accounts">1Password on multiple GitHub accounts</a>.</p>
<ol>
<li>Add your SSH keys to “Personal Vault” in 1Password (it has to be the personal vault!)</li>
<li>Enable the <a href="https://developer.1password.com/docs/ssh/get-started#step-3-turn-on-the-1password-ssh-agent">1Password SSH Agent</a> in settings</li>
<li>Keep hold of the public key (or download it from the 1Password vault)</li>
<li>Update your <code class="prettyprint">~/.ssh/config</code>:</li>
</ol>
<pre><code class="prettyprint">Host github-as-user1.com
  HostName github.com
  User git
  IdentityFile ~/.ssh/user1.pub
  IdentityAgent "~/Library/Group Containers/2BUA8C4S2C.com.1password/t/agent.sock"

Host github-as-user2.com
  HostName github.com
  User git
  IdentityFile ~/.ssh/user2.pub
  IdentityAgent "~/Library/Group Containers/2BUA8C4S2C.com.1password/t/agent.sock"
</code></pre>
</code></pre>
<p>I found I also had to add the <code class="prettyprint">IdentityAgent</code> line, which isn’t in the 1Password instructions, to make this all work.</p>
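<p>If you want to sanity-check which key SSH will offer for each alias without actually connecting, <code class="prettyprint">ssh -G</code> prints the resolved configuration. A quick sketch (the temporary config path is just for illustration):</p>
<pre><code class="prettyprint lang-bash"># Write a standalone config so the check doesn't depend on ~/.ssh/config
cat > /tmp/gh-multi-config <<'EOF'
Host github-as-user1.com
  HostName github.com
  User git
  IdentityFile ~/.ssh/user1.pub
EOF

# -G resolves the options for a host instead of connecting
ssh -G -F /tmp/gh-multi-config github-as-user1.com | grep -E '^(hostname|user|identityfile) '
</code></pre>
<p>You should see <code class="prettyprint">hostname github.com</code>, <code class="prettyprint">user git</code> and the expected identity file for each alias you query.</p>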
<p>Good luck!</p>
tag:blog.florxlabs.com,2014:Post/aws-managed-prometheus-and-grafana2021-01-09T04:46:36-08:002021-01-09T04:46:36-08:00AWS Managed Prometheus and Grafana<p>In the past few weeks, AWS has released into preview two exciting new services. At our disposal, we now have <a href="https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html">Amazon Managed Service for Prometheus (AMP)</a>, a Prometheus <em>compatible</em> managed monitoring solution for storing and querying metrics at scale. Additionally, we now also have <a href="https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html">Amazon Managed Service for Grafana (AMG)</a>, which, as you would expect, is a fully managed data visualisation service with logically isolated servers. AWS handles all the provisioning, scaling and security automatically.</p>
<p>AMG (Grafana) is currently in limited preview, so I got myself access to it and have had a play to see how both of these new managed services work together for some of the standard ways I see customers using AWS.</p>
<p><a href="https://svbtleusercontent.com/jLXQofQtpuACthKLRrQiSH0xspap.png"><img src="https://svbtleusercontent.com/jLXQofQtpuACthKLRrQiSH0xspap_small.png" alt="Grafana_screenshot.png"></a></p>
<p>It’s worth noting that both of these products are in preview and are therefore only available in certain regions. For AMG (Grafana) this is only <code class="prettyprint">us-east-1</code> and <code class="prettyprint">eu-west-1</code>; for AMP (Prometheus) the list is slightly wider, but still limited to <code class="prettyprint">us-east-1</code>, <code class="prettyprint">us-east-2</code>, <code class="prettyprint">us-west-2</code>, <code class="prettyprint">eu-west-1</code> and <code class="prettyprint">eu-central-1</code>.</p>
<p>My first impressions are that these services will relieve a significant pain point for many teams and customers currently managing Prometheus and Grafana stacks. Running Prometheus can be simple initially, but doing so in a highly available and scalable way can be tricky, especially given it is a stateful application, storing metrics on disk. There are several ways to scale storage, using remote reads and writes with something like <a href="https://thanos.io/">Thanos</a>. Ensuring Prometheus is highly available is especially important when also using it for incident management or using Prometheus Alertmanager to bring attention to failing or degraded system components.</p>
<p>If you expect AMP to be a fully-fledged Prometheus server where you can configure targets to scrape various endpoints, you’ll be disappointed. As AMP is a “Prometheus-compatible” managed service, it provides no such configuration options. The only way to use this service is to push metrics to it. AWS has, however, offered several solutions to this problem.<br>
If you are already running Prometheus yourself somewhere, you can <a href="https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-onboard-ingest-metrics-existing-Prometheus.html">immediately enable remote writes</a> to switch over to the managed service. Using remote writes will <a href="https://prometheus.io/docs/practices/remote_write/">increase memory usage</a> by at least 25%, so if you are close to hitting memory limits, you may wish to scale up before implementing this option. <br>
Otherwise, if you are starting from the beginning, or want to scrape metrics without changing your existing Prometheus setup, the other option is to run the <a href="https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-onboard-ingest-metrics-OpenTelemetry.html">AWS Distro for OpenTelemetry (ADOT) Collector</a>. This collector will scrape using <code class="prettyprint">receivers</code> and forward metrics to AMP using <code class="prettyprint">exporters</code>. There are <a href="https://aws-otel.github.io/docs/getting-started/advanced-prometheus-remote-write-configurations">configuration examples</a> when using AWS EKS that quickly get you started, including an <a href="https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-onboard-ingest-metrics-OpenTelemetry.html#AMP-onboard-ingest-metrics-OpenTelemetry-steps">example starter project</a> to make sure everything is working correctly.</p>
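<p>For reference, the remote-write switch-over is only a few lines of Prometheus configuration. A sketch, with a placeholder workspace URL and region (recent Prometheus versions can sign requests natively with SigV4; otherwise AWS provides a signing proxy to run alongside):</p>
<pre><code class="prettyprint lang-yaml">remote_write:
  # Placeholder AMP workspace endpoint - copy the real one from the AMP console
  - url: https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write
    sigv4:
      region: us-west-2
</code></pre>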
<p><a href="https://svbtleusercontent.com/9f1i4W41kYaRQ6cbME7Vcg0xspap.jpg"><img src="https://svbtleusercontent.com/9f1i4W41kYaRQ6cbME7Vcg0xspap_small.jpg" alt="prom_diagram.jpg"></a></p>
<p>A note on security: AMG (Grafana) <em>requires</em> the use of AWS SSO. Whilst AWS introduced SSO a few years ago, I have yet to see wide adoption with customers. A lack of broad adoption may prove a slight sticking point, as AWS SSO has a couple of requirements. You can read the full list in the <a href="https://docs.aws.amazon.com/singlesignon/latest/userguide/prereqs.html">AWS documentation</a>, but in summary, you need to set up AWS Organisations with all organisation features enabled, not just consolidated billing.</p>
<p>One of the lovely setup features about AMG (Grafana) is that you can elect to integrate a handful of services and allow access via IAM. This setup approach enables fast integration to Amazon CloudWatch, Amazon Elasticsearch Service, AMP (Prometheus), Amazon Timestream, AWS X-Ray, AWS IoT SiteWise and Amazon SNS. We can integrate all of these by checking a few boxes when setting up the workspace. Also, to keep your metrics networking entirely within the VPC, AMP (Prometheus) supports <a href="https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-and-interface-VPC.html">VPC interface endpoints</a>, preventing metrics from being transmitted over the internet. </p>
<p>AMP’s pricing seems hugely competitive, especially when compared to the soft costs of managing and supporting Prometheus in an unmanaged way. AWS has provided an example on the <a href="https://aws.amazon.com/prometheus/pricing/">AMP pricing page</a>. Unfortunately, the same cannot be said for AMG (Grafana): at $9 per active editor per month, the costs will quickly ramp up. A team of twenty engineers able to edit dashboards, plus ten viewers, will mean $230 per month (assuming all are active users). Given Grafana is trivial to set up, and can quickly be managed as <a href="https://github.com/grafana/helm-charts/tree/main/charts/grafana">Infrastructure as Code</a>, it may be more cost-effective to stick with a self-managed service.</p>
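<p>For transparency, here’s the arithmetic behind that $230 figure (assuming the preview prices of $9 per active editor and $5 per active viewer per month):</p>
<pre><code class="prettyprint lang-bash">EDITORS=20
VIEWERS=10
TOTAL=$(( EDITORS * 9 + VIEWERS * 5 ))   # 180 + 50
echo "\$$TOTAL per month"
</code></pre>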
<p>All in all, it is effortless to use AMG (Grafana) and query metrics once they are available in AMP (Prometheus). Any existing metrics in Amazon CloudWatch, Amazon Timestream and others will also be immediately available for querying. Integrating AMP is not as straightforward, and I would recommend using the <a href="https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-onboard-ingest-metrics-OpenTelemetry.html">ADOT Collector</a> if starting to monitor a new stack. If you have an existing stack, trying out remote writes with your current Prometheus server is a slick way to begin publishing metrics to a highly available managed service. However, watch out for <a href="https://prometheus.io/docs/practices/remote_write/">remote write performance</a> issues, and note that you are potentially still vulnerable to a single Prometheus instance failing to scrape. To resolve this, consider running multiple Prometheus servers and utilising <a href="https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-ingest-dedupe.html">AMP deduplication</a>. </p>
<p>In summary, if you’re already using Prometheus, I would recommend investigating AMP for your project immediately. It promises to take away most of the pain points of running a Prometheus server — and seems to deliver on those promises. If you’re already using Grafana, and it is a painful installation, give AMG a go, otherwise, stick with what you’re running right now. </p>
<p>Are you using CloudWatch and aren’t yet using anything like Prometheus or Grafana? Unless you have a specific driver for moving towards using Grafana for presenting metrics or Prometheus for storing CloudWatch metrics, I recommend sticking to using CloudWatch directly.</p>
<p>I want to commend the AWS teams involved with AMP and AMG, as these appear to be excellent services that will significantly improve operational efficiency, letting teams spend more time focusing on delivering value to their end-users.</p>
<p>Give these services a try, and let me know what you thought — <a href="https://twitter.com/florx">@florx</a> </p>
tag:blog.florxlabs.com,2014:Post/introduction-to-information-security2020-07-04T08:36:46-07:002020-07-04T08:36:46-07:00Introduction to Information Security - Video Script<p>I recently published a video (one of my first talking-head style videos) giving a quick introduction to information security.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/Wld5NkMHLv8"></iframe>
<p>Here’s the original script I wrote to help me record the video.</p>
<h1 id="script_1">Script <a class="head_anchor" href="#script_1">#</a>
</h1>
<p>Hey folks!</p>
<p>I wanted to talk security as part of application development today, as one of my favourite questions to ask in full-stack engineer interviews is about broad security understanding, and I often don’t get great replies. </p>
<p>Security for me is everyone’s responsibility. While your company may have a dedicated security team of some sort, the people who are best placed to secure an application or platform are the people working on it every day. </p>
<p>In this series of videos, I’ll dig a little deeper into security as a whole, and what we can do as engineers and leaders, to help continuously improve it.</p>
<p>My background is as a software engineer, and I’m now leading teams to improve and architect more resilient and secure systems in the cloud.</p>
<p>But first, what am I even talking about? What are the foundations of information security?</p>
<h2 id="cia-triad_2">CIA triad <a class="head_anchor" href="#cia-triad_2">#</a>
</h2>
<p>One of the key concepts is the CIA triad.</p>
<p>Confidentiality. Integrity. Availability. </p>
<p>Confidentiality is only making information available to authorised users, or put another way, not making information available to unauthorised users. </p>
<p>Integrity is ensuring the information is accurate and complete, and cannot be modified by unauthorised users.</p>
<p>Availability is ensuring the information can be accessed when it is needed. Unauthorised users should not be able to prevent users from accessing data.</p>
<p>This is a useful and key concept when thinking about information, but it’s also important to remember it is focussed fairly heavily on data. For example, an unauthorised user could misuse your hardware for Bitcoin mining - which would also be a security issue, but not one strictly covered by the CIA triad.</p>
<h2 id="security-balancing-risk_2">Security balancing risk <a class="head_anchor" href="#security-balancing-risk_2">#</a>
</h2>
<p>Information security in software systems is all about identifying vulnerabilities and threats to an environment, and deciding what, if any, countermeasures to enact to reduce that risk to an acceptable level.</p>
<p>Security is about balancing risk and access to information. Putting an application on the internet is inherently risky; security is about mitigating that risk as much as possible to reduce the possibility of a breach or loss of data.</p>
<h2 id="defence-in-depth_2">Defence in depth <a class="head_anchor" href="#defence-in-depth_2">#</a>
</h2>
<p>Nothing is ever truly secure. I suppose you could lock a computer in a bunker, with no internet access, behind a hundred bank-vault-style doors - but with enough time and effort, someone could theoretically break in. Time to get my tin foil hat out.</p>
<p>So one concept we employ is defence in depth. This means multiple layers of security controls, used in conjunction with each other to increase the difficulty of an exploit. </p>
<p>Time for an example, and by no means is this exhaustive: we may want to lock our servers in a secure data centre, with biometrics, CCTV and lots of physical security, and keep the location secret. As many of us are now using cloud providers for our hosting, these are some of the tactics we know they employ. </p>
<p>Next, we can ensure our network security is up to spec, adding firewalls to our systems, only allowing connectivity across systems for the bare minimum to make it work. Especially noteworthy is removing all inbound access from the internet. API gateways are great for this, they can be a proxy and translation layer, so your applications aren’t directly exposed to the internet. We can also look at encrypting our traffic in transit, which will make it harder to eavesdrop and get valuable information.</p>
<p>Using VPNs or a BeyondCorp-style zero-trust model to access internal systems is an excellent way of adding an extra layer of defence, only allowing privileged users access to specific networks. This also helps keep these networks off the internet. </p>
<p>Then we can look at infrastructure security, the machines our code or applications are running on, ensuring they’re up to date and are running a minimum amount of software necessary for the app to work. We can also encrypt our disks at rest, so our data is less accessible to any unauthorised users. Servers should be regularly scanned for vulnerabilities and malicious software and rectified quickly. This means no more running 2-year-old versions of your favourite Linux distro…</p>
<p>Now we’re down to the application layer. Again, we’re going to want to check for vulnerabilities here, in any of the 3rd party dependencies. A recent report revealed that 86% of npm and 74% of indirect Java dependencies are vulnerable, with cross-site scripting being the most common issue. Then we’re going to want to review our authentication and authorisation layers to ensure they’re robust and that our role-based access control works as intended. </p>
<p>We can also check that our secrets are not stored in our code, and are appropriately stored in a centralised secrets management tool.</p>
<h1 id="wrap-up_1">Wrap up <a class="head_anchor" href="#wrap-up_1">#</a>
</h1>
<p>Thanks for watching and/or reading - I’d love to know what you thought.</p>
<p>Let me know on twitter - <a href="https://twitter.com/florx">@florx</a></p>
tag:blog.florxlabs.com,2014:Post/dynamic-secrets2020-05-26T05:30:52-07:002020-05-26T05:30:52-07:00Be rid of database passwords!<p>I recently published a demo of HashiCorp Vault’s dynamic secrets feature. </p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/dJuMYpgLeYA"></iframe>
<p>You can follow along with the demo using all the commands here: <a href="https://github.com/florx/secrets-are-hard-demo">https://github.com/florx/secrets-are-hard-demo</a></p>
<p>If you’re more of a text person than a video person - here’s what I covered and why they’re so cool.</p>
<h2 id="what-are-they_2">What are they? <a class="head_anchor" href="#what-are-they_2">#</a>
</h2>
<p>Dynamic secrets are credentials that don’t exist until you request them. Once you request access to a specific system (e.g. a database), the centralised secret management system (in this case Vault) will generate credentials and serve them. </p>
<p>These credentials will be time-limited, and if the requesting system doesn’t check in to say it’s still using them, they will be automatically revoked. This whole process of generation, leasing, and revoking, means any secrets are only available for a very short time and are only kept in memory.</p>
<h2 id="why-is-it-important_2">Why is it important? <a class="head_anchor" href="#why-is-it-important_2">#</a>
</h2>
<p>Static secrets, the more conventional approach of hardcoding a set of credentials (e.g. a username and password for a database), are all too easy to share. They are also very difficult to rotate: do you create a new user, make sure it has the same permissions, and then swap out the username and password? Or do you simply change the password? And how do you change the password and the config at the same time without any interruption of service?</p>
<p>Dynamic secrets solve a bunch of these problems: credentials are uniquely generated for each client that needs (and is allowed) access. Permissions can be tightly controlled and changed easily (the next time credentials are requested, the new set of permissions applies). And there’s no need to think about password rotation; it happens every time the service requests credentials.</p>
<h2 id="but-in-production_2">But in production? <a class="head_anchor" href="#but-in-production_2">#</a>
</h2>
<p>Yes, on my current project, we run this in production for all of our relational databases. We had the classic problem where developers had “borrowed” the username and password from services in production to access the database, to debug issues. So when someone left the project, we needed to roll the passwords - and got stuck in the problem space above. </p>
<p>We tried it out in our development environment for a few weeks and tweaked the permissions (we were too aggressive at first, which stopped our migrations from being able to alter tables). Once we got it working well, we rolled it out to all the relational databases across our estate. </p>
<h2 id="how-do-i-try-it-out_2">How do I try it out? <a class="head_anchor" href="#how-do-i-try-it-out_2">#</a>
</h2>
<p>In the demo video above, I go through these steps, which are also on my GitHub repository: <a href="https://github.com/florx/secrets-are-hard-demo">https://github.com/florx/secrets-are-hard-demo</a>. To follow along, you may want to clone it, so you have the various docker-compose files and test data.</p>
<pre><code class="prettyprint lang-bash">$ git clone git@github.com:florx/secrets-are-hard-demo.git
</code></pre>
<h2 id="prerequisites_2">Prerequisites <a class="head_anchor" href="#prerequisites_2">#</a>
</h2>
<p>You’ll need a few things first:</p>
<ul>
<li>PostgreSQL command-line tools (<code class="prettyprint">brew install postgresql</code>)</li>
<li>Vault (<a href="https://www.vaultproject.io/downloads.html">https://www.vaultproject.io/downloads.html</a>)</li>
<li>Docker Desktop (<a href="https://www.docker.com/products/docker-desktop">https://www.docker.com/products/docker-desktop</a>)</li>
</ul>
<h2 id="setup_2">Setup <a class="head_anchor" href="#setup_2">#</a>
</h2>
<p>First, we need to start a Vault server to play around with. This instance has basically no security enabled, so please don’t run it in production like this!</p>
<pre><code class="prettyprint lang-bash">$ vault server -dev
</code></pre>
<p>Next up in our environment setup steps, we need a database to test with. I’ve picked PostgreSQL simply because I’m most familiar with it. The docker-compose file <code class="prettyprint">postgres.yml</code> has all the config we need to start it, with default credentials <code class="prettyprint">admin/supersecret</code>.</p>
<pre><code class="prettyprint lang-bash">$ docker-compose -f postgres.yml up
</code></pre>
<p>Almost there with setup, let’s populate the database with some random data, then we can tell if our dynamic secrets and permissions actually worked! Naturally, we wouldn’t normally inline the password into an environment variable, but this makes the command copy/pasteable.</p>
<pre><code class="prettyprint lang-bash">$ PGPASSWORD=supersecret psql -h localhost -U admin -d users -a -f users.sql
</code></pre>
<p>Last bit of setup: we want to configure the Vault CLI so that the <code class="prettyprint">vault</code> command correctly talks to the server we just set up.</p>
<pre><code class="prettyprint lang-bash">$ export VAULT_ADDR='http://127.0.0.1:8200'
</code></pre>
<h2 id="setup-dynamic-secrets_2">Setup dynamic secrets! <a class="head_anchor" href="#setup-dynamic-secrets_2">#</a>
</h2>
<p>We first need to enable the <a href="https://www.vaultproject.io/docs/secrets/databases">database secrets engine</a>; this will allow us to tell Vault how to access our database and have it hand out credentials.</p>
<pre><code class="prettyprint lang-bash">$ vault secrets enable database
</code></pre>
<p>Next, we want to give Vault a way to access our database; we tell it what plugin, roles and <code class="prettyprint">connection_url</code> to use, plus some credentials to log in to that database.</p>
<p>This config is how Vault will connect to create and revoke users, for us to access it later.</p>
<pre><code class="prettyprint lang-bash"># Give Vault instructions on how to contact our database
# Note: allowed_roles=* and sslmode=disable are not secure; neither should be used in production.
$ vault write database/config/users-database \
plugin_name=postgresql-database-plugin \
allowed_roles="*" \
connection_url="postgresql://{{username}}:{{password}}@localhost:5432/users?sslmode=disable" \
username="admin" \
password="supersecret"
</code></pre>
<p>The last step of configuring Vault is to describe our role. We can have as many roles per database as we’d like. </p>
<p>In this example, I’ve picked a super limited <code class="prettyprint">read-only</code> role. The <code class="prettyprint">creation_statements</code> specify in SQL how to create a new user for this specific database type, and any permissions you wish to assign to that user.</p>
<p>The <code class="prettyprint">default_ttl</code> and <code class="prettyprint">max_ttl</code> define how long the credential is allowed to last before it needs to be renewed (default), and before it is revoked completely (max).</p>
<pre><code class="prettyprint lang-bash"># Create Vault database role for limited read only access to all tables
$ vault write database/roles/read-only-users-database-human-role \
db_name=users-database \
creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN \
PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; \
GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
default_ttl="1h" \
max_ttl="10h"
</code></pre>
<p>That’s all the config! Now if we need to get access to this database, we can read <code class="prettyprint">database/creds/read-only-users-database-human-role</code>. Vault will automatically connect to our database, then create a brand new username and password for us. This credential will be time-limited to 1 hour initially, but allowed to renew the lease for up to 10 hours. After the 10 hours are up, the user will be removed, and any connections booted from the database.</p>
<pre><code class="prettyprint lang-bash">$ vault read database/creds/read-only-users-database-human-role
</code></pre>
<p>Test it out! You’ll be allowed to <code class="prettyprint">select * from users</code> but not anything that will change the data, e.g. <code class="prettyprint">delete from users</code>.</p>
<pre><code class="prettyprint lang-bash">$ PGPASSWORD=<password-from-above> psql -h localhost -U <user-from-above> -d users
</code></pre>
<h2 id="wrap-up_2">Wrap up <a class="head_anchor" href="#wrap-up_2">#</a>
</h2>
<p>As you’ve seen from the demo, you no longer have to worry about developers or services having static credentials to databases. Additionally, their access can be tightly limited via codified Vault roles, and it’s effortless to add a new role if there’s a different access pattern.</p>
<p>We run this in production for all of our relational databases and love it. Give it a try and let me know how you get on! <a href="https://twitter.com/florx">@florx</a></p>
tag:blog.florxlabs.com,2014:Post/shut-dev-down-at-night2020-05-24T04:00:41-07:002020-05-24T04:00:41-07:00Shut DEV down... at night?<p>Thinking about shutting down your non-production environments at night, but wondering about the benefits? </p>
<p>Before we started shutting down at night, we were spending a small fortune on hosting environments that simply weren’t used during nights and weekends. We also didn’t have a good recovery plan if our infrastructure failed - so we decided to combine the two problems and solve them with one approach.</p>
<p>I’ll talk about what it means, the benefits, and how to get started on your own adventure of shutting down non-prod at night.</p>
<p><a href="https://svbtleusercontent.com/mqmYEQ2gVjfqfnHrqaSabL0xspap.jpg"><img src="https://svbtleusercontent.com/mqmYEQ2gVjfqfnHrqaSabL0xspap_small.jpg" alt="A zoomed in image of finger about to press the on/off button on a MacBook pro"></a></p>
<h2 id="what-do-i-mean-by-nightly-shutdowns_2">What do I mean, by nightly shutdowns? <a class="head_anchor" href="#what-do-i-mean-by-nightly-shutdowns_2">#</a>
</h2>
<p>By nightly shutdowns, I mean terminating all of your non-production cloud servers at a specified time (e.g. 9pm) and then starting new instances again in the morning (7am). You can also leave these environments “off” over the weekend, by simply not starting the instances on Saturday and Sunday mornings.</p>
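<p>The weekend behaviour can be as simple as a guard in the automated morning start-up job. A sketch, where <code class="prettyprint">start_environment</code> is a placeholder for whatever automation you use:</p>
<pre><code class="prettyprint lang-bash">dow=$(date +%u)   # ISO day of week: 1 = Monday ... 7 = Sunday
if [ "$dow" -le 5 ]; then
  echo "weekday: starting environment"
  # start_environment dev
else
  echo "weekend: leaving environment off"
fi
</code></pre>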
<h2 id="are-you-crazy-that39s-an-insane-amount-of-wor_2">Are you crazy? That’s an insane amount of work per night! <a class="head_anchor" href="#are-you-crazy-that39s-an-insane-amount-of-wor_2">#</a>
</h2><blockquote>
<p>Who’s got time to manually turn everything off as we leave, and turn it all back on again as we arrive in the morning? Also, something could go wrong - then we’re left without a development or test environment, and it would set the teams back by several days!</p>
</blockquote>
<p>Having nightly shutdowns as a goal will help you and your team achieve a few things that I believe are fundamentals to successfully running in a cloud environment. </p>
<p>Let’s dissect the quote above. “Who’s got time”: nobody should be doing this manually; it should be completely automated. A machine tears it down, and a machine brings it back up. Every day. Exactly the same. This level of automation will enable you to quickly and reliably roll out changes to the underlying infrastructure, knowing that they will be applied automatically in the morning.</p>
<p><a href="https://svbtleusercontent.com/6rywo6y86YZ76FTciSq4wc0xspap.jpg"><img src="https://svbtleusercontent.com/6rywo6y86YZ76FTciSq4wc0xspap_small.jpg" alt="Image of a yellow digger tearing a wall down. There is debris, and the wall is about to fall over. It's set against a blue cloudy sky."></a></p>
<p>“Something could go wrong”: we hope not, but wouldn’t you rather learn that your deployment/automation scripts were broken in your development/testing environments than in production during a disaster recovery scenario? The reality of running in a cloud environment is that all instances are at risk of being removed at any time by the cloud provider - or of crashing due to an underlying hardware failure. It’s important to design for this level of fault tolerance.</p>
<h2 id="benefits-of-nightly-shutdowns_2">Benefits of nightly shutdowns <a class="head_anchor" href="#benefits-of-nightly-shutdowns_2">#</a>
</h2>
<p>There are three main benefits:</p>
<ol>
<li>Cost saving</li>
<li>Never having to patch instances</li>
<li>Testing fault tolerance</li>
</ol>
<h3 id="cost-saving_3">Cost Saving <a class="head_anchor" href="#cost-saving_3">#</a>
</h3>
<p>This is the easiest to quantify, so here’s a graph of instance costs broken down per hour. You can clearly see the weekdays, and the weekend gaps where we keep our two environments offline.</p>
<p><a href="https://svbtleusercontent.com/54iWoFVUdZvNihqSDJEx1o0xspap.png"><img src="https://svbtleusercontent.com/54iWoFVUdZvNihqSDJEx1o0xspap_small.png" alt="A graph showing spikes during weekdays and flat low spots during the night and weekend, it has two colours signifying a split of environment"></a></p>
<p>There are roughly 730 hours per month, but only 160 working hours (based on an 8-hour day over 20 days). If you allow for a bit of flex in the working times (e.g. 7am - 9pm), it’s 280 hours. That’s roughly 40% of the total monthly hours (and therefore potentially only 40% of the cost!).</p>
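<p>The same sums are easy to rerun for your own schedule; a quick sketch of the arithmetic:</p>
<pre><code class="prettyprint lang-bash">HOURS_PER_MONTH=730          # roughly 24 hours x 30.4 days
WORKING_DAYS=20
ON_HOURS=$(( 14 * WORKING_DAYS ))             # 7am-9pm is 14 hours/day
PCT=$(( ON_HOURS * 100 / HOURS_PER_MONTH ))   # integer percentage
echo "$ON_HOURS on-hours, ~$PCT% of the month"
</code></pre>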
<h3 id="never-having-to-patch-instances_3">Never having to patch instances <a class="head_anchor" href="#never-having-to-patch-instances_3">#</a>
</h3>
<p>One of my philosophies in life is never to patch instances, so we don’t. We terminate and get new instances every day on non-production (and roll weekly on production). This means we also get the latest image that’s available (ready with the latest updates and security fixes).</p>
<p>Patching a running instance in place can be risky and time-consuming; if you’re updating the kernel, sometimes this will require a full restart. If you’re restarting anyway, why not get a brand new, fresh instance?</p>
<p><a href="https://svbtleusercontent.com/8QGw9PrcgKqfBrX1Xuuk9V0xspap.png"><img src="https://svbtleusercontent.com/8QGw9PrcgKqfBrX1Xuuk9V0xspap_small.png" alt="Image of a yellow neon sign in a window saying the word Fresh"></a></p>
<p>As we also run stateless services, and all the logs are streamed off each server to a centralised service, we don’t need to worry about log rotation or disks filling up. We can terminate disks at the same time as our instances, and get new ones.</p>
<h3 id="testing-fault-tolerance_3">Testing fault tolerance <a class="head_anchor" href="#testing-fault-tolerance_3">#</a>
</h3>
<p>Ever heard of <a href="https://netflixtechblog.com/chaos-engineering-upgraded-878d341f15fa">Chaos Kong</a>? He’s part of the <a href="https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116">Simian Army</a>, and is the tool that kills an entire region… whilst we’re not completely simulating that with a nightly shutdown, we are going a significant way towards testing all our startup scripts and plans.</p>
<p>Every day we get to check and test (automatically) that all of our services can survive being completely offline for a number of hours, and resume working normally later. We also confirm that they come back without any human intervention once the servers start up. We can also take measurements to see whether we’re slowing down or speeding up the boot process over time.</p>
<p>I can now sleep at night, knowing that if we need to invoke our disaster recovery procedures, all our startup scripts will work as expected - as they’ve been tested in a real environment every day.</p>
<h2 id="thinking-of-getting-started_2">Thinking of getting started? <a class="head_anchor" href="#thinking-of-getting-started_2">#</a>
</h2>
<p><a href="https://svbtleusercontent.com/f4HshU3Hcch2BU37q6GuD40xspap.jpg"><img src="https://svbtleusercontent.com/f4HshU3Hcch2BU37q6GuD40xspap_small.jpg" alt="Image of a red and pink neon sign saying Game On, against a black background. The words are a surrounded by a neon square boxing them in"></a></p>
<ul>
<li>Check in with your team, and anyone using these environments - find out what times work for them to be offline! This whole idea unfortunately doesn’t work very well if you have an off-shore team on the other side of the world.</li>
<li>Check your instances are completely automated, and need no human to touch them to go from not existing to completely working.
<ul>
<li>This is a huge topic in itself - it depends on your application and environment. We use AWS EC2 to host our applications, and heavily rely on <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html">User Data</a> scripts on our <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/create-asg.html">auto scaling launch configurations</a> to set up the application to the correct state on boot.</li>
<li>The auto scaling group will maintain the number of instances we ask it to run, and this is where we put the <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/schedule_time.html">schedule</a> in some cases; in other cases we run it from a <a href="https://circleci.com/docs/2.0/workflows/#nightly-example">CircleCI schedule</a> that triggers an AWS CLI command to do the same job - this is so we can easily run the “scale-up” workflow to bring the environment online.</li>
</ul>
</li>
<li>Manually terminate instances to test it out first; they should come back automatically if using an auto-scaling group. This can give you the confidence to try it nightly.</li>
<li>One way to automate this is to use an auto-scaling group (or similar). Set a schedule to scale up/down - this will automatically terminate and start new instances at your specified times.</li>
</ul>
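<p>As a concrete sketch of the auto-scaling route on AWS: two scheduled actions, one to scale to zero in the evening and one to bring the environment back in the morning. The group name, sizes and times below are made up - adjust them for your environment (and note the recurrence is cron syntax, evaluated in UTC):</p>

```shell
# Scale the dev environment to zero every weekday evening at 9pm...
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name dev-environment-asg \
  --scheduled-action-name nightly-shutdown \
  --recurrence "0 21 * * 1-5" \
  --min-size 0 --max-size 0 --desired-capacity 0

# ...and bring it back up before the team starts in the morning.
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name dev-environment-asg \
  --scheduled-action-name morning-startup \
  --recurrence "0 7 * * 1-5" \
  --min-size 2 --max-size 4 --desired-capacity 2
```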
<p>Let me know how you get on! <a href="https://florxlabs.com/florx">@florx</a> </p>
<h3 id="credits_3">Credits <a class="head_anchor" href="#credits_3">#</a>
</h3>
<ul>
<li>“Off” <a href="https://unsplash.com/photos/cw_uvISXkCI">Photo by Aleksandar Cvetanovic on Unsplash</a>
</li>
<li>“Demolition” <a href="https://unsplash.com/photos/Jec9wKPvxlc">Photo by Science in HD on Unsplash</a>
</li>
<li>“Fresh” <a href="https://unsplash.com/photos/tvJOR05xJ0o">Photo by Pietro De Grandi on Unsplash</a>
</li>
<li>“Game On” <a href="https://unsplash.com/photos/k_pBB5wJtaU">Photo by 甜心之枪 Sweetgun on Unsplash</a>
</li>
</ul>
tag:blog.florxlabs.com,2014:Post/an-obsessive-commitment-to-automation2020-03-01T04:50:53-08:002020-03-01T04:50:53-08:00An obsessive commitment to automation<p>I have an unapologetic and relentless obsession with automation. It drives the vast majority of my decision making, both in personal and work contexts, and also the culture of the teams I build.</p>
<p>This is a story of how I identified three manual pain points on a project, and completely automated them away. This eradicated both human error, and human laziness. I’ve often found the laziest engineers make for the best engineers, as they’ll find the simplest solution - and if repetitive - automate the task so they only have to do it once.</p>
<p>We aim to automate every (sensible) aspect of our systems, so engineers don’t have to read docs to remember how to get a service to production - the pipeline/code just makes it happen.</p>
<h2 id="automation-problem-1-not-all-repos-have-the-s_2">Automation Problem 1: Not all repos have the same settings <a class="head_anchor" href="#automation-problem-1-not-all-repos-have-the-s_2">#</a>
</h2>
<p>Once you grow past only a handful of repositories, it becomes problematic to keep all the settings the same, and doubly so if you decide to change a setting! We’ve settled on many settings in GitHub in my current project, here’s a quick rundown:</p>
<ol>
<li> Repo Webhooks
<ul>
<li>We add a specific webhook to every repository to maintain labelling standards (enforcing semver labels and a conforming description) </li>
</ul>
</li>
<li>Repo Labels
<ul>
<li>We add labels to every repo, managing the colours so they’re also always consistent (this was added before GitHub supported Organisation Labels, and we’ve kept it so we can exclude one or two repos)</li>
</ul>
</li>
<li>Team permissions
<ul>
<li>All repositories carry the same permissions in our Org, things like <code class="prettyprint">ReadOnly</code>, <code class="prettyprint">Dev</code> for write access and <code class="prettyprint">AllReposAdmin</code> for admins.</li>
</ul>
</li>
<li>Repo Settings
<ul>
<li>We disable issues, projects, and wiki (as we use JIRA for this)</li>
<li>We only allow Squash Merge </li>
</ul>
</li>
<li>Branch Protection
<ul>
<li>We use feature branches, with reviewed PRs to merge to master - meaning we need to protect our master and enforce reviews.</li>
</ul>
</li>
<li>Signing Protection
<ul>
<li>We enforce GPG signed commits on every commit, and this prevents a PR from being merged if it doesn’t comply</li>
</ul>
</li>
</ol>
<p>Phew! That’s a lot of settings for every developer to get right when they set up a repository from scratch. Enter the <code class="prettyprint">repo-conformity</code> tool. I wrote this tool to automate away all the manual clicking in GitHub’s UI.</p>
<p>At the time we had 60 or so repos, which all had different settings, but in theory should have all been the same. We now have over 200 but all with identical settings. It’s now open source too: <a href="https://github.com/florx/repo-conformity-enforcer">https://github.com/florx/repo-conformity-enforcer</a></p>
<h2 id="automation-problem-2-circleci-pipeline-drift_2">Automation Problem 2: CircleCI pipeline drift <a class="head_anchor" href="#automation-problem-2-circleci-pipeline-drift_2">#</a>
</h2>
<p>Much like the misaligned GitHub settings above (which may seem like a trivial problem, until someone accidentally commits and pushes to master…), our CircleCI pipelines were all slightly different.</p>
<p>This is surprising to me given all of our services are the same language, they are built the same way, and they deploy to every environment the same way. Once a PR is merged to master, it will sort out getting that code into production with only a 1-click-promote step after our preproduction environment.</p>
<p>Engineers were simply copy/pasting the <code class="prettyprint">circleci.yml</code> file, using find/replace to make it work with their shiny new service, then customising it to make the build work… if this doesn’t sound scalable or maintainable, that’s because it wasn’t.</p>
<p>Enter <code class="prettyprint">circleci-templater</code>, this tool takes an inventory (our list of repositories, and what type they are), and will generate an appropriate <code class="prettyprint">circleci.yml</code> file based on that info. It also creates a new branch, checks in the change and pushes it - ready for a review!</p>
<p>This means every one of our pipelines across our 60+ deployable service repos is identical, and operates in exactly the same way. It also makes it super easy to roll out a new step: we can add new security/vulnerability scans or automated SAST/DAST with only a few lines of change in the template, and it will create PRs for all repos that need changing.</p>
<p>This one isn’t ready to be open sourced yet as it’s tightly coupled to our specific environment - but I’m working on it.</p>
<h2 id="automation-problem-3-forgetting-to-promote-to_2">Automation Problem 3: Forgetting to promote to production <a class="head_anchor" href="#automation-problem-3-forgetting-to-promote-to_2">#</a>
</h2>
<p>Remember that “1-click-promote to PROD” from earlier? Well, that also causes problems. We’ve introduced a manual step! Humans are terrible at both remembering manual steps and executing them. Machines, on the other hand, never forget.</p>
<p>We started to get deployment drift between staging and production over time, with a lack of visibility into which services had been promoted and which hadn’t. The CircleCI master workflow would tell us, as it would be <code class="prettyprint">On Hold</code>, but it was majorly monotonous to go through every service one by one (especially when there are 60).</p>
<p>Enter <code class="prettyprint">prod-release-checker</code>. This little tool does a comparison between a <code class="prettyprint">prod</code> tag and master to ensure the latest code has been released. If not, it reports twice a day how many releases behind a given service is.</p>
<p>It ends up looking like this: service name, number of releases behind, and the last person to touch the repository. We don’t do blame, but this is an excellent way to keep people accountable for releasing their own changes to PROD. After all, it’s not <em><u>done</u></em> until a customer can use it!</p>
<pre><code class="prettyprint">service-a is behind by 1 releases (florx)
service-b is behind by 1 releases (some-other-user)
service-c is behind by 3 releases (dependabot)
</code></pre>
<p>(Psst, this one is OSS too! <a href="https://github.com/florx/github-check-prod-releases">https://github.com/florx/github-check-prod-releases</a> - or read more about it <a href="https://blog.florxlabs.com/going-fast-and-keeping-track-of-releases">https://blog.florxlabs.com/going-fast-and-keeping-track-of-releases</a>)</p>
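<p>The core of a check like this doesn’t need anything fancy. Assuming each production release is tagged (the <code class="prettyprint">prod</code> tag name here mirrors the example above), plain git can count how far ahead master is - a minimal sketch:</p>

```shell
# Count how many commits the second ref has that the first doesn't -
# i.e. how many releases production is behind master.
behind_count() {
  git rev-list --count "$1..$2"
}

# Usage (hypothetical refs): behind_count prod origin/master
```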
<h1 id="wrap-up_1">Wrap up <a class="head_anchor" href="#wrap-up_1">#</a>
</h1>
<p>Automation, and building in tools to ensure consistency across your different services and infrastructure allows you and your team to move much faster. If you <u>know</u> everything works in an identical way, you can quickly dismiss many debugging steps when something goes wrong (like a CI build).</p>
<p>It also allows you to experiment and introduce new features in the dev team really quickly. We added docker vulnerability scanning into our pipeline within an hour, because we could roll out a change to every pipeline in only a few clicks.</p>
<p>Let me know what you’ve automated! - <a href="https://florxlabs.com/florx">@florx</a></p>
tag:blog.florxlabs.com,2014:Post/youre-using-the-word-devops-badly2019-11-10T07:35:18-08:002019-11-10T07:35:18-08:00You’re using the word DevOps badly<p>I recently attended a conference designed to bring together security and engineering teams, and I was struck by how many people use the word DevOps in different ways, and to me - badly, wrongly and out of context.</p>
<h2 id="quotdevopsquot-used-badly_2">“DevOps” used badly <a class="head_anchor" href="#quotdevopsquot-used-badly_2">#</a>
</h2>
<ul>
<li>I heard people ask, are you DevOps?</li>
<li>Do you do DevOps?</li>
<li>Recruiters hiring for a DevOps engineer</li>
</ul>
<h2 id="what-does-devops-mean-to-me_2">What does DevOps mean to me? <a class="head_anchor" href="#what-does-devops-mean-to-me_2">#</a>
</h2>
<p>DevOps is a modern, cultural approach to engineering. To understand this more, we need to understand what we historically did before this idea came to be.</p>
<p>Traditionally, engineering teams were split into those who wrote code and those who made sure production carried on working. These were also known as “Developer” teams and “Operations” teams.</p>
<p>This produced some communication and handoff difficulties between the two styles of team: when the Developer team was ready to run something in production, they would “kick the code over the wall” and hand it to the Operations team. The Operations team would rightly want to understand whether the various non-functional testing had happened, and would also need to make various changes to make it run in their production environment.</p>
<p>The way the teams were set up encouraged the Development team to ship as quickly as possible, and Operations to resist as much change to production as possible. This would naturally cause tension, as Operations would want to allow as few changes as possible - this previously being understood to reduce the risk of an outage.</p>
<p>Change packages would then be bundled up by the Development team into large changes, and handed off to the Operations team to be deployed at an undetermined time.</p>
<p><a href="https://amazon.co.uk/Phoenix-Project-Devops-Helping-Business/dp/1942788290">The Phoenix Project</a> really solidified this idea for me. Originally published in 2013, it is still very relevant to our teams today.</p>
<h2 id="what-does-devops-mean-to-me-attempt-2_2">What does DevOps mean to me, attempt 2 <a class="head_anchor" href="#what-does-devops-mean-to-me-attempt-2_2">#</a>
</h2>
<p>DevOps is a cultural approach to engineering teams. It starts by bringing the Operations and Developer teams together into one team, breaking down the silos. This new team can start to be cross-functional: understanding and estimating each other’s tickets, and talking to each other regularly in ceremonies (retrospectives, stand-ups, refinement etc).</p>
<p>Eventually, this one team will understand both parts of what is required to build an application, writing code but also shipping it to production. The mindset will move towards thinking about how to run it, before writing a single line of code. These requirements will feed into the initial design and implementation phases, with a priority given to shipping regularly, quickly but safely.</p>
<p>This can all come together through significant automation and collaboration, more on CI/CD automation another time… but here’s <a href="https://blog.florxlabs.com/going-fast-and-keeping-track-of-releases">something I wrote earlier about it</a>.</p>
<p>Therefore, you can’t “do DevOps” - it’s a way of working, not a checklist item. You can’t hire a “DevOps engineer”, as it’s cultural. The most appropriate term would be “Full Stack engineer”, which to me means frontend, backend and platform/infrastructure.</p>
<p>Thoughts? I’m <a href="https://florxlabs.com/florx">@florx</a> on twitter.</p>
tag:blog.florxlabs.com,2014:Post/going-fast-and-keeping-track-of-releases2019-02-10T11:34:22-08:002019-02-10T11:34:22-08:00Going fast & keeping track of releases<p>It’s important in our world to release as often as possible to production. This means we get to move quickly, try new things, and then get results for the business quickly.</p>
<p>One of the issues with moving quickly is that, with increased automation, there is more reliance on tools, and it’s easy to forget to approve a production deployment.</p>
<p>We use a pipeline that looks something like this:<br>
<a href="https://svbtleusercontent.com/jVLkb6BDEU4FX1KjR316aw0xspap.png"><img src="https://svbtleusercontent.com/jVLkb6BDEU4FX1KjR316aw0xspap_small.png" alt="Release CI Pipeline.png"></a></p>
<p>We’re doing Continuous Integration and Delivery, but not quite Continuous Deployment. We continuously deploy to our lower environments, but not production quite yet. As you might expect, our master branch is deployed to our development and pre-production environment automatically, but can sometimes get stuck at the manual approval step.</p>
<p>In order to check where we’re at, I’ve written and open-sourced a quick script that will check how far behind production is on each repository. Take a look here: <a href="https://github.com/florx/github-check-prod-releases">https://github.com/florx/github-check-prod-releases</a> </p>
<p>On my current project we follow a fairly strict “if it’s on master, it should be in production, unless there’s a problem” rule. Coupled with tagging our repositories when we release to any given environment, this allows us to easily check the status of any given repository.</p>
<p>If the master branch on any repo is ahead of production by even 1 commit, then we should look to release it. We also do squash merges, so we also know that each commit is 1 pull request of content.</p>
<p>Good luck!</p>
tag:blog.florxlabs.com,2014:Post/versioning2019-02-03T13:00:21-08:002019-02-03T13:00:21-08:00What's in a version?<p>Building distributed systems using a microservice pattern is hard. At my company we’re always looking for ways to automate any manual processes, or anything that is difficult. Computers don’t make mistakes, but humans aren’t infallible. The more we can rely on a machine-led process, the more reliable a release process can be. This is the journey of versioning and releasing for one of our projects.</p>
<h1 id="in-the-beginning_1">In the beginning… <a class="head_anchor" href="#in-the-beginning_1">#</a>
</h1>
<p>We’re using Kubernetes, with Helm to manage our deployments. We started by using the SHA1 Git hash as the version number, it’s pretty unique and means there’s no manual intervention required by a developer to “bump” a version number. It’s automatically “bumped” by git every time we squash and merge a pull request to master!</p>
<p>Each of those merges to master triggers an automatic build, and pushes the code to both the dev and preprod environments; it then waits for approval before pushing to production.</p>
<p>This was great and served us well for a long time, but there were a few minor issues with this process. We were interchangeably using <code class="prettyprint">helm</code> and <code class="prettyprint">kubectl</code> to do our deployment, then rolling back automatically if there was a problem with the healthcheck/deployment. This meant that the tiller state (the server component of helm) would think a specific version was already deployed if we tried to rerun the deployment after a rollback.</p>
<p>Not good.</p>
<h1 id="new-approach_1">New approach <a class="head_anchor" href="#new-approach_1">#</a>
</h1>
<p>We overhauled our deployment mechanism, this time fully relying on helm for deployments and rollbacks. We use a single base helm chart for all of our images, as we want uniformity. There’s a <code class="prettyprint">helm.yaml</code> file in the root of every repo specifying variables for the chart (such as the image name), and we specify the tag to use as part of deployment pipeline:</p>
<pre><code class="prettyprint">helm upgrade --install $NAME --version $CHART_VER --wait \
-f helm.yaml --set image.tag=$VERSION,environment=$ENV \
base-service
</code></pre>
<p>This same command will also let us deploy a brand new service for the first time, thanks to the <code class="prettyprint">--install</code> flag.</p>
<p>We check the output of the above command for a non-zero exit code, if we get one, then we know something went wrong and we rollback to the previous version:</p>
<pre><code class="prettyprint">helm rollback $NAME 0
</code></pre>
<p>Specifying the revision number as zero is a special value that actually means “-1”, i.e. the previous version, in helm. You <a href="https://github.com/helm/helm/blob/master/pkg/tiller/release_rollback.go#L79-L82">can find the code here</a> - it’s not in the official docs so… your mileage may vary.</p>
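<p>Putting those two commands together, the deploy-or-rollback step can be sketched like this (a simplified sketch using the same placeholder variables as above, not a verbatim copy of our pipeline):</p>

```shell
# Deploy, and roll back to the previous release if helm exits non-zero.
if ! helm upgrade --install "$NAME" --version "$CHART_VER" --wait \
     -f helm.yaml --set "image.tag=$VERSION,environment=$ENV" \
     base-service; then
  echo "Deployment of $NAME failed, rolling back" >&2
  helm rollback "$NAME" 0
  exit 1
fi
```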
<hr>
<p>These changes now allow us to deploy the same change repeatedly until we get a success, without having to do any manual faffing about. This massively speeds up the engineering teams, but we still have that small issue of version numbers looking like this: <code class="prettyprint">c9b8132ef905721c0a1a2a342c5f321c636001ce</code> and <code class="prettyprint">b576368621a067c5b9380b3da8cf7e27dabaa916</code>. Which one is older just by looking at them? Who knows. We have to ask, rather than just knowing by looking. We also have no idea how big the change was! Back to the drawing board.</p>
<h1 id="semantic-versioning-to-the-rescue_1">Semantic Versioning to the rescue <a class="head_anchor" href="#semantic-versioning-to-the-rescue_1">#</a>
</h1>
<p><a href="https://semver.org/">SemVer, or Semantic Versioning,</a> is not a new concept - it has been around for years. Most engineers will have come across the triplet pattern of numbers (1.4.0) at one point or another, as most major software uses this pattern. The idea is to communicate to both humans and computers the size and severity of a change through categories: major, minor or patch. Each of the numbers represents one of these categories (e.g. in 1.4.0, the major version is 1, the minor is 4 and the patch is 0).</p>
<p>Each number can be incremented individually but resets each following number to zero (e.g. 1.4.0, with a major increment, would change to 2.0.0. 1.4.0 with a minor increment, would change to 1.5.0).</p>
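<p>The bump rules are mechanical enough to sketch in a few lines of shell (a simplified sketch that assumes a plain X.Y.Z version, with no pre-release or build metadata):</p>

```shell
# Bump an X.Y.Z version by category: major, minor or patch.
bump() {
  ver=$1 kind=$2
  major=${ver%%.*}           # text before the first dot
  rest=${ver#*.}             # text after the first dot
  minor=${rest%%.*}
  patch=${rest#*.}
  case "$kind" in
    major) echo "$((major + 1)).0.0" ;;
    minor) echo "${major}.$((minor + 1)).0" ;;
    patch) echo "${major}.${minor}.$((patch + 1))" ;;
    *) echo "unknown category: $kind" >&2; return 1 ;;
  esac
}

bump 1.4.0 major   # -> 2.0.0
bump 1.4.0 minor   # -> 1.5.0
```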
<p>We decided to move over to this type of versioning, which then presents a new problem: how do we automatically “bump” the version number? We consider something a new release every time we merge to master. We need input from the developer to tell us what category of change this is; we also now need to decide what these categories mean, and how to implement this automated system.</p>
<h1 id="automated-semver_1">Automated SemVer <a class="head_anchor" href="#automated-semver_1">#</a>
</h1>
<p>We use GitHub and exclusively use Pull Requests to merge to the master branch (once reviewed, built/tested on CI, linted, passed code coverage checks and approved, of course). Pull Requests on GitHub have labels, which is a way we can communicate to our build environment (CircleCI in our case) to tell it what kind of change this is.</p>
<p>It was possible to leverage GitHub webhooks and status checks to enforce exactly one of the labels “major”, “minor” or “patch” on every GitHub PR. This gave us our input; then we just had to get CircleCI to read from the GitHub APIs and bump the version as required. We also took the time to create a GitHub release and tag the commit properly now that we have this new version.</p>
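<p>The label-reading half of this can be sketched as follows (the API call in the comment is illustrative - the owner, repo and PR number are placeholders):</p>

```shell
# Given a whitespace-separated list of a PR's label names, pick out the
# single SemVer category. In CI the list would come from the GitHub API,
# e.g. (hypothetical repo and PR number):
#   curl -s "https://api.github.com/repos/OWNER/REPO/issues/42/labels"
semver_label() {
  # $1 is deliberately unquoted so each label becomes its own line.
  printf '%s\n' $1 | grep -Ex 'major|minor|patch' | head -n 1
}

semver_label "bug minor needs-docs"   # -> minor
```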
<p>This is great, and we rolled out the change so every PR now required a label. Next came some valid questions: is this change a <code class="prettyprint">minor</code> or a <code class="prettyprint">patch</code> change? Should it be <code class="prettyprint">major</code> as I’ve changed X thing?</p>
<h1 id="what-do-the-categories-mean_1">What do the categories mean? <a class="head_anchor" href="#what-do-the-categories-mean_1">#</a>
</h1>
<p>We published the below internally as a living document, the idea being that if we come up against a scenario that isn’t covered, we can discuss as a team, categorise then add to the list.</p>
<h2 id="major-change_2">Major change <a class="head_anchor" href="#major-change_2">#</a>
</h2>
<p>Increased when the change is an incompatible API change.</p>
<p>Examples:</p>
<ul>
<li>Dropping a field,</li>
<li>adding a required parameter,</li>
<li>breaking change to a public facing domain/library model object,</li>
<li>renaming a REST resource or field,</li>
<li>breaking business logic changes,</li>
<li>any change requiring an update/addition to Vault</li>
</ul>
<p>These are all examples of breaking changes.</p>
<h2 id="minor-change_2">Minor change <a class="head_anchor" href="#minor-change_2">#</a>
</h2>
<p>Increased when the change is a feature addition that is backwards-compatible.</p>
<p>Examples:</p>
<ul>
<li>Adding a new feature,</li>
<li>adding a new REST endpoint,</li>
<li>adding a non-required parameter,</li>
<li>adding a new domain/library model object,</li>
<li>changing the POM/dependencies/settings/application config</li>
<li>changing the CI pipeline.</li>
</ul>
<p>These are all examples of new features: non-breaking, backwards-compatible changes.</p>
<h2 id="patch-change_2">Patch change <a class="head_anchor" href="#patch-change_2">#</a>
</h2>
<p>Increased only when it’s a backwards-compatible bug fix.</p>
<p>Examples:</p>
<ul>
<li>Changing a comparison in an if statement for a bug fix,</li>
<li>fixing a typo / adding to a README,</li>
<li>adding/improving logging,</li>
<li>general one line changes, small bug fixes that don’t feature any of the characteristics above.</li>
</ul>
<p>These are all examples of a PATCH change: small, self-contained changes consisting only of backwards-compatible bug fixes or small improvements (but not feature additions).</p>
<h1 id="wrap-up_1">Wrap up <a class="head_anchor" href="#wrap-up_1">#</a>
</h1>
<p>Being able to deploy, roll back automatically and use human-readable versioning has allowed the team to move faster. Classifying every change also gives us visibility of the risk of each change, and we can begin to make better decisions around API versioning and such in the coming weeks.</p>
tag:blog.florxlabs.com,2014:Post/what-the-ops-is-noops2018-06-29T06:55:31-07:002018-06-29T06:55:31-07:00What the Ops is NoOps?<p>Our industry is not shy about inventing new terms to describe the new things it’s doing.</p>
<p>We have DevOps, which is all about the culture shift from the old style of writing some code, then throwing it over the wall for someone else to run in production for you. As The Phoenix Project novel outlines, this rarely works, so DevOps is about moving the development team and the operations team closer together, to improve communication and collaboration, allowing deadlines to be shared rather than fought over.</p>
<p>For me, NoOps takes “moving the dev and ops teams closer” to the next level. Every developer should deploy, and regularly.</p>
<p><a href="https://svbtleusercontent.com/vqTA4E9Hoe7RaoB9shMoPZ0xspap.jpg"><img src="https://svbtleusercontent.com/vqTA4E9Hoe7RaoB9shMoPZ0xspap_small.jpg" alt="Photo by Thomas Kvistholt on Unsplash"></a></p>
<h1 id="so-what-is-noops_1">So what is NoOps? <a class="head_anchor" href="#so-what-is-noops_1">#</a>
</h1>
<p>For me, it’s important that Developers don’t speak to Ops anymore. That’s because they are the Ops team too. There’s no dedicated Ops person on my teams, each individual developer has the knowledge and access to do deployments as regularly as possible.</p>
<p>The teams let the pipelines do all the heavy lifting, I’m talking about letting your Continuous Integration (CI) tool do all the hard work. When a branch in git is code reviewed, approved, and subsequently merged, get your CI to build your releasable artifacts straight away.</p>
<p>This follows the premise that your main or master branch is always releasable, and you have the artifacts built and published to do so immediately.</p>
<p>As part of automating as much as possible, you could think about deploying to your development/test environments from your CI pipeline. Welcome to Continuous Delivery to the dev environment!</p>
<h1 id="the-enterprise-challenge_1">The Enterprise Challenge <a class="head_anchor" href="#the-enterprise-challenge_1">#</a>
</h1>
<p>Is Continuous Delivery, and NoOps even possible in an Enterprise environment? In short, yes, but it can be tricky.</p>
<p>First, you’ll want to own your own infrastructure as a development team; this will allow you to move quickly, try new things out, and implement change fast. The easiest way is to leverage the cloud, as trying to procure and deliver hardware into an on-premise situation is unlikely to end with your team owning the full infrastructure.</p>
<p>The Cloud is probably a new-ish venture for most Enterprise level companies, so there will likely be teams of people (DBAs, Networking, SysAdmins) who will want to take control of parts of your infrastructure.</p>
<p><strong>Resist.</strong></p>
<p>Explain to those teams, that you’re reducing the load on their team. It’s usually easier to ask forgiveness than it is to get permission, when done responsibly. Also make sure you and your team contribute back to any internal shared repos made by those teams, as this will help build the relationship.</p>
<p>The one team I have failed to mention is Enterprise Security. Their primary purpose is to keep the environment safe and secure. Help them to help you by engaging early and often on your project. Don’t just throw the rugby ball of code over when it’s done and expect an immediate answer.</p>
<p><a href="https://svbtleusercontent.com/tB7Fo92E6L4GVw2z7b7Bcy0xspap.jpg"><img src="https://svbtleusercontent.com/tB7Fo92E6L4GVw2z7b7Bcy0xspap_small.jpg" alt="Photo by Quino Al on Unsplash"></a></p>
<p>You can further improve this “often” part, by starting down the path of <del>DevSecOps</del> NoSecOps. This in the first instance means adding a security tool into your CI pipeline, preferably a tool that Security are already familiar with and use the output of to analyse your application.</p>
<h1 id="what-we-learned_1">What we learned <a class="head_anchor" href="#what-we-learned_1">#</a>
</h1>
<p>When you’re working in an Enterprise environment, sometimes other teams don’t always move at the same speed as your own team. This can be frustrating and can lead to delays and missed deadlines if not quickly resolved. Resolve by engaging earlier, escalating quicker and building relationships!</p>
<p>The more you manage to automate and put into your pipeline, the easier your life will become. Deployments will become a breeze! Remember that you don’t need to make the automation perfect on the first try - iterate and make it better over the course of several weeks/months.</p>
<p>So give it a go, reduce your reliance as a Dev team on the Ops team, understand your application architecture, how it runs, and learn how deployments are done. Then start to do deployments, have a say in the architecture, and design your application to run better in the environment you have (or improve it!)</p>
<p>Let me know how your NoOps journey goes - <a href="http://www.twitter.com/florx">@florx</a></p>
<p>Thanks to Thomas Kvistholt and Quino Al for their photos on <a href="https://unsplash.com?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>.</p>