Nobody wants to make mistakes, do they? If you can see something’s gonna go wrong, its only natural to do what you can to prevent it. If you’ve made a mistake once, what kind of idiot wants to repeat it? But what if the cure is worse than the problem? What if the effort of avoiding mistakes is worse than what you’re preventing?
So you’ve found a bug in production that really should have been caught during QA; you’ve had a load-related outage in production; you find a security issue in production. What’s the natural thing to do? Once you’ve fixed the immediate problem, you probably put in place a process to stop similar mistakes next time.
Five whys is a great technique to understand the causes and make appropriate changes. But if you find yourself adding more bureaucracy, a sign-off to prevent this happening in future – you’re probably doing it wrong!
Unfortunately this is a natural instinct: in response to finding bugs in production, you introduce a sign-off to confirm that everyone is happy the product is bug-free, whatever that might mean; you introduce a final performance test phase, with a sign-off to confirm production won’t crash under load; you introduce a final security test, with a sign-off to confirm production is secure.
Each step and each reaction to a problem is perfectly logical; each answer is clear, simple and wrong.
Risk Free Software
Let’s be clear: there’s no such thing as risk free software. You can’t do anything without taking some risk. But what’s easy to overlook, is that not doing something is a risk, too.
Not fixing a bug runs the risk that its more serious than you thought; more prevalent than you thought; that it could happen to an important customer, someone in the press, or a highly valued customer – with real revenue risk. You run the risk that it collides with another, as yet unknown bug, potentially multiplying the pain.
Sometimes not releasing feels like the safest thing to do – but you’re releasing software because you know something is wrong. How can not changing it ever be better?
So what you gonna do? No business wants to accept risk, you have to mitigate it somehow. The simple, easy and wrong thing to do is to add more process. The braver decision, the right decision, is to make it easy to undo any mistakes.
Any release process, no matter how retarded, will normally have some kind of rollback. Some way of getting back to how things used to be. At its simplest, this is a way of mitigating the risk of making a mistake: if it really is a pretty shit release, you can roll it back. Its not great, but it gives you a way of recovering when the inevitable happens.
But often people want to avoid this kind of scenario. People want to avoid rolling back; to avoid the risk of a roll back; totally missing the point that the rollback is your way of managing risk. Instead, you’re forced to mitigate the risk up front with bureaucracy.
If you’re using rollback as a way of managing risk (and why wouldn’t you?), then you’d expect to rollback from time to time. If you’re not rolling back, then you’re clearly removing all risk earlier in the process. This means you have a great process for removing risk; but could you have less process and still release product sometime this year?
Get There Quicker
Being able to rollback is about being able to recover from mistakes quickly and reliably. Another way to do that is to just release solutions quickly. Instead of rolling back and scheduling a fix sometime later, why not get the fix coded, tested and deployed as quickly as possible?
Some companies rely on being able to release quickly and easily every day. Continuous deployment might not itself improve quality; but it improves your ability to react to problems. The obvious side-effect of this is that you can fix issues much faster, so you don’t spend time before a release trying to catch absolutely everything. Instead by decreasing the time between revisions, by increasing your velocity, you create a higher quality product: you just fix issues so much faster.
Continuous deployment lets you streamline your process – you don’t need quite so many checks and balances, because if something bad happens you can react to it and fix it. Obviously, you need tests to ensure your builds are sound – but it encourages you to automate your checks, rather than relying on humans and manual sign-offs. Instead of introducing process, why not write code to check you’ve not made the same mistake twice?
Of course, the real irony in all this, is that the thing that often stops you doing continuous deployment is a long and tortuous release process. The release process encapsulates the lessons from all your previous mistakes. But with a lightweight process, you could react so much faster, by patching within minutes not days, that you wouldn’t need the tortuous process.
Your process has become its own enemy!