Risk free software

Nobody wants to make mistakes, do they? If you can see something's going to go wrong, it's only natural to do what you can to prevent it. If you've made a mistake once, what kind of idiot wants to repeat it? But what if the cure is worse than the disease? What if the effort of avoiding mistakes costs more than the mistakes themselves?

Preventative Measures

So you've found a bug in production that really should have been caught during QA; you've had a load-related outage in production; you've discovered a security hole in production. What's the natural response? Once you've fixed the immediate problem, you probably put a process in place to stop similar mistakes next time.

Five whys is a great technique for understanding the causes and making appropriate changes. But if you find yourself adding more bureaucracy – a sign-off to prevent this happening in future – you're probably doing it wrong!

Unfortunately this is a natural instinct: in response to finding bugs in production, you introduce a sign-off to confirm that everyone is happy the product is bug-free, whatever that might mean; you introduce a final performance test phase, with a sign-off to confirm production won’t crash under load; you introduce a final security test, with a sign-off to confirm production is secure.

Each step and each reaction to a problem is perfectly logical; each answer is clear, simple and wrong.

Risk Free Software

Let's be clear: there's no such thing as risk-free software. You can't do anything without taking some risk. But what's easy to overlook is that not doing something is a risk, too.

Not fixing a bug runs the risk that it's more serious than you thought, or more prevalent than you thought; that it could hit an important customer, or someone in the press – with real revenue at stake. You also run the risk that it collides with another, as-yet-unknown bug, multiplying the pain.

Sometimes not releasing feels like the safest thing to do – but you only release software because you know something is wrong. How can not changing it ever be better?

The Alternative

So what are you going to do? No business wants to accept risk; you have to mitigate it somehow. The simple, easy and wrong thing to do is to add more process. The braver decision – the right decision – is to make it easy to undo your mistakes.

Any release process, no matter how convoluted, will normally have some kind of rollback: some way of getting back to how things used to be. At its simplest, this is a way of mitigating the risk of making a mistake: if a release really is that bad, you can roll it back. It's not great, but it gives you a way of recovering when the inevitable happens.
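To make the idea concrete, here's a minimal sketch of one common rollback mechanism: versioned release directories with a "current" symlink that the server follows. All the names and the layout are illustrative assumptions, not a description of any particular system.

```python
# Minimal symlink-based deploy/rollback sketch. Assumes releases live in
# versioned directories under releases_dir, and the server serves whatever
# the "current" symlink points at. Illustrative only.
import os


def deploy(releases_dir: str, version: str) -> None:
    """Point the 'current' symlink at a new release, remembering the old one."""
    current = os.path.join(releases_dir, "current")
    previous = os.path.join(releases_dir, "previous")
    if os.path.islink(current):
        # Remember where we were, so rollback has somewhere to go back to.
        old_target = os.readlink(current)
        if os.path.lexists(previous):
            os.remove(previous)
        os.symlink(old_target, previous)
        os.remove(current)
    os.symlink(os.path.join(releases_dir, version), current)


def rollback(releases_dir: str) -> None:
    """Swap 'current' back to the previously deployed release."""
    current = os.path.join(releases_dir, "current")
    previous = os.path.join(releases_dir, "previous")
    target = os.readlink(previous)
    os.remove(current)
    os.symlink(target, current)
```

Because swapping a symlink is atomic-ish and takes seconds, rolling back is cheap – which is exactly what makes it a viable risk-mitigation strategy rather than a last resort.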

But often people want to avoid this scenario. They want to avoid rolling back – to avoid the risk of a rollback – totally missing the point that the rollback is your way of managing risk. Instead, you're forced to mitigate the risk up front with bureaucracy.

If you're using rollback as a way of managing risk (and why wouldn't you?), then you'd expect to roll back from time to time. If you're never rolling back, then you're clearly removing all risk earlier in the process. That means you have a great process for removing risk – but could you have less process and still release a product sometime this year?

Get There Quicker

Being able to roll back is about being able to recover from mistakes quickly and reliably. Another way to do that is simply to release fixes quickly. Instead of rolling back and scheduling a fix for sometime later, why not get the fix coded, tested and deployed as quickly as possible?

Some companies rely on being able to release quickly and easily every day. Continuous deployment might not itself improve quality, but it improves your ability to react to problems. You no longer need to spend time before a release trying to catch absolutely everything, because issues can be fixed as soon as they surface. By decreasing the time between revisions – by increasing your velocity – you end up with a higher quality product: you simply fix issues so much faster.

Continuous deployment lets you streamline your process – you don't need quite so many checks and balances, because if something bad happens you can react to it and fix it. Obviously you still need tests to ensure your builds are sound – but it encourages you to automate your checks, rather than relying on humans and manual sign-offs. Instead of introducing process, why not write code to check you've not made the same mistake twice?
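"Write code to check you've not made the same mistake twice" usually means encoding each production incident as an automated regression test that runs on every build. A hedged sketch of the idea – the `parse_price` function and the "empty price crashed the checkout" incident are invented examples, not from any real system:

```python
# Sketch: a past production bug captured as an automated regression test.
# The scenario (an empty price string once crashed checkout) is invented
# for illustration; the point is that the check runs on every build,
# replacing a manual sign-off.

def parse_price(text: str) -> int:
    """Parse a price like '£12.50' into pence."""
    text = text.strip().lstrip("£")
    if not text:
        # The fix: treat a missing price as zero instead of crashing.
        return 0
    pounds, _, pence = text.partition(".")
    return int(pounds) * 100 + int(pence or "0")


def test_empty_price_does_not_crash():
    # Regression test for the (hypothetical) outage: if this mistake
    # ever comes back, the build fails before release, not after.
    assert parse_price("") == 0


def test_normal_price():
    assert parse_price("£12.50") == 1250
```

A suite of such tests is the executable form of the release checklist: every lesson learned becomes a check a machine runs in seconds, rather than a box a human ticks.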

Of course, the real irony in all this is that the thing that often stops you doing continuous deployment is a long and tortuous release process. The release process encapsulates the lessons of all your previous mistakes. But with a lightweight process you could react so much faster – patching within minutes, not days – that you wouldn't need the tortuous process in the first place.

Your process has become its own enemy!

Release hell – update

Holy shit – what happened to last week?

Well, our release finally made it out on Wednesday, one day late. Only caused a small outage. Oops. That's the trouble with a release in the early hours – it might be quiet in the UK at that time, but it's the middle of the day in Australia. Just when is a “quiet time” when your customers are spread across the globe?

I really should write this release up as a case study in how not to deploy software. Everything we could have got wrong, we did: error-prone manual processes, poorly documented byzantine systems, crucial business logic implemented haphazardly. You name it – we fucked it up.

I've gotta hand it to the team: we stuck at it and got the release out in the end. I feel sorry for the poor IS guys who have to do the release, though – a team full of developers bitching and whining at them when they're the only hands and feet able to do anything. That can't be any fun at all, so I think we owe IS a few beers after this week.

Last night

Went out last night with Romi, Tim, Jo, Rosy, Jim & Sam for Tim's birthday. Started off in a really festive bar in Camden – apparently it's normally really nice; we obviously just caught them on an off night. So we left there pretty quickly and headed off to Jongleurs.

The acts were really good: some crazy Canadian dude started things off and a really funny Geordie lad finished up the night – not that I can remember any of their names. Ended up staying on for the disco afterwards. Much drunken dancing ensued. Hehe. Was a good night.

Ended up missing the last train home – for some reason I was certain there was a train at 1:30. Turns out it left at 1. Next train was at 5. So we ended up crashing at Tim’s.

Dragged our sorry arses back to Twyford this morning for bacon sarnies. God’s own hangover cure. So this afternoon I think I’ll sit and veg and watch the F1. Go Lewis!

Release hell

Welcome to “release hell”. When enterprise software gets deployed to production everyone runs for cover.

How we've got where we are and still have such a farcical release process is beyond me. Friday was day 3 of our 4 days in staging. On Tuesday, in theory, we go live. If only we hadn't spent two and a half days trying to get staging working. It takes a team of 15 people 20 hours to get our as-live environment working just like live. Incredible. Imagine how much money we're wasting.

What I don't understand is why, in a company of so many smart over-achievers, nobody has fixed this problem already.

Is it business focus? Are we too busy delivering “business value” to fix things that actually cost us a fortune? That seems impossible – the business case for fixing these things is too obvious.

Is this the limit? Have we just hit some kind of complexity wall? It doesn't seem right to me. Other companies seem able to manage vast, complex systems – the likes of eBay and Amazon deal with far bigger systems and are still in business.

Is it a lack of ownership? No one person or team owns the whole release process. Development manage their part, then work with the IS teams to deploy it. Each side has their own processes they follow to make their lives easier, inadvertently frustrating the other.

Luckily, we've created a new team to tackle this. Perhaps it will work. The general consensus, though, is that nothing will change: just another committee discussing the same old problems we've had for years, coming up with grand new strategies that fail to actually change anything.

How do you change processes in a large software organisation?