Risk free software

Nobody wants to make mistakes, do they? If you can see something’s gonna go wrong, it’s only natural to do what you can to prevent it. If you’ve made a mistake once, what kind of idiot wants to repeat it? But what if the cure is worse than the problem? What if the effort of avoiding mistakes is worse than what you’re preventing?

Preventative Measures

So you’ve found a bug in production that really should have been caught during QA; you’ve had a load-related outage in production; you’ve found a security issue in production. What’s the natural thing to do? Once you’ve fixed the immediate problem, you probably put in place a process to stop similar mistakes happening next time.

Five whys is a great technique to understand the causes and make appropriate changes. But if you find yourself adding more bureaucracy, a sign-off to prevent this happening in future – you’re probably doing it wrong!

Unfortunately this is a natural instinct: in response to finding bugs in production, you introduce a sign-off to confirm that everyone is happy the product is bug-free, whatever that might mean; you introduce a final performance test phase, with a sign-off to confirm production won’t crash under load; you introduce a final security test, with a sign-off to confirm production is secure.

Each step and each reaction to a problem is perfectly logical; each answer is clear, simple and wrong.

Risk Free Software

Let’s be clear: there’s no such thing as risk-free software. You can’t do anything without taking some risk. But what’s easy to overlook is that not doing something is a risk, too.

Not fixing a bug runs the risk that it’s more serious than you thought; more prevalent than you thought; that it could happen to an important customer or someone in the press – with real revenue at risk. You run the risk that it collides with another, as yet unknown, bug – potentially multiplying the pain.

Sometimes not releasing feels like the safest thing to do – but you’re releasing software because you know something is wrong. How can not changing it ever be better?

The Alternative

So what are you gonna do? No business wants to accept risk; you have to mitigate it somehow. The simple, easy and wrong thing to do is to add more process. The braver decision, the right decision, is to make it easy to undo any mistakes.

Any release process, no matter how convoluted, will normally have some kind of rollback – some way of getting back to how things used to be. At its simplest, this is a way of mitigating the risk of making a mistake: if it really is a pretty shit release, you can roll it back. It’s not great, but it gives you a way of recovering when the inevitable happens.

But often people want to avoid this kind of scenario: they want to avoid rolling back, to avoid the risk of a rollback – totally missing the point that the rollback is your way of managing risk. Instead, you’re forced to mitigate the risk up front with bureaucracy.

If you’re using rollback as a way of managing risk (and why wouldn’t you?), then you’d expect to roll back from time to time. If you’re not rolling back, then you’re clearly removing all risk earlier in the process. That means you have a great process for removing risk; but could you have less process and still release product sometime this year?

Get There Quicker

Being able to rollback is about being able to recover from mistakes quickly and reliably. Another way to do that is to just release solutions quickly. Instead of rolling back and scheduling a fix sometime later, why not get the fix coded, tested and deployed as quickly as possible?

Some companies rely on being able to release quickly and easily every day. Continuous deployment might not itself improve quality, but it improves your ability to react to problems. The obvious side-effect is that you can fix issues much faster, so you don’t spend time before a release trying to catch absolutely everything. Instead, by decreasing the time between revisions – by increasing your velocity – you create a higher-quality product: you just fix issues so much faster.

Continuous deployment lets you streamline your process – you don’t need quite so many checks and balances, because if something bad happens you can react to it and fix it. Obviously you need tests to ensure your builds are sound – but it encourages you to automate your checks, rather than relying on humans and manual sign-offs. Instead of introducing process, why not write code to check you haven’t made the same mistake twice?
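
For example, the lesson from a bug you shipped once can be captured as an automated check that runs on every build, instead of yet another sign-off. A minimal sketch, assuming a simple JUnit smoke test (the URL and the original mistake are hypothetical):

import java.net.HttpURLConnection;
import java.net.URL;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class LoginPageSmokeTest {
 @Test
 public void loginPageStillResponds() throws Exception {
   // The mistake we made once (a bad deployment took the login page down)
   // is now checked automatically on every build, not by a manual sign-off
   HttpURLConnection conn = (HttpURLConnection)
       new URL("http://localhost:8080/my-app/login.htm").openConnection();

   assertEquals(200, conn.getResponseCode());
 }
}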

Of course, the real irony in all this is that the thing that often stops you doing continuous deployment is a long and tortuous release process. The release process encapsulates the lessons from all your previous mistakes. But with a lightweight process you could react so much faster – patching within minutes, not days – that you wouldn’t need the tortuous process at all.

Your process has become its own enemy!

Doing agile the traditional way

Doing agile is easy – if you’re working on a greenfield project, with no history and no existing standards and processes to follow. The rest of us get to work on brownfield projects, with company standards that are the antithesis of agile, and people and processes wedded to a past long out of date. Oh, you can do agile in this environment, but it’s hard – because you have to overcome the constraints that meant you weren’t agile in the first place.

Any organisation hoping to “go agile” needs to overcome its constraints. The scrum team hits issues that stop them being properly agile – normally artefacts of the old way of doing things – and raises them to the scrum master and management, hoping to remove them and reach agile nirvana on the other side. Management diligently discuss these constraints; the easiest are quickly removed – simple things like better tools and whiteboards everywhere – but some take a bit longer.

Before too long, you run into company culture: these are the constraints that just won’t go away. Not all constraints are created equal, though – the hardest to remove are often the most vital; the things that stop the company being properly agile. So let’s look at some of the activities and typical constraints a team can encounter.

User Story Workshops

If you’re doing agile, you’ve gotta have user stories. If you have user stories, you’re bound to have something like a user story workshop – where all the stakeholders (or just the dev team, if you’re kidding yourself) get together to agree the basic details of the work that needs to be done.

The easiest trap in the world to fall into is to try and perfect the user stories. Before you know it you’re stuck in analysis paralysis, reviewing and re-reviewing the user stories until everyone is totally happy with them. Every time you discuss them, someone thinks of a new edge case, a new detail to think about – so you add more acceptance criteria, more user stories, more spikes.

Eventually you’ll get moving again, but now you’ve generated a mountain of collateral with the illusion of accuracy. User stories should be a placeholder for a discussion; if you waste time generating that detail up front, you might skip the conversation later – thinking you’ve captured everything already – and miss the really critical details.

Estimates

The worst part comes when producing estimates. The start of a new project or a new team is a difficult time: you’ve got no history to base your estimates on, but management need to know how long you’ll take. With the best of intentions, scrum masters and team leads are often given incentives – a bonus, for example – to produce accurate estimates. There are only two ways to play this: pad estimates mercilessly, knowing you’ll fill that “spare” time easily; or spend more time analysing the problem, as though exhaustive analysis could predict the future. Neither of these outcomes is what the business really wants, and neither can be described as “agile”.

Design

There’s a great conflict between the need for overall architecture and design, and the agile mentality to JFDI – to not spend time producing artefacts that don’t themselves add value. However, you can’t coordinate large development activities without some guiding architecture, some vague notion of where you’re heading.

But the logical conclusion of this train of thought is that you need some architectural design – so let’s write a document describing our ideas and get together to discuss it. Luckily this provides a great forum for all the various stakeholders across the business that don’t grok code to have their say: “we need an EDA”; “I know yours is a small change, but first you must solve some arbitrarily vast problem overnight”; “I’m a potential user of this app and I think it should be green”.

This process is great at producing perfect design documents. Unfortunately, in the real world – where requirements change faster than designs can be written – it’s a completely wasted effort. But it gives everyone a forum for airing their views, and since they don’t trust the development team to meet their requirements any other way, any change to the process is resisted.

The scrum team naturally raise the design review process as a constraint, but once it’s clear it can’t be removed, they start to adapt to it. Because the design can fundamentally change on the basis of one person’s view right up to the last minute (hey! Cassandra sounds cool, let’s use that!), the dev team change their process: “sprints can’t start until the design is signed off”. And because the design can fundamentally change the amount of work to be done, the user stories are never “final” until the design is signed off, and the business can’t get an estimate until the design has been agreed.

Sprinting

If you’re lucky, the one part of the project that feels vaguely scrum-y is the development sprint. The team are relatively free of dependencies on others; they can self-organise around getting the work done; things are tested as they go; the Definition of Done is completed before the next story is started. The team are rightly proud of “being agile”.

Manual Testing

And then the constraints emerge again. If you’re working on a legacy code base, without a decent suite of automated tests, you probably rely on manual testers clicking through your product to some written test script. A grotesque misuse of human beings if ever there was one; but the cost of investing in automation “this time” never seems worth it when the testers can whip through the script so quickly, having run it a million times before.

Then, worst of all possible worlds: because regression testing the application isn’t a case of clicking a button (but of clicking thousands of buttons, one by one, in a very specific order), you have to have a “final QA run” – a chance, once development has stopped, for QA to assure the product meets the required quality goals. If your reliance on manual QA is large, this can be a time-consuming exercise. And then, what happens if QA find a bug? It’s fixed, and you start the whole process all over again… like some gruesome nightmare you can’t wake up from!

The Result

With these constraints at every step of the process, we’re not working how we’d like to. Either we can view them as hindrances to working the way we want to; or we can view them as forcing a new process on us, one we’re not in control of. What we’ve managed to do is invent a new development methodology:

  • First we carefully gather requirements and discuss them ad nauseam
  • Then we carefully design a solution that meets these (annoyingly changing) requirements
  • Then we write code to meet the design (and ignore that the design changes along the way)
  • Finally we test that the code we wrote actually does what we wanted it to (even though that’s changed since we started)

What have we invented? Waterfall. We’ve rediscovered waterfall. Well done! For all your effort and initiatives, you’ve managed to invent a software development methodology discredited decades ago.

But we’ve tarted it up with stand ups, user stories, a burn down chart and some poor schmuck that gets called the scum master – but really, all we’ve done is put some window dressing on our old process and given it a fancy dan name.

If the ultimate goal of introducing agile to your company is to make the business more efficient, but you’re still drowning in constraints, then introducing agile has failed: you’ve still got the same constraints, and you’re still doing waterfall – no matter what you call it, a rose by any other name still smells of shit.

Testing asynchronous applications with WebDriver

Note: this article is now out of date, please see the more recent version covering the same topic: Testing asynchronous applications with WebDriverWait.

WebDriver is a great framework for automated testing of web applications. It learns the lessons of frameworks like Selenium and provides a clean, clear API for testing applications. However, testing Ajax applications presents challenges to any test framework. How does WebDriver help us? First, some background…

Drivers

WebDriver comes with a number of drivers. Each of these is tuned to drive a specific browser (IE, Firefox, Chrome) using different technology depending on the browser. This allows the driver to operate in a manner that suits the browser, while keeping a consistent API so that test code doesn’t need to know which type of driver/browser is being used.
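
In practice, this means swapping browsers is a one-line change in the test code. A minimal sketch (the URL is just a placeholder):

// Swap the concrete driver without touching the rest of the test code
WebDriver driver = new FirefoxDriver();
// WebDriver driver = new InternetExplorerDriver();
// WebDriver driver = new ChromeDriver();

driver.get("http://www.example.com/");
System.out.println(driver.getTitle());
driver.quit();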

Page Pattern

The page pattern provides a great way to separate test implementation (how to drive the page) from test specification (the logical actions we want to complete – e.g. enter data in a form, navigate to another page etc). This makes the intent of tests clear:

homePage.setUsername("mytestuser");
homePage.setPassword("password");
homePage.clickLoginButton();
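
Behind the scenes, a page class backing this snippet might look something like the following sketch (the element locators here are assumptions):

public class HomePage {
 private final WebDriver driver;

 public HomePage( WebDriver driver ) {
   this.driver = driver;
 }

 public void setUsername( String username ) {
   // Type the username into the login form
   driver.findElement(By.name("username")).sendKeys(username);
 }

 public void setPassword( String password ) {
   driver.findElement(By.name("password")).sendKeys(password);
 }

 public void clickLoginButton() {
   driver.findElement(By.id("login")).click();
 }
}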

Blocking Calls

WebDriver’s calls are also blocking – calls to submit(), for example, wait for the form to be submitted and a response returned to the browser. This means we don’t need to do anything special to wait for the next page to load:

WebDriver driver = new FirefoxDriver();
driver.get("http://www.google.com/");
WebElement e = driver.findElement(By.name("q"));
e.sendKeys("webdriver");
e.submit();
assertEquals("webdriver - Google Search",driver.getTitle());

Asynchronous Calls

The trouble arises if you have an asynchronous application. Because WebDriver doesn’t block when you make asynchronous calls via JavaScript (why would it?), how do you know when something is “ready” to test? If you have a link that, when clicked, does some Ajax magic in the background – how do you know when the magic has stopped and you can start verifying that the right thing happened?

Naive Solution

The simplest solution is to use Thread.sleep(…). Normally, if you sleep for a bit, by the time the thread wakes up the JavaScript will have completed. This tends to work, for the most part.
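
Something like this – a sketch of the naive approach (not a recommendation), using the TestPage class that appears later in this post:

@Test
public void testLoadAsyncContentNaively() throws InterruptedException {
  TestPage page = new TestPage( new FirefoxDriver() );
  page.clickLoadAsyncContent();

  // Cross our fingers and hope one second is enough for the ajax call to finish
  Thread.sleep(1000);

  assertEquals("This content is loaded asynchronously.", page.getAsyncContent());
}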

The problem is that once you have hundreds of these tests, you suddenly start to find them failing, at random, because the delay isn’t quite enough. When the build runs on your build server, sometimes the load is higher, or the phase of the moon is wrong, whatever – the end result is that your delay isn’t quite enough, you start assert()ing before your Ajax call has completed, and your test fails.

So, you start increasing the timeout. From 1 second. To 2 seconds. To 5 seconds. You’re now on a slippery slope of trading off how long the tests take to run against how often they pass. This is a crap tradeoff. You want very fast tests that always pass, not semi-fast tests that sometimes pass.

What’s to be done? Here are two techniques that make testing asynchronous applications easier.

RenderedWebElement

All the drivers (except HtmlUnit, which isn’t really driving a browser) actually generate RenderedWebElement instances, not just WebElement instances. RenderedWebElement has a few interesting methods on it that can make testing your application easier. For example, the isDisplayed() method saves you having to query the CSS style to work out whether an element is actually shown.

If you have some Ajax magic that, in its final step, makes a <div> visible, then you can use isDisplayed() to check whether the asynchronous call has completed yet.

Note: my examples here use Dojo, but the same technique can be used whether you’re using jQuery or any other asynchronous toolkit.

First the HTML page – this simply has a link and an (initially hidden) <div>. When the link is clicked it triggers an asynchronous call to load a new HTML page; the content of this HTML page is inserted as the content of the div and the div is made visible.

<html>
<head>
 <title>test</title>

 <script type="text/javascript" src="js/dojo/dojo.js" djConfig=" isDebug:false, parseOnLoad:true"></script>

 <script src="js/asyncError.js" type="text/javascript"></script>

 <script>
   /*
    * Load a HTML page asynchronously and
    * display contents in hidden div
    */
   function loadAsyncContent() {
     var xhrArgs = {
       url: "asyncContent.htm",
       handleAs: "text",
       load: function(data) {
         // Update DIV with content we loaded
         dojo.byId("asyncContent").innerHTML = data;

         // Make our DIV visible
         dojo.byId("asyncContent").style.display = 'block';
       }
     };

     // Call the asynchronous xhrGet
     var deferred = dojo.xhrGet(xhrArgs);
   }
 </script>
</head>
<body>
 <div id="asyncContent" style="display: none;"></div>
 <a href="#" id="loadAsyncContent" onClick="loadAsyncContent();">Click to load async content</a>
 <br/>
</body>
</html>

Now the integration test. Our test simply loads the page, clicks the link and waits for the asynchronous call to complete. Once the call is complete, we check that the content of the <div> is what we expect.

@RunWith(SpringJUnit4ClassRunner.class)
public class TestIntegrationTest {
 @Test
 public void testLoadAsyncContent() {
   // Create the page
   TestPage page = new TestPage( new FirefoxDriver() );

   // Click the link
   page.clickLoadAsyncContent();
   page.waitForAsyncContent();

   // Confirm content is loaded
   assertEquals("This content is loaded asynchronously.",page.getAsyncContent());
 }
}

Now the page class, used by the integration test. This interacts with the HTML elements exposed by the driver; the waitForAsyncContent method regularly polls the <div> to check whether it’s been made visible yet.

public class TestPage  {

 private WebDriver driver;

 public TestPage( WebDriver driver ) {
   this.driver = driver;

   // Load our page
   driver.get("http://localhost:8080/test-app/test.htm");
 }

 public void clickLoadAsyncContent() {
   driver.findElement(By.id("loadAsyncContent")).click();
 }

 public void waitForAsyncContent() {
   // Get a RenderedWebElement corresponding to our div
   RenderedWebElement e = (RenderedWebElement) driver.findElement(By.id("asyncContent"));

   // Up to 10 times
   for( int i=0; i<10; i++ ) {
     // Check whether our element is visible yet
     if( e.isDisplayed() ) {
       return;
     }

     try {
       Thread.sleep(1000);
     } catch( InterruptedException ex ) {
       // Try again
     }
   }
 }

 public String getAsyncContent() {
   return driver.findElement(By.id("asyncContent")).getText();
 }
}

By doing this, we don’t need to code arbitrary delays into our test; we can cope with server calls that potentially take a little while to execute; and we can ensure that our test will always pass (at least, tests won’t fail because of timing problems!).
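
As the note at the top of this article says, newer versions of WebDriver wrap this polling pattern up for you as WebDriverWait. A minimal sketch, assuming a recent Selenium release:

import java.time.Duration;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

// Poll for up to 10 seconds until the div becomes visible,
// instead of hand-rolling the sleep-and-retry loop
new WebDriverWait(driver, Duration.ofSeconds(10))
   .until(ExpectedConditions.visibilityOfElementLocated(By.id("asyncContent")));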

JavascriptExecutor

Another trick is to realise that the WebDriver instances implement JavascriptExecutor. This interface can be used to execute arbitrary JavaScript within the context of the browser (a trick Selenium finds much easier). This allows us to use the state of JavaScript variables to control the test. For example, we can have a variable that is populated once some asynchronous action has completed; this can be the trigger for the test to continue.

First, we add some JavaScript to our example page above. Much as before, it simply loads a new page and updates the <div> with the content. This time, rather than showing the div, it simply sets a variable – “working” – because the div may already be visible.

 var working = false;

 function updateAsyncContent() {
   working = true;

   var xhrArgs = {
     url: "asyncContentUpdated.htm",
     handleAs: "text",
     load: function(data) {
       // Update our div with our new content
       dojo.byId("asyncContent").innerHTML = data;

       // Set the variable to indicate we're done
       working = false;
     }
   };

   // Call the asynchronous xhrGet
   var deferred = dojo.xhrGet(xhrArgs);
 }

Now the extra test case we add. The key line here is where we wait for the “working” variable to be false:

 @Test
 public void testUpdateAsyncContent() {
   // Create the page
   TestPage page = new TestPage( getDriver() );

   // Click the link
   page.clickLoadAsyncContent();
   page.waitForAsyncContent();

   // Now update the content
   page.clickUpdateAsyncContent();
   page.waitForNotWorking();

   // Confirm content is loaded
   assertEquals("This is the updated asynchronously loaded content.",page.getAsyncContent());
 }

Finally, our updated page class. The key line here is where we execute some JavaScript to determine the value of the working variable:

 public void clickUpdateAsyncContent() {
   driver.findElement(By.id("updateAsyncContent")).click();
 }

 public void waitForNotWorking() {
   // Get a JavascriptExecutor
   JavascriptExecutor exec = (JavascriptExecutor) driver;

    // Up to 10 times, or until the working flag is cleared
   for( int i=0; i<10; i++ ) {

     if( ! (Boolean) exec.executeScript("return working") ) {
       return;
     }

     try {
       Thread.sleep(1000);
     } catch( InterruptedException ex ) {
       // Try again
     }
   }
 }

Note how WebDriver handles type conversions for us here. In this example the JavaScript is relatively trivial, but we could execute arbitrarily complex JavaScript in the same way.
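
For instance, return values come back as sensible Java types: booleans as Boolean, numbers as Long or Double, strings as String. A small illustrative sketch:

JavascriptExecutor exec = (JavascriptExecutor) driver;

// Numbers come back as Long, booleans as Boolean, strings as String
Long linkCount = (Long) exec.executeScript("return document.links.length");
String title = (String) exec.executeScript("return document.title");
Boolean dojoLoaded = (Boolean) exec.executeScript("return typeof dojo !== 'undefined'");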

By doing this, we can flag in JavaScript when some state has been reached that allows the test to progress. We’ve made our code more testable, we’ve eliminated arbitrary delays, and our tests run faster because they react as soon as asynchronous calls complete.

Have you ever had problems testing asynchronous applications? What approaches have you used?

The danger of deadlines

Deadlines are a good thing, right? Everyone needs deadlines. Don’t they?

There are three constants in life: death, taxes and software slipping. I’m doing my best to avoid the first two, but the latter seems here to stay. Software engineering, despite all its advances, has never solved the fundamental question: how long will it take? The advice I got from Nik, the development manager, on my first day at work after graduation – “think of a number, then double it” – is still a good rule of thumb.

There are lots of reasons that software slips: engineers are optimistic people, who assume this time round will be better, despite years of evidence to the contrary; the requirements are never really clear, no matter what the product manager tells you; there will always be one innocent little question, 90% of the way through, that suddenly balloons into a metric crapload of work; and everyone’s favourite: what idiot did a half-assed job here, making me finish up his mess – why I oughta!

What’s the consequence of software slipping?

Perhaps you work for one of those companies that don’t care, “you’re done when you’re done; we’ll release when you’re ready”. If so, get outta here, we don’t like your kind round here!

For the rest of us, something has to happen when the schedule slips. Somebody, somewhere has been promised the release by a certain date. All your caveats have been ignored – all that person hears is “It’ll be live by July”. Whether it’s that big external client, “the business”, or your boss – somebody, somewhere has an expectation mismatch with reality.

What happens?

I don’t have to tell you what happens – project managers, development leads, your boss – suddenly everybody cares about scope/time/resources and has designs on your weekend.

One way or another, projects normally get out. Scope is cut, resources are pointlessly thrown at the project, you lose your weekends. Eventually the project gets out and everyone lives to fight another day. But spare a thought for the real victim here: quality.

Quality?

And you thought it was all about you? Yes – the real victim in all this is the quality of your software.

In a traditional waterfall project, normally by the time you realise you’re late it’s too damned late to do anything about it, and you’re screwed. So you do the only thing you can do – you cut time from the end of the project; i.e. you persuade QA they don’t need to do as much testing as they estimated. The result? Bugs are missed, crap code gets released to customers – quality has been in a hit-and-run accident.

In an iterative project you might be a bit more lucky: it might just be one iteration that’s screwed. So you cut from the end of the iteration (there goes the QA again) and try to cut scope from the next iteration to “pull it back”. Unfortunately, all those bugs from the previous iteration are still there, waiting to derail QA and add unexpected development costs later in the project. And before you know it, you’re back at the scene of the accident again.

At this point the agilistas are all laughing “but that never happens on an agile project!”.

Doesn’t it?

Agile’s different

It’s true: scrum, kanban and the like are much better at forecasting ahead of time when you’re gonna miss some arbitrary date. You know your velocity, you know how much work is left. You can figure out how many features you can deliver in the time available and cut entire features from the end of the project, without sacrificing QA. Perfect!

But the really insidious problem on agile projects, is that the project manager, scrum master, development lead, even the smart developers all start burndown engineering. Sam the scrum master says,

We’re a bit behind on the burndown chart, Mike, so if you can just get this story finished up today we’ll score the points and get right back on track!

Sounds harmless enough, doesn’t it?

Burndown engineering

Burndown engineering is a terrible crime. By continually trying to engineer each story / feature to adhere to some average you’ve dreamt up, quality dies a thousand small deaths.

I was gonna refactor this code, but I guess I can come back and do it in the next story

Sure, Sam, I was gonna write some more thorough tests; but I can do that in the next sprint  – this story’s done!

The QA guys have been over it already; we really ought to automate some of these tests but we’ve not had time. Shall we add a story to a later sprint to pick this up?

And suddenly, with the best of intentions, Sam has forced technical debt into the project, just in the interests of maintaining his pretty graph. As a one-off, this is fine. The problem is, this happens on every story, for the whole duration of the project. In the end, Sam doesn’t have to say a word – the team are self-policing. They’re trying to pad their estimates to get time to finish everything they need to, but development’s taking longer than expected because everyone’s coming across technical debt from previously half-finished stories.

What’s to be done?

Quality should be a parameter controlled by the development team. Some companies need a high quality bar; others can get away with a much lower one – it depends what kind of industry you’re in. But the development team should set a standard for quality and then stick to it.

Velocity is an emergent property of the team. If you hold quality and resources constant, the velocity of the team is something you measure not something you change. Sure, change the scope or change the schedule (or change the number of people if you believe in man months, mythical or otherwise) – but forcing each unit of work to take a certain amount of time forces quality to be the variable that changes.

But you still need deadlines

Sure, somebody, somewhere needs to know when the release will be ready. But the product owner should be the one that knows the release schedule. Someone whose only lever is scope. If they want it faster, their only option is to reduce scope.

The scrum master or development lead needs to ensure that development continues apace, but that quality is held constant. If the product owner wants it faster, their only choice is scope. That leaves developers free to focus on what they need to get done, without worrying about whether they’re maintaining their velocity this week.

Then we can all stop making crap tradeoffs about whether cutting this story short will make the project deliver faster overall (of course not!); or whether we can justify removing this technical debt to make things quicker (of course we can!).

Ever found yourself on the receiving end of this – being pushed to cut corners to maintain some arbitrary deadline? Ever found yourself trying to justify Doing The Right Thing that you know will take longer now, but make things quicker overall? Let’s hear your war stories!

TomTom Go and T-Mobile

Finally picked up my new TomTom Go 730T yesterday. And immediately failed to get the mobile traffic service set up. After much searching on the intarwebs and calling T-Mobile, the problem is solved! Yay!

The problem seems to be that TomTom’s preset for T-Mobile (UK) is just plain wrong. I now have this working fine with my HTC Touch Diamond – but any phone with Bluetooth and GPRS that is able to browse the internets should work fine.

Solution:

  • Navigate to “TomTom Traffic” from the Main Menu
  • When it asks to set up the wireless data service, click Yes
  • Rather than select a phone I went for “Other”
  • Select country (UK)
  • Select operator (T-Mobile Contract)
  • It will now faff about trying to connect. At about 60% I see activity on the phone indicating it’s trying to use GPRS; it then gets to 65% and sits there for an eternity
  • Eventually it says it failed; the options are “Other” or “Manual” – select Manual
  • Username should be correct: user
  • Password is wrong, it should be: pass
  • DHCP should be automatic
  • DNS should be automatic
  • Number to dial is wrong, it should be *99#
  • Dial string should be correct: at+cgdcont=1,"IP","general.t-mobile.uk"
  • Ok that bad boy and with any luck it will manage to connect


If you’re still having problems, double check you can browse the internet from your phone; double check bluetooth is working between phone & TomTom. The above worked for me, your mileage may vary.