Longevity of Source Code

Take a look at the code you work in day-to-day. How long has it been there? How old is it? Six months old? A year? Maybe five years old? Ten? Twenty?! How much of the code is old? Less than 10%? Half? Or as much as 90%? Curious to know the answers to these questions, I’ve been investigating how long code sticks around.

Software archeology

Work for any company that’s been around for more than a couple of years and there will be source code that’s been there for a while. Writing software in an environment like this is often an exercise in software archeology – digging down into the application is like digging down into an old city, slowly uncovering the past layer by layer.

Once you get past the shiny new containerized micro-services, you start delving into the recent past: perhaps the remnants of the company’s first foray into a Service Oriented Architecture; now a collection of monolithic services with a complex tangle of business logic, all tied together with so much Spring. Dig deeper still and we get back to the EJB era; some long-forgotten beans still clinging on to life, for lack of any developer’s appetite to understand them again. Down here is where the skeletons are.

If it ain’t broke, don’t fix it

What’s so bad about old code? It’s fulfilling some important purpose, no doubt. At least, some of it probably is.

If you look at code you wrote a year ago and can’t see anything to change, you haven’t learnt a thing in the last year

We’re always learning more: a better understanding of the domain, a better understanding of how our solution models the domain, new architectural styles, new tools, new approaches, new standards and new ideas. It is inevitable that the code you wrote a year ago could be improved somehow. But how much of it have you gone back and improved recently?

The trouble with old code is that it gets increasingly hard to change. What would happen if a change in business requirements led you all the way down to the Roman-era sewers that are the EJBs? Would you implement the change the way it would have been done a decade ago? Or would you spend time trying to extract the parts that need to change? Perhaps building another shiny, new containerized micro-service along the way? That change isn’t going to be cheap though.

And this is the problem: paying back this “technical debt” is the right thing to do, but changing this ancient code will be slower than changing the stuff you wrote last week or last month. The more ancient code you have, the slower it is to make changes and the slower you can develop new features. The worst part of maintaining a long-running code base isn’t just paying back the debt from the things we know we did wrong; it’s the debt from things that were done right (at the time), but only now seem wrong.

How old is our code?

I’ve been looking through a variety of source code, some commercial and some open source, across a variety of languages (Java, C#, Ruby). Generally, most code bases seem to follow a pretty similar pattern:

About 70% of the lines of code you wrote today will still be in HEAD, unchanged, in 12 months’ time

Perhaps unsurprisingly, code changes most often in the first couple of months after being written. After that, it seems to enter a maintenance mode, where code changes relatively rarely.


I found this pretty surprising: after a year, around 75% of the code I’ve written is still there. Imagine how much better I understand the problem today. Imagine the since-forgotten design ideas, the changed architectural vision, the new tools and libraries it could be refactored to use today. Imagine how much better every single one of those lines could be. And yet, even in a code base where we’re constantly working to pay back technical debt, we’re barely making a dent in how old the code is.

The how

How did I do this analysis? Thanks to the magic power of git, it is surprisingly easy: I can run a recursive git blame across the entire repo. This lists, for each line currently at HEAD, the commit that introduced it, who made it, and when. With a bit of shell-fu we can extract a count by day or month:

git ls-tree -r -z --name-only HEAD -- | xargs -0 -n1 git blame -f HEAD | sed -e 's/^.* \([0-9]\{4\}-[0-9]\{2\}\)-[0-9]\{2\} .*$/\1/g' | sort | uniq -c
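Each row pairs a count of surviving lines with the month those lines were last touched. With entirely made-up counts, purely to show the shape of the output, it looks something like:

    310 2013-11
     94 2014-02
   5721 2015-06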

So we get a nice table of lines of code last touched by month, as of today. But I can just as easily go back in history, e.g. to the start of 2015:

git checkout `git rev-list -n 1 --before="2015-01-01 00:00:00" master`

I can then re-run the recursive git blame. By comparing the counts of lines last touched in each month, I can see how much of the code written before 2015 still exists today. With more detailed analysis, I can track how the number of lines last touched in a given month shrinks over time, to see how quickly (or not!) code decays.
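To automate the comparison, the same pipeline can be wrapped in a small script and pointed at a series of historical revisions. This is only a sketch (blame-age.sh is my name for it, not standard tooling); it leans on the fact that git blame accepts a revision argument, so each snapshot can be analysed without checking anything out:

#!/bin/sh
# blame-age.sh: count surviving lines by the month they were last
# touched, as of the given revision (defaults to HEAD).
REV=${1:-HEAD}
git ls-tree -r -z --name-only "$REV" -- \
  | xargs -0 -n1 git blame -f "$REV" -- \
  | sed -e 's/^.* \([0-9]\{4\}-[0-9]\{2\}\)-[0-9]\{2\} .*$/\1/g' \
  | sort | uniq -c

Looping over year boundaries then produces one table per snapshot:

for year in 2013 2014 2015 2016; do
  rev=$(git rev-list -n 1 --before="${year}-01-01 00:00:00" master)
  [ -n "$rev" ] || continue  # skip years before the repository existed
  echo "== start of ${year} =="
  ./blame-age.sh "$rev"
done

Because the revision is passed to git blame directly, this also avoids the detached-HEAD state that the git checkout approach leaves behind (a plain git checkout master gets you back to the branch afterwards).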

Conclusion

Code is written to serve some purpose, to deliver some business value. But it quickly becomes a liability. As this code ages and rots it becomes harder and harder to change. From the analysis above it is easy to see why a code base that has been around for a decade is mostly archaic: with so little of the old code being changed each year, it just continues to hang around, all the while we’re piling up new legacy on top, most of which will still be here next year.

What are we to do about it? It seems to be a natural law of software that over time it only accumulates. From my own experience, even concerted refactoring efforts make little impact. Do we just accept that changes become progressively harder as source code ages? Or do we need to find a way to encourage software to have a shorter half-life, to be re-written sooner?