So far I've described the components that form the Revenues part of the PILS formula. In this and subsequent posts I'll describe the components that form the MTB Costs side, starting with the Cost of Production Bugs (CPB).
Measuring the cost of fixing production bugs is important for two reasons: CPB reduces the overall profitability of live IT systems, and it is an indicator of the quality of IT deliverables.
For these reasons, our goal as IT leaders should be to reduce CPB. How could we go about it? In order to answer this question we first need to identify the causes that lead to a high number of production bugs.
Production bugs are a direct function of the quality of production deliveries. So we need to understand why IT generates poor quality products. Some of the most common reasons are:
- Focus on the wrong targets. Too many teams focus only on delivering to production without paying enough attention to the quality of what's delivered. More specifically, speed over quality is often the path chosen by many IT managers and, subsequently, developers.
- Too much work in progress. The more functionality is delivered into production at any one time, the higher the chances of introducing bugs. Kanban practitioners widely advocate this too: one of the key actions in Kanban is to reduce Work In Progress (WIP).
- Lack of (automated) testing. The lack of testing is not only a symptom of the lack of a safety net when applying changes to our systems, but also of a lack of good design. If we don't secure our IT systems with an automated test suite, we increase the chances of introducing new bugs every time we add or change a feature.
- Inadequate IT methodology for requirements gathering. This is partially related to too much WIP. If we try to gather all requirements up-front because, say, we're using a pre-emptive methodology such as Waterfall, the risk of misunderstanding the business requirements rises, and with it the risk of introducing production bugs in the form of functionality gaps. This is also known as the problem of Early Commitment: because a project needs a sign-off to move from one SDLC phase to the next, pre-emptive methodologies ask stakeholders at various stages of the SDLC to commit early.
- Lack of development best practices. A wake-up call mainly introduced by XP, best development practices aim at delivering quality products. Amongst them we find Test Driven Development (TDD), Clean Code, Continuous Delivery, Continuous Integration environments, the use of Source Code Management (SCM) tools and DevOps. Sometimes developers are simply not aware of best development practices and although this lack of knowledge doesn't automatically introduce production bugs, the use of best development practices is widely recognised as one of the main tools to increase the quality of production deliveries.
Before identifying how we, as IT leaders, can help increase the quality of production deliveries and therefore reduce production bugs, I'd like to take a brief detour and state what for many will be obvious. Why are production bugs costly?
The following graph might illustrate why:
Numerous studies have shown how the cost of defects increases exponentially as we move along the timeline of the Software Development Life Cycle (SDLC).
A defect found early in the development lifecycle is significantly cheaper to deal with than a bug found in production. The knowledge surrounding the issue is fresh in the mind of those who developed the functionality and, if found before hitting production, the fix doesn't need to go through a production release, which usually involves considerable overhead and related costs.
- If a defect is found during development, there aren't any additional infrastructural costs (project ceremony costs) involved in fixing it, other than the time required to write a failing test and the subsequent fix.
- If a defect is found after a product has been deployed to production, the knowledge of that product is no longer fresh in the developer's mind. Depending on the code quality, finding the root cause might be quick or might take a significant amount of time. However, even if it is found quickly, the ceremony associated with setting up the environment for a fix takes significant time and therefore costs money. Typically, when fixing a production bug, the development environment needs to be set up, the fix needs to be developed, then deployed to QA for QA sign-off and to UAT for business sign-off, and finally deployed to production (with all the bureaucracy that this requires).
Production bugs also carry hidden costs: in organisations without a team dedicated to production bug fixes, someone has to stop working on deliverables that are valuable to the business in order to fix malfunctions. Where a dedicated team is available (what in my book on <ALT+F> I describe as the Maintain The Business, or MTB, team), the IT organisation is paying development and staff costs to fix what should have worked in the first place.
Because production bugs are costly, organisations often adopt workarounds whenever possible.
Workarounds don't remove the MTB costs associated with production bugs; they defer them indefinitely, and every time a workaround is applied it adds to a continuous cash outflow.
Let's think for a moment about what happens with workarounds. When an issue occurs in production, a user typically flags it with production support. Depending on the maturity and size of the organisation, this triggers a whole series of activities. In small organisations, it may be a phone call to the IT manager's BlackBerry; in enterprise organisations, an incident might be raised through an electronic system and flash messages sent to the inboxes and BlackBerries of various interested stakeholders. The people in first line support, who are probably the first port of call, will either rely on memory to recall whether this is a recurring issue or, in the best of cases, scan their knowledge base to check whether this problem has occurred before.
In the best case scenario, they'll have a procedure to follow for implementing the workaround; in my experience, this typically consists of raising an incident, running some SQL scripts in the UAT environment to simulate the incident, checking whether the fix worked, and finally applying it to production, only being 1,000 times more careful than in UAT as this is production after all. Eventually the issue is fixed, the user notified, the incident closed and business is back to normal... until the next, identical production issue occurs.
In the worst case scenario, nobody remembers seeing this problem before and there's no knowledge base, therefore 2nd or 3rd line support (typically development) will need to jump out of bed, connect to the office and investigate the problem. Depending on the type of organisation, the pressure can be as low as "Don't worry, we can fix this tomorrow morning" or as high as "Fix the damn thing, we're losing money!". One might argue that in the latter case the business would probably have opted for a permanent fix. True, but that's not always the case. Once the poor unfortunate 3rd line support developer finds the problem after a few hours of debugging, they notice a line of code, buried deep in some nested function, with a small comment on it: "//This is a known bug - the business chose a workaround. Run the SQL script documented at http://ourbizwiki.com/workarounds/wknds-171.htm", at which point they cry both tears of joy because there's a solution to the problem, and tears of rage as they could have slept a few hours more.
Let's analyse for a moment what happens in both the best and worst case scenarios: for a recurring issue, a few people had to use their time to fix it, maybe some people had to jump out of bed, maybe the company lost some money. The obvious question is: wasn't this avoidable? The obvious answer is yes, it was; it just needed a permanent fix.
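To make the trade-off concrete, here is a minimal sketch of how one might estimate the point at which a recurring workaround has cost as much as a permanent fix. The function names and all figures are hypothetical, chosen purely for illustration, and the model deliberately ignores lost revenue and out-of-hours premiums, which would only bring the break-even point forward.

```python
import math


def workaround_cost(occurrences, hours_per_occurrence, hourly_rate):
    """Cumulative cost of applying the same workaround `occurrences` times."""
    return occurrences * hours_per_occurrence * hourly_rate


def break_even_occurrences(fix_cost, hours_per_occurrence, hourly_rate):
    """Smallest number of occurrences at which the cumulative workaround
    cost meets or exceeds the one-off cost of a permanent fix."""
    cost_per_occurrence = hours_per_occurrence * hourly_rate
    return math.ceil(fix_cost / cost_per_occurrence)


# Hypothetical figures: a permanent fix costing 4,000, and a workaround that
# takes 3 hours of support time at a rate of 80 per hour each time it is run.
n = break_even_occurrences(fix_cost=4000, hours_per_occurrence=3, hourly_rate=80)
print(n)                           # 17
print(workaround_cost(n, 3, 80))   # 4080
```

With these assumed numbers, the seventeenth occurrence of the incident is the one at which the workaround has cost more than the permanent fix would have.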
OK, so far we've ascertained that production bugs are costly and workarounds are ineffective from a cost perspective. Production bugs also represent a cost in terms of social capital. Neither side of the fence, i.e. the business and the developers, will be happy in an organisation experiencing a high number of production bugs. People will eventually get tired and leave.
For all the reasons above, one of our priorities as IT leaders should be to increase the quality of what our IT teams deliver, by choosing an IT methodology that enhances the ability to delay commitment and by placing the business at the centre of the process. So, how do we go about it?
In my book I suggest a possible approach that consists of triggering a transformation strategy within the organisation. Such transformation operates on two levels: at the operational level, i.e. IT, we can adopt ScruXBan, an IT methodology which combines Scrum, XP and Kanban. I describe this methodology in detail in two of my posts (Google ScruXBan).
At the organisational level, we need to lead a cultural shift to educate the business and IT to work in Agile and Lean environments.
If we apply the strategy right, we'll then solve all the problems described above:
- Focus on the wrong targets. ScruXBan promotes the focus on quality as a pre-requisite for speed.
- Too much work in progress. The Kanban side of ScruXBan leads to a reduction of WIP.
- Lack of (automated) testing. The XP side of ScruXBan introduces development best practices such as TDD and the importance of an automated test suite as both a good API design tool and a safety net for refactoring and exploring activities.
- Inadequate IT methodology for requirements gathering. ScruXBan is the combination of Agile (Scrum, XP) and Lean (Kanban) methodologies, which are known for adopting an Iterative and Incremental Delivery (IID) approach. This ultimately boils down to gathering in detail, and delivering, only the highest priority requirements in a continuous delivery cycle. ScruXBan eliminates gates, thus freeing stakeholders at all levels of the SDLC from the problem of Early Commitment.
The <ALT+F> framework suggests a simple template (available here) to record CPB.
The template is just a guideline. Ideally, you would have access to a tool that extracts the data automatically. The key ideas when recording CPB are to keep track of the hours, location and category of each production bug.
Hours and locations (i.e. on/off shore) allow us to keep track of costs. Categories allow us to identify the legacy systems currently in the worst shape and how much we're spending on them, eventually allowing us to provide the business with a business case for a long term solution.
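As an illustration of how such records could be turned into costs, here is a minimal sketch in Python. It is not the <ALT+F> template itself: the field names, the hourly rates and the sample bugs are all assumptions made for the example, and it treats the category as the system a bug belongs to.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical hourly rates per location; replace with your own figures.
HOURLY_RATE = {"onshore": 80.0, "offshore": 35.0}


@dataclass(frozen=True)
class BugRecord:
    bug_id: str
    category: str   # assumed here to be the system the bug belongs to
    location: str   # "onshore" or "offshore"
    hours: float    # effort spent fixing the bug


def bug_cost(record):
    """Cost of a single production bug: hours multiplied by the location rate."""
    return record.hours * HOURLY_RATE[record.location]


def cpb_by_category(records):
    """Total cost of production bugs, grouped by category."""
    totals = defaultdict(float)
    for record in records:
        totals[record.category] += bug_cost(record)
    return dict(totals)


def total_cpb(records):
    """Overall CPB across all recorded bugs."""
    return sum(bug_cost(record) for record in records)


# Made-up sample data for illustration only.
records = [
    BugRecord("BUG-1", "billing", "onshore", 6.0),
    BugRecord("BUG-2", "billing", "offshore", 10.0),
    BugRecord("BUG-3", "reporting", "offshore", 4.0),
]

print(cpb_by_category(records))  # {'billing': 830.0, 'reporting': 140.0}
print(total_cpb(records))        # 970.0
```

Sorting the per-category totals in descending order would give a first cut of which systems deserve a business case for a long term solution.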