Over at Management Matters, guest blogger Steve Burrows writes of high-profile systems failures at Tesco and Barclays in the UK:
These instances, two major private sector failures of customer-facing IT in a week, show us not only that the private sector is not immune to IT failure, but that our biggest corporates with effectively unlimited IT resource working to their own objectives and timescales still don't get IT governance. In both cases one has to ask what went wrong? Where was the testing? Who oversaw it? Who authorised the go live decisions? It seems the private sector still has some lessons to learn about delivering major IT projects successfully, hopefully it will do so more quickly and with less pain than the public sector.
Steve, I don't know about you, but I am yet to work in any large organisation where the problem was a lack of governance. Systems failures don't happen because of governance, or the lack of it. They happen because large systems which connect to other large systems are simply too complex for anyone to be sure of what's going to happen when you turn them on.
Yes, I know that the answer to this is testing. But the real problem is this: getting certainty that things will work out the way they are supposed to requires that you duplicate your production environment down to the very last server, the very last item of data, and find a way to realistically replicate actual user behaviour at the scale the system is supposed to handle. To get 100% certainty, the test environment has to duplicate real life in every detail, and that costs. I don't know any large processing operation that can afford to do this, actually. Most can barely afford to keep their production systems going.
Since you can't be certain what's going to happen when you turn on a new system, management is expected to make a go-live decision with the incomplete information they have to hand. Sometimes they know enough that the risk of something going wrong is pretty low. Other times they don't, but because of internal pressures they take the chance anyway. Great IT management isn't scared to take a risk once they have weighed all the benefits against all the things that can go wrong.
Great IT managers also make mistakes in their risk assessment from time to time.
The role of governance is to provide the information that managers need in order to make their risk decision about a system. Like all information collection activities, it is subject to declining marginal returns. The more you have, the more it costs for incrementally less surety. There is a point at which the additional risk you eliminate by adding a new process or a new reporting regime is more expensive than the downside it is supposed to mitigate.
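To make the declining-returns argument concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the 20% starting failure probability, the idea that each extra control halves the remaining risk, and the names residual_risk and worthwhile_controls. It is not a model of any real governance regime.

```python
def residual_risk(controls: int, base_probability: float = 0.20, decay: float = 0.5) -> float:
    """Hypothetical model: each extra governance control removes a fixed
    fraction of whatever failure probability is still left."""
    return base_probability * (decay ** controls)


def worthwhile_controls(failure_cost: float, cost_per_control: float, max_controls: int = 20) -> int:
    """Count how many controls pay for themselves.

    A control is worth adding only while the expected loss it removes
    is greater than what the control itself costs.
    """
    n = 0
    while n < max_controls:
        expected_saving = (residual_risk(n) - residual_risk(n + 1)) * failure_cost
        if expected_saving <= cost_per_control:
            break  # the next control costs more than the downside it mitigates
        n += 1
    return n


if __name__ == "__main__":
    # Assumed figures: a failure costs 1,000,000 and each control costs 10,000.
    print(worthwhile_controls(failure_cost=1_000_000, cost_per_control=10_000))
```

With these assumed figures the sketch stops at four controls: a fifth would remove only about 6,250 of expected loss while costing 10,000.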
But there is another dimension to the go-live decision, and that is the non-IT pressure that managers are subjected to. Sometimes, the business downside of failing to go-live at a particular time is so significant that it justifies the risk of things going wrong, even if you don't have surety of the system. Neither Steve nor I know what really went wrong at Tesco and Barclays, but it is reasonable to guess this: those IT managers would have been under substantial pressure to get those systems out the door, and they took a calculated risk.
Steve goes on to imply that this kind of risk taking is management incompetence:
The reputations of all IT workers are demeaned by such failures, the governance and management failings that allowed these errors to occur bring us all into disrepute by association. The harm suffered due to IT failings in large corporates not only affects them and their customers, it taints all of us in the IT profession, fairly or otherwise.
I don't for a moment suggest that it is acceptable that major systems fail. Nor do I think taking stupid risks is sensible management. But since every change to a major IT system is an exercise in risk assessment, things are going to go wrong from time to time. It is simple to blame management when things go wrong, but not so simple to come up with a solution that reduces the chance of failure to zero in a way that's affordable.
Actually, if such a solution existed, I think we'd all have deployed it by now. Steve, if you know something that the rest of us don't, now is the time to share.
In the meantime, the answer is not to have more governance; it is to have proportional governance and to invest in long-term complexity reduction. Complexity reduction is not a complete solution, of course, but what it does is reduce, over time, the amount of information you have to get before you can safely take a decision to go live.
James: You are freely interchanging 'Governance' & 'IT Governance'. I don't think they are the same from a business perspective.
I still think it is a governance issue (while I agree that management should take decisions on the limited data available to them and yes, it is a risk). Did these systems go through functional testing? Was there proper customer service management training? (From what Steve writes, it doesn't appear so.)
The data elements that need to be monitored by the program managers are still part of governance.
I agree with you that taking risks is not incompetence, but it shouldn't be done in ignorance of the facts. If the system is not ready for go-live based on testing results, then management should postpone go-live. If they still go ahead, it is not risk taking; it is foolishness.
Posted by: Joseph | July 28, 2010 at 09:32 AM
My own view is that projects often fail as a result of poor leadership e.g. the leader is not listening to his troops, and cutting corners or making promises elsewhere that cannot be kept.
Often, the governance regime is just a bureaucratic process with no real impact on decision making.
The public sector has the added challenge of the go-live date being chosen and cast in stone before the actual scope of requirements is known. The delivery scope can sometimes be contractual and difficult to change. Management will often try to deliver at any cost, resulting in a lose-lose outcome.
When the Leader continually burns out the troops and stops listening to his/her staff, morale declines and often the project is lost. The projects I hate the most are where the new leader comes in and the mantra is ‘take no prisoners and deliver at any cost’. This can result in a short term gain, but longer term it is detrimental.
Good governance needs real two-way vertical and horizontal communication and should not just be a bureaucratic process.
Posted by: Steve Law | July 28, 2010 at 09:42 AM
Hi James, picked this up via Twitter, not being the owner of Management Matters. For reference my email is steve@sba.co.im, and I'll take the liberty below of expanding on the discussion.
Governance of IT has two faces, internal and external. The two commonly permit different standards of oversight.
The internal environment is semi-controlled; we deploy new IT within our organisation to our staff, with whom we can communicate easily to manage initial problems & difficulties, implement work-around procedures etc. In this context IT Governance remains within the purview of the CIO / ITD.
The external environment is uncontrolled; we deploy to customers over whom we have no control, and with whom communication is much more difficult. As a consequence it is much more important to get it "right first time", demanding higher standards of development, testing and governance. The external environment also brings enhanced risk, particularly reputational, legal and legislative risks for the whole organisation. In this context governance of IT ceases to be primarily within the authority of the CIO, it extends to the whole board; the CEO, COO, Marketing Director and Sales Director all have a responsibility and a right to oversee that which is deployed externally. It becomes part of the wider corporate governance burden.
When we deploy IT externally it ceases to be "IT". Instead it becomes a product or service of the organisation. It might be a web-shopping facility, a laser printer, a satellite or the fly-by-wire control systems of an aircraft. Whichever, the consequences of failure have much greater impacts upon our customers and therefore upon our organisations. Higher standards are required.
Development managers in companies making high-tech products understand this, and work within more demanding governance regimes than most IT functions, it is relatively rare that we see a major technology product manufacturer exhibiting such failure as the iPhone4 debacle; companies such as Boeing, Ford, General Motors, Xerox, Hewlett-Packard etc. have long understood the need for advanced governance mechanisms to protect themselves from the commercial consequences of failure.
Of course no governance system is perfect. The Chinook FADEC example shows us this. The governance mechanism worked; RAF Boscombe Down refused to authorise use of the FADEC control system, but their informed and expert decision was over-ridden within the highest levels of MoD and Government.
We all have to take risk when introducing technology innovation. I have done so many times, including committing major corporations to the mass-production and sale of complex systems that we knew to be imperfect, however we also knew what the imperfections were, and what impact they might have - we did the testing and risk analysis before committing to the risk.
I absolutely agree with you about complexity reduction, it has a major part to play in reducing risk and increasing reliability. I also agree that we need Proportional Governance, I have explained above why different levels of governance are required for internal vs. external systems, and one could extend proportionality further to encompass different levels of risk. However I also think the "IT Profession" has a lot to learn about governance.
We should take lessons from the great technology product development corporations in developing IT Governance models which address the needs of external IT, treating it as the commercial product or service that it is. We, as organisations, should understand, from the Chairman of the Board down, that governance of externally-facing IT is a corporate matter, it belongs to the board and cannot be delegated to the sole responsibility of the CIO. The governance of business is the balance of risk and reward, which falls squarely in the remit of the board as expressed in both the Turnbull Report http://bit.ly/9gDULJ and the UK Corporate Governance Code http://bit.ly/aWJTHN . Thus a failure of customer-facing IT may be an IT failure, but it is also a corporate failure.
Returning to the parochial concerns of the IT Manager, Director, CIO. There is almost never enough time and resource for testing; we all know that, although our masters may not appreciate it. Most Great IT Leaders have a strategy for mitigating the risk of defective IT rollouts - we call it roll-back. It is incumbent on every IT leader to have a fallback option when exercising risk, and the governance regime should embrace this. If we cannot practically test to a very high level of confidence we should at least engineer a recovery position in our new deployments, enabling us to roll back quickly to a workable position. When faced with failure it is always better to retreat, regroup and retry; pressing on regardless simply means that instead of losing the battle we risk losing the war.
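As a rough illustration of the roll-back idea, the Python sketch below wraps a deployment so that a failed deploy or a failed health check triggers a return to the previous position. The function name deploy_with_rollback and the three callables it takes are hypothetical; how a real organisation deploys, checks and recovers a system is assumed to be supplied by the caller.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    health_check: Callable[[], bool],
    rollback: Callable[[], None],
) -> str:
    """Attempt a go-live and retreat to the old position if it goes badly.

    All three callables are placeholders for whatever the organisation
    actually uses to release, verify and restore a system.
    """
    try:
        deploy()
        healthy = health_check()
    except Exception:
        # A crash during deploy or checking counts as an unhealthy release.
        healthy = False

    if not healthy:
        rollback()  # retreat, regroup and retry later
        return "rolled_back"
    return "live"


if __name__ == "__main__":
    log = []
    outcome = deploy_with_rollback(
        deploy=lambda: log.append("deployed new version"),
        health_check=lambda: False,  # simulate a bad release
        rollback=lambda: log.append("restored previous version"),
    )
    print(outcome, log)
```

The point of the design is that the recovery path is decided and wired in before go-live, not improvised after the failure.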
You asked me to share if I know something that "the rest of us don't". As I've said, no governance system is perfect, but I would commend to any IT development leader the "Xerox Product Delivery Process (PDP)" as an example of a governance regime for the development and roll-out of public-facing technology. Not only does it embrace the roles of all corporate stakeholders, for instance in Management, Marketing, Sales, Operations and Support, it is also well proven having been in use and continually refined for over 20 years by one of the most innovative technology development companies in the world. Other great corporations have their own mechanisms, the point is that exemplars exist. Not only should we do better, the roadmaps to improvement are already available.
Cheers, Steve
Posted by: Steve Burrows | July 28, 2010 at 10:21 AM
Hiya,
I've worked with one IT shop who took the interesting approach of only promising to get 80% right, 80% of the time. The idea was that they accept some failure as a consequence of being efficient and timely in the delivery of solutions into the business. They got away with this as their definition of "fit for purpose" was not "totally bullet proof" but "with a manageable level of risk", which is pretty much the point you're making :)
r.
PEG
Posted by: Peter Evans-Greenwood | July 28, 2010 at 10:34 AM
Steve I think we are mostly in agreement.
Proportionality of governance is, indeed, the key thing.
I dispute, however, that having a great process means you get to zero failure. All you do is reduce the risk of failure to some degree, but there is a cost of that risk reduction, whether you define the governance as internal to IT or not.
The argument I'm making is that sooner or later, those costs mount up to such a degree that you are left with taking a risk anyway.
I guess what you're really saying is there are great examples available that show how to get more information before you hit the go-button. I agree with that.
And I also agree that the IT profession has a lot to learn about getting the information cheaply and effectively. As we improve that capability, no doubt we will decrease the failure rate.
I doubt we will ever get to zero, though.
Posted by: James Gardner | July 28, 2010 at 02:03 PM
It seems to me, the question here is clear. Can we build any IT system with a zero-percent chance of failure?
The answer of course is no, we can't. That's where the risk assessment comes in. So what is risk?
Well this is something I've worked on in banks, so here's my two-cents worth. Risk is an estimate of what could go wrong, the likelihood of it going wrong, the impact of it going wrong and the resulting cost of it going wrong. Simple.
Governance fits in here only in the last area, the cost, the punitive effect (often an FSA or similar fine).
A risk assessment merely returns a score. It might be that there is a two-percent chance of failure. The risk assessment doesn't make a system better, it merely informs the business what the balance of probability is. The business then accepts or rejects on that basis.
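The scoring described above can be sketched roughly as follows. This uses one common way of reducing a list of risks to a number (likelihood times cost, summed), assumed here for illustration; the Risk class, the example risks and all figures are hypothetical and are not taken from any bank's actual method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    description: str   # what could go wrong
    likelihood: float  # estimated probability of it going wrong, 0.0 to 1.0
    cost: float        # estimated cost if it does go wrong


    def expected_loss(self) -> float:
        return self.likelihood * self.cost


def risk_score(risks: List[Risk]) -> float:
    """Sum the expected losses. The score informs the decision;
    it does nothing to make the system itself any better."""
    return sum(r.expected_loss() for r in risks)


def business_accepts(risks: List[Risk], tolerance: float) -> bool:
    """The business, not the assessment, decides: accept if the score
    is within whatever loss it says it is prepared to tolerate."""
    return risk_score(risks) <= tolerance


if __name__ == "__main__":
    register = [
        Risk("payment outage at go-live", likelihood=0.02, cost=5_000_000),
        Risk("data migration error", likelihood=0.10, cost=200_000),
    ]
    print(risk_score(register), business_accepts(register, tolerance=150_000))
```

A risk that was never entered in the list contributes nothing to the score, which is the case where a later failure really is a failure of the assessment.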
So, an "IT failure" isn't a failure of risk, it's a throw of the dice and an unfortunate, but known, outcome. It would only be a failure if an aspect of the system hadn't been considered in the original assessment.
Now the scary part. James is right. The idea of "infinite resources" or "the best IT staff" is completely fallacious.
No bank department I've worked in or heard about has such a budget nor would ever deploy it. And that includes Tesco, where I've also worked. James is right to make this point.
But what is the result of a system outage or compliance failure?
The answer is - in the private sector at least - potential damage to the share price and therefore investor value and the profitability (and bonuses) of the business.
Now the truth that no one will actually tell you. Real-life bank compliance is actually simply a proportion of any system cost set aside to pay for any fines incurred as a result of compliance-governance system failure. It is there to ensure such fines never impact on the balance sheet.
James suggests that more governance in itself won't help. I think he's right. More compliance is only necessary where insufficient was there in the first place - like in the banking crash.
More compliance merely drives up costs of delivery and ultimately cost to the service consumer.
Finally, Steve makes a vital point when he says that IT ceases to be merely IT when it goes live. It becomes a business function.
We must get away from the idea of "IT" as an end in itself; it is merely the tool used to deliver a service.
And you know what, sometimes tools break. Get over it.
Posted by: Neil Robinson | July 29, 2010 at 08:02 AM
Hi James & Neil. Dunno how this happened but you both seem to have picked out an idea that I suggested an expectation of zero failure. If you check back to the Management Matters piece I never did, so let's not get distracted by that. Perfection isn't going to happen. Failure is costly. I suggest that we in IT accept failure too easily and too often, in part as a consequence of the challenges of complexity as highlighted by James.
In order to reduce failure we need governance regimes and attitudes that help us better balance the opportunity to the risk. When the IT steps out into the customer's realm we need to treat it differently, like other tech products we sell where the cost of failure is higher. We need to view it in terms of corporate, not IT, impact, and engage the corporate organs in accepting the risk and providing the resource necessary to mitigate it.
Also, for important customer-facing services we need a back-up plan, a roll-back situation, a default action that we will perform if things go pear shaped so that we create time and opportunity for a graceful low-impact recovery.
Could go on but aside from this misinterpretation I think we are in broad agreement. Leaving customers in the lurch for days or weeks is not an outcome that any of us should find acceptable.
Cheers, Steve
Posted by: Steve Burrows | July 30, 2010 at 02:25 PM
Wow! A lot of comments, some longer than the post. I will admit to having only glanced at the comments, so there is a good chance somebody may have already mentioned what I want to state as briefly as possible. In responding, I am going to overlook the toggle from "IT Governance" to "governance." I will just talk about IT Governance.
First, IT Governance is a function of the business, not a function of IT. IT participates, facilitates and enables, but it is a business function. (I acknowledge the business seldom drives IT Governance.)
Second, in my humble opinion, IT Governance involves much more than risk management. IT Governance (or, Business Governance of IT) ensures:
- IT is aligned with the business
- IT delivers value to the business
- IT appropriately manages risk
- IT appropriately manages resources
- IT appropriately manages performance
Given the principles I list above and the magnitude of the failure, it is nearly impossible for IT Governance to not be at fault in some way. (I don't have enough data to even guess what aspect of IT Governance failed.) And though the failure may have been unavoidable, it is a failure nonetheless.
Steve Romero, IT Governance Evangelist
http://community.ca.com/blogs/theitgovernanceevangelist/
Posted by: Steven Romero, IT Governance Evangelist | August 02, 2010 at 04:58 PM
The problem is a lack of effective governance and attributing the problem to IT is itself an indication of this.
These projects are not about changing IT - they are about changing the whole system of business. This involves consideration of all people, process, structure and technology aspects. When corporate and project governance fails to give this due consideration and shifts the problem to IT, IT must try to compensate with unnecessarily complex solutions. And the risk of a whole system failure increases markedly.
The Queensland Health Payroll System debacle is a case in point. See Mark Toomey's excellent commentary on inept governance. Mark evaluates this project through the lens of ISO/IEC 38500 and identifies the causes of failure that could have been foreseen and acted upon by those who had the real authority and accountability for the success of the project.
Public facing or multiple agency projects demand better governance and acceptance of accountability by business leaders for the performance of the whole system of business.
Posted by: Basil Wood | August 02, 2010 at 10:34 PM
Hmmm. It's nearly all been said. James, I have some real difficulty with what you are saying - for one simple reason. You define governance thus: "The role of governance is to provide the information that managers need in order to make their risk decision about a system". Sorry, that's not governance. In this post, the Steves have it.
Governance is important, and it’s not management. Governance is fundamentally a system that enables the top level governing body to ensure that the organisation's managers do their job properly. It engages with the management systems where the detail of planning and control happen, guiding decisions with strategy and policy, and monitoring the outcomes that are produced. Don’t for a minute think that it has to be bureaucratic – indeed bureaucracy in the governance and management systems is as often a problem with managers not doing their jobs properly as it is any issue with governance being a complex business.
Oh and that’s the big picture of governance. Governance of IT is just as straightforward – it’s the system for directing and controlling the current and future use of IT in an organisation. Note that I said use – that’s not just the supply – it includes the demand aspects as well. Governance of IT should be ensuring that both business and IT leaders are doing their jobs properly. When they do, there is a good chance that we will get the systems the business needs, and that they will work well enough, and that when there is a problem, it will be dealt with efficiently and effectively.
Some will argue passionately that building the perfect IT system is not possible. Nor is building the perfect aeroplane. But commercial and military aeroplanes today are more complex than most of the IT required to run either a business or the machinery of government. Their reliability seems to greatly exceed IT’s reliability. Why? Yes, they do collect metrics too – by the cubic metre in some cases. But they also do a few other things well. Most significant, perhaps – they learn from mistakes. And their managers do their jobs properly. If they don’t, the learning lessons process discovers their failures and takes them out of the loop.
Imagine how much better off we would be if organisations that use and depend on IT could do that one simple thing – learn from mistakes.
Good governance should ensure that we learn from mistakes. Then would anybody argue that governance is not important, or that it is not the answer to avoiding disasters?
Posted by: Mark Toomey | August 05, 2010 at 01:15 PM
James,
You have pinpointed the problem exactly when you say, "Systems failures don't happen because of governance, or the lack of it. They happen because large systems which connect to other large systems are simply too complex for anyone to be sure of what's going to happen when you turn them on."
The solution is not to try to figure out better ways to govern complex systems. Nor is the solution to figure out better ways to test complex systems. The solution is to figure out better ways to get rid of the complexity.
But wait a minute. If we are solving complex problems, don't we need complex solutions? No, we don't. We should never confuse complexity in the problem space (unavoidable) with complexity in the solution space (absolutely avoidable.)
In a recent development, the U.S. Patent office has, for the first time in history, granted a patent to a methodology to remove complexity from large IT systems. (disclosure: I am the patent holder.) In granting this patent, the patent office has agreed that IT complexity is measurable, reducible, and that this methodology is unique and novel in its approach to minimizing that complexity.
As long as we look for better ways to document, govern, test, and/or understand complex systems, we are doomed to fail. Complexity cannot be coddled. It is a disease. It must be eliminated.
We don't know how to build complex systems and we never will. Complex systems are, as you point out, too complex. And yet, we are stuck with the need to solve complex problems. Our only hope, then, is to learn to build simple systems that solve complex problems.
Posted by: Roger Sessions | August 15, 2010 at 10:53 PM
Governance is a stream of management.
Management should include empowerment - including recognition (and reward). Governance bodies & processes are rarely able to include this.
Also, Governance rarely scales itself to the degree of risk undertaken - my own company has suggestions of how to manage Security as a risk for this very reason.
Posted by: Mike Broomhead | August 18, 2010 at 02:19 AM