The blame game: It’s always never human error

The Ubuntu “Circle of Friends” logo.

Depending on the kind of company you work at, it’s either:

  • a group of three friends holding hands and dancing in a merry circle
  • a group of three colleagues each pointing at the other two to tell you who to blame
  • three guys tied to a pole desperately trying to escape the rising orange flood waters

If you work at the first place, reach out to me on LinkedIn – I know some people who might want to work with you.

If you’re at the third place, you should probably get out now. Whatever they’re paying you, or however much the stock might be worth come the IPO, it’s not worth the pain and suffering.

If you’re at the second place, congratulations – you’re at a regular, ordinary workplace that could do with a little better management.

What’s this got to do with security?

A surprisingly great deal.

Whenever there’s a security incident, there should be an investigation as to its cause.

Clearly the cause is always human error. Machines don’t make mistakes; they act in predictable ways – even when they are acting randomly, they can be stochastically modeled, and their errors taken into consideration. Your computer behaves like a predictable machine, but at various levels it actually routinely behaves as if it’s rolling dice, and there are mechanisms in place to bias those random results towards the predictable answers you expect from it.

Humans, not so much.

Humans make all the mistakes. They choose to continue using parts that are likely to break, because they are past their supported lifecycle; they choose to implement only part of a security mechanism; they forget to finish implementing functionality; they fail to understand the problem at hand; etc, etc.

It always comes back to human error.

Or so you think

Occasionally I will experience these great flashes of inspiration from observing behaviour, and these flashes dramatically affect my way of doing things.

One such was when I attended the weekly incident review board meetings at my employer of the time – a health insurance company.

Once each incident had been resolved and addressed, it was submitted to the incident review board for discussion, so that the company could learn from the cause of the problem and make sure similar problems were forestalled in future.

These weren’t just security incidents – they could be system outages, problems with power supplies, really anything that wasn’t quickly fixed as part of normal process.

But the principles I learned there apply just as well to security incidents.

Root cause analysis

The biggest principle I learned was “root cause analysis” – that you look beyond the immediate cause of a problem to find what actually caused it in the long view.

At other companies, which can’t bear to think that they didn’t invent absolutely everything, this goes by other names – for instance, “the five whys” (suggesting that if you ask “why did that happen?” five times, you’ll get to the root cause). Other names are possible, but the majority of the English-speaking world knows it as “root cause analysis”.

This is where I learned that if you believe the answer is that a single human’s error caused the problem, you don’t have the root cause.

But!

Whenever I discuss this with friends, they always say “But! What about this example, or that?”

You should always ask those questions.

Here are some possible individual causes, along with the questions that point to their actual causes:

  • Bob pulled the wrong lever: Who trained Bob about the levers to pull? Was there documentation? Were the levers labeled? Did anyone assess Bob’s ability to identify the right lever to pull by testing him with scenarios?
  • Kate was evil and did a bad thing: Why was Kate allowed to have unsupervised access? Where was the monitoring? Did we hire Kate? Why didn’t the background check identify the evil?
  • Jeremy told everyone the wrong information: Was Jeremy given the right information? Why was Jeremy able to interpret the information from right to wrong? Should this information have been automatically communicated without going through a Jeremy? Was Jeremy trained in how to transmit information? Why did nobody receiving the information verify it?
  • Grace left her laptop in a taxi: Why does Grace have data that we care about losing – on her laptop? Can we disable the laptop remotely? Why does she even have a laptop? What is our general solution for people, who will be people, leaving laptops in taxis?
  • Jane wrote the algorithm with a bug in it: Who reviews Jane’s code? Who tests the code? Is the test automated? Was Jane given adequate training and resources to write the algorithm in the first place? Is this her first time writing an algorithm – did she need help? Who hired Jane for that position – what process did they follow?


I could go on and on, and I usually do, but if you ever find yourself blaming an individual and saying “human error caused this fault”, it’s important to remember that humans, just like machines, are random and only stochastically predictable, and if you want predictable results, you have to have a framework that brings that randomness and unpredictability into some form of logical operation.

Many of the questions I asked above are also going to end up with the blame apparently being assigned to an individual – that’s just a sign that the analysis needs to keep going until you find an organisational fix. Because if all you do is fix individuals, and you hire new individuals and lose old ones, your organisation itself will never improve.

[Yes, for the pedants, your organisation is made up of individuals, and any organisational fix is embodied in those individuals – so think about how the organisation can train individuals to make sure that organisational learning is passed on.]

Finally, if you’d rather I not use Ubuntu as my “circle of blame” logo, there are plenty of others out there – for instance, Microsoft Alumni:

Microsoft Alumni
