What does “input validation” even mean any more?

Information Security is full of terminology.

Sometimes we even understand what we mean. I’ve yet to come across a truly awesome, yet brief, definition of “threat”, for instance.

But one that bugs me, because it shouldn’t be that hard to get right, and because I hear it from people I otherwise respect greatly, is that of “input validation”.

What is “validation”?

Fight me on this, but I think that validation is essentially a yes/no decision on a set of input, whether it’s textual, binary, or whatever other format you care to define.

Exactly what you are validating is up for debate, whether you’re looking at syntax or semantics – is it formatted correctly, versus does it actually make sense?

Syntax versus semantics

“Colorless green ideas sleep furiously” is a famous example of a sentence that is syntactically correct – it follows an “adjective adjective noun verb adverb” pattern that is common in English – but semantically, it makes no sense: ideas can’t be green (let alone colorless and green at once), they can’t sleep, and nothing can sleep furiously (although my son used to sleep with his fists clenched really tight when he was a little baby).

“0 / 0” is a syntactically correct mathematical expression, but whether it’s semantically correct is open to argument.

“Sell 1000 shares” might be a syntactically correct instruction, but semantically, it could be you don’t have 1000 shares, or there’s a business logic limit, which says such a transaction requires extra authentication.
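To make the share-selling example concrete, here is a minimal Python sketch of the two kinds of check. The instruction grammar, the function names and the `shares_held` parameter are all assumptions made up for illustration, not any real trading API.

```python
import re

# Hypothetical grammar for the instruction: "Sell <positive integer> shares".
ORDER_PATTERN = re.compile(r"Sell ([1-9][0-9]{0,6}) shares")


def is_syntactically_valid(instruction: str) -> bool:
    """Yes/no: is the instruction well formed?"""
    return ORDER_PATTERN.fullmatch(instruction) is not None


def is_semantically_valid(instruction: str, shares_held: int) -> bool:
    """Yes/no: does a well-formed instruction make sense for this account?"""
    match = ORDER_PATTERN.fullmatch(instruction)
    if match is None:
        return False
    # The business rule assumed here: you cannot sell more than you hold.
    return int(match.group(1)) <= shares_held
```

With these definitions, “Sell 1000 shares” passes the syntactic check for everyone, but only passes the semantic check for an account holding at least 1000 shares.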

So there’s a difference between syntactical validation and semantic validation, but…

What’s that got to do with injection exploits?

Injection attacks occur when input data – a string of characters – is semantically valid in the language of the enclosing code as code itself, and not just as data. Sometimes (but not always) this means the data contains a character or character sequence that allows the data to “escape” from its data context to a code context.

Can validation stop injection exploits?

This is a question I ask, in various round-about ways, in a lot of job interviews, so it’s quite an important question.

The answer is really simple.

Yes. And no.

If you can validate your input, such that it is always syntactically and semantically correct, you can absolutely prevent injection exploits.

But this is really only possible for relatively simple sets of inputs, and where the processing is safe for that set of inputs.

How about an example?

Suppose I’ve got a product ordering site, and I’m selling books.

You can order an integer number of books. Strictly speaking, positive integers, and 0 makes no sense, so start at 1. You probably want to put a maximum limit on that field, perhaps restricting people to buying no more than a hundred of that book. If they’re buying more, they’ll want to go wholesale anyway.

So, your validation is really simple – “is the field an integer, and is the integer value between 1 and 100?”
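That rule is small enough to write out in full. The following is a sketch in Python, assuming the quantity arrives as a string from a form field; the function name `is_valid_quantity` is hypothetical.

```python
def is_valid_quantity(field: str) -> bool:
    """Yes/no: is the field an integer between 1 and 100 inclusive?"""
    # Syntax: one to three ASCII digits and nothing else. The length cap
    # also keeps absurdly long digit strings away from int().
    if not (field.isascii() and field.isdigit() and len(field) <= 3):
        return False
    # Semantics: the number has to fall inside the range the shop allows.
    return 1 <= int(field) <= 100
```

Because every accepted value is just a small number, there is nothing left in the input that could be mistaken for code by whatever processes it next.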

What about a counter-example?

Having said “yes, and no”, I have to show you an example of the “no”, right?

OK, let’s say you’re asking for validation of names of people – what are your validation rules?

Let’s assume you’re expecting everyone to have ‘latinised’ their name, to make it easy. All the letters are in the range a-z, or A-Z if there’s a capital letter.

Great, so there’s a rule – only match “[A-Za-z]”

Unless, you know, Leonardo da Vinci. Or di Caprio. So you need spaces.

Or Daniel Day-Lewis. So there’s also hyphens to add.

And if you have an O’Reilly, an O’Brian, or a D’Artagnan, or a N’Dour – yes, you’re going to add apostrophes.

Now your validation rule is letting in a far broader range of characters than you started out with, and there’s enough there to allow for SQL injection to happen.

Input can now be syntactically correct by your validation rule, and yet semantically equivalent to data plus SQL code.

Validation alone is insufficient to block injection attacks.
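Here is a sketch of how that plays out, using Python’s built-in `sqlite3` module and an in-memory database. The `customers` table, the function names and the payload are invented for illustration, and the query-building function is deliberately vulnerable.

```python
import re
import sqlite3

# The name rule as it ended up: letters, spaces, hyphens and apostrophes.
NAME_PATTERN = re.compile(r"[A-Za-z' -]+")


def is_valid_name(name: str) -> bool:
    """Yes/no: does the name contain only the allowed characters?"""
    return NAME_PATTERN.fullmatch(name) is not None


def find_customers_unsafely(conn: sqlite3.Connection, name: str) -> list:
    # Deliberately vulnerable: the name is pasted straight into the SQL text.
    query = "SELECT name FROM customers WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?)",
    [("Leonardo da Vinci",), ("Daniel Day-Lewis",)],
)

# Only letters, spaces and apostrophes, so the validator says "yes".
payload = "x' OR 'a' LIKE 'a"
print(is_valid_name(payload))
# The WHERE clause becomes: name = 'x' OR 'a' LIKE 'a', true for every row.
print(find_customers_unsafely(conn, payload))
```

The payload never needed a semicolon, an equals sign or any other “suspicious” character; the apostrophe that O’Reilly requires was enough.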

Why do people say validation is sufficient, then?

I have a working hypothesis. It goes like this.

As a neophyte in information security, you learn a trick.

That trick is validation, and it’s a great thing to share with developers.

They don’t need to be clever or worry hard about the input that comes in, they simply need to validate it.

It actually feels good to reject incorrect input, because you know you’re keeping the bad guys out, and the good guys in.

Then you find an input field where validation alone isn’t sufficient.

Something else must be done

But you’ve told everyone – and had other security folk agree with you – that validation is the way to solve injection attacks.

So you learn a new trick – a new way of protecting inputs.

And you call this ‘validation’, too

After all, it … uhh, kind of does the same thing. It stops injection attacks, so it must be validation.

What is this ‘new trick’?

This new trick is encoding, quoting, or in some way transforming the data, so the newly transformed data is safe to accept.

Every one of those apostrophes? Turn them into the sequence “&#39;” if they’re going into HTML, or double them if they’re in a SQL string, or – and this is FAR better – use parameterised queries so you don’t have to even know how the input string is being encoded on its way into the SQL command.
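As a rough sketch of both approaches in Python: `html.escape` from the standard library handles the HTML case, and `sqlite3`’s `?` placeholders handle the SQL case. The `customers` table and the `find_customers` helper are assumptions for illustration only.

```python
import html
import sqlite3

name = "x' OR 'a' LIKE 'a"

# HTML context: html.escape rewrites the apostrophe as a numeric character
# reference (Python emits the hex form, &#x27;), along with &, <, > and ".
safe_fragment = "<p>Hello, " + html.escape(name, quote=True) + "</p>"
print(safe_fragment)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?)",
    [("O'Reilly",), ("Daniel Day-Lewis",)],
)


def find_customers(conn: sqlite3.Connection, name: str) -> list:
    # SQL context: the ? placeholder passes the name as data, so the
    # driver never interprets any part of it as SQL.
    cursor = conn.execute("SELECT name FROM customers WHERE name = ?", (name,))
    return [row[0] for row in cursor]


print(find_customers(conn, "O'Reilly"))  # a genuine apostrophe still matches
print(find_customers(conn, name))        # the payload is just an odd name
```

Note that neither step rejects anything. Both accept whatever string they are given and make it safe for the specific place it is going.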

Now your input can be validated – and injection attacks are stopped.

But it’s not validation any more

In fact, once you’ve encoded your inputs properly, your validation can be entirely open and empty! At least from the security standpoint, because you’ve made the string semantically entirely meaningless to the code in which it is to be embedded as data. There are no escape characters or sequences, because they, too, have been encoded or transformed into semantically safe data.

It’s encoding… or transformation.

And I happen to think it’s important to separate the two concepts of validation and encoding.

Validation is saying “yes” or “no” to the question “is this string ‘good’ data?” You can validate in a number of different ways, and with good defence in depth, you’ll validate at different locations, based on different knowledge about what is “good”. This matches very strongly with the primary dictionary definition of “validation” – it’s awesome when a technical term matches very closely with a common language term, because teaching it to others becomes easier.

Encoding doesn’t say “yes” or “no”; it simply takes whatever input it’s given, and makes it safe for the next layer to which the data will be handed.

Stop calling encoding “validation”

It’s not.
