Information Security is full of terminology.
Sometimes we even understand what we mean. Iâve yet to come across a truly awesome, yet brief, definition of âthreatâ, for instance.
But one that bugs me, because it shouldnât be that hard to get right, and because I hear it from people I otherwise respect greatly, is that of âinput validationâ.
Fight me on this, but I think that validation is essentially a yes/no decision on a set of input, whether itâs textual, binary, or whatever other format you care to define.
Exactly what you are validating is up for debate, whether youâre looking at syntax or semantics â is it formatted correctly, versus does it actually make sense?
âGreen ideas sleep furiouslyâ is a famous example of a sentence that is syntactically correct â it follows a standard âAdjective noun verb adverbâ pattern that is common in English â but semantically, it makes no sense: ideas canât be green, and they canât sleep, and nothing can sleep furiously (although my son used to sleep with his fists clenched really tight when he was a little baby).
â0 / 0â is a syntactically correct mathematical expression, but you can argue if itâs semantically correct.
âSell 1000 sharesâ might be a syntactically correct instruction, but semantically, it could be you donât have 1000 shares, or thereâs a business logic limit, which says such a transaction requires extra authentication.
So thereâs a difference between syntactical validation and semantic validation, butâŠ
Injection attacks occur when an input data â a string of characters â is semantically valid in the language of the enclosing code, as code itself, and not just as data. Sometimes (but not always) this means the data contains a character or character sequence that allows the data to âescapeâ from its data context to a code context.
This is a question I ask, in various round-about ways, in a lot of job interviews, so itâs quite an important question.
The answer is really simple.
Yes. And no.
If you can validate your input, such that it is always syntactically and semantically correct, you can absolutely prevent injection exploits.
But this is really only possible for relatively simple sets of inputs, and where the processing is safe for that set of inputs.
An example â suppose Iâve got a product ordering site, and Iâm selling books.
You can order an integer number of books. Strictly speaking, positive integers, and 0 makes no sense, so start at 1. You probably want to put a maximum limit on that field, perhaps restricting people to buying no more than a hundred of that book. If theyâre buying more, theyâll want to go wholesale anyway.
So, your validation is really simple â âis the field an integer, and is the integer value between 1 and 100?â
Having said âyes, and noâ, I have to show you an example of the ânoâ, right?
OK, letâs say youâre asking for validation of names of people â whatâs your validation rules?
Letâs assume youâre expecting everyone to have âlatinisedâ their name, to make it easy. All the letters are in the range a-z, or A-Z if thereâs a capital letter.
Great, so thereâs a rule â only match â[A-Za-z]â
Unless, you know, Leonardo da Vinci. Or di Caprio. So you need spaces.
Or Daniel Day-Lewis. So thereâs also hyphens to add.
And if you have an OâReilly, an OâBrian, or a DâArtagnan, or a NâDour â yes, youâre going to add apostrophes.
Now your validation rule is letting in a far broader range of characters than you start out with, and thereâs enough there to allow for SQL injection to happen.
Input can now be syntactically correct by your validation rule, and yet semantically equivalent to data plus SQL code.
I have a working hypothesis. It goes like this.
As a neophyte in information security, you learn a trick.
That trick is validation, and itâs a great thing to share with developers.
They donât need to be clever or worry hard about the input that comes in, they simply need to validate it.
It actually feels good to reject incorrect input, because you know youâre keeping the bad guys out, and the good guys in.
Then you find an input field where validation alone isnât sufficient.
But youâve told everyone â and had other security folk agree with you â that validation is the way to solve injection attacks.
So you learn a new trick â a new way of protecting inputs.
After all, it âŠ uhh, kind of does the same thing. It stops injection attacks, so it must be validation.
This new trick is encoding, quoting, or in some way transforming the data, so the newly transformed data is safe to accept.
Every one of those apostrophes? Turn them into the sequence â'â if theyâre going into HTML, or double them if theyâre in a SQL string, or â and this is FAR better â use parameterised queries so you donât have to even know how the input string is being encoded on its way into the SQL command.
Now your input can be validated â and injection attacks are stopped.
In fact, once youâve encoded your inputs properly, your validation can be entirely open and empty! At least from the security standpoint, because youâve made the string semantically entirely meaningless to the code in which it is to be embedded as data. There are no escape characters or sequences, because they, too, have been encoded or transformed into semantically safe data.
And I happen to think itâs important to separate the two concepts of validation and encoding.
Validation is saying âyesâ or ânoâ to the question âis this string âgoodâ data?â You can validate in a number of different ways, and with good defence in depth, youâll validate at different locations, based on different knowledge about what is âgoodâ. This matches very strongly with the primary dictionary definition of âvalidationâ â itâs awesome when a technical term matches very closely with a common language term, because teaching it to others becomes easier.
Encoding doesnât say âyesâ or ânoâ, encoding simply takes whatever input itâs given, and makes it safe for the next layer to which the data will be handed.