Category Archives: Stack Overflow

Anti-pattern: parallel collections

(Note that I’m not talking about "processing collections in parallel, which is definitely not an anti-pattern…)

I figured it was worth starting to blog about anti-patterns I see frequently on Stack Overflow. I realize that some or all of these patterns may be collected elsewhere, but it never hurts to express such things yourself… it’s a good way of internalizing information, aside from anything else. I don’t guarantee that my style of presenting these will stay consistent, but I’ll do what I can…

The anti-patterns themselves are likely to be somewhat language-agnostic, or at the very least common between Java and C#. I’m likely to post code samples in C#, but I don’t expect it to be much of a hindrance to anyone coming from a background in a similar language.

Context

You have related pieces of data about each of several items, and want to keep this data in memory. For example, you’re writing a game and have multiple players, each with a name, score and health.

Anti-pattern

Each kind of data is stored (all the names, all the scores, all the health values) in a separate collection. Typically I see this with arrays. Then each time you need to access related values, you need to make sure you’re using the same index for each collection.

So the code might look like this:

class Game 

    string[] names;
    int[] scores;
    int[] health;

    public Game() 
    { 
        // Initialize all the values appropriately
    } 

    public void PrintScores() 
    { 
        for (int i = 0; i < names.Length; i++) 
        { 
            Console.WriteLine("{0}: {1}", names[i], scores[i]); 
        } 
    } 
}

Preferred approach

The code above fails to represent an entity which seems pretty obvious when you look at the description of the data: a player. Whenever you find yourself describing pieces of data which are closely related, you should make sure you have some kind of representation of that in your code. (In some cases an anonymous type is okay, but often you’ll want a separate named class.)

Once you’ve got that type, you can use a single collection, which makes the code much cleaner to work with.

class Player
{
    public string Name { get; private set; }
    public int Score { get; set; }
    public int Health { get; set; }

    public Player(string name, int health)
    {
        Name = name;
        Health = health;
    }

    // Consider other player-specific operations
}

class Game
{
    List<Player> players;

    public Game()
    {
        // Initialize players appropriately
    }

    public void PrintScores()
    {
        foreach (var player in players)
        {
            Console.WriteLine("{0}: {1}", player.Name, player.Score);
        }
    }
}

Note how we can now use a foreach loop to iterate over our players, because we don’t care need to use the same index for two different collections.

Once you perform this sort of refactoring, you may well find that there are other operations within the Game class which would be better off in the Player class. For example, if you also had a Level property, and increasing that would automatically increase a player’s health and score, then it makes much more sense for that "level up" operation to be in Player than in Game. Without the Player concept, you’d have nowhere else to put the code, but once you’ve identified the relationship between the values, it becomes much simpler to work with.

It’s also much easier to modify a single collection than multiple ones. For example, if we wanted to add or remove a player, we now just need to make a change to a single collection, instead of making sure we perform the same operation to each "single value" collection in the original code. This may sound like a small deal, but it’s easy to make a mistake and miss out on one of the collections somewhere. Likewise if you need to add another related value – like the "level" value described above – it’s much easier to add that in one place than adding another collection and then making sure you do the right thing in every piece of code which changes any of the other collections.

Summary

Any time you find yourself with multiple collections sharing the same keys (whether those are simple list indexes or dictionary keys), think about whether you could have a single collection of a type which composes the values stored in each of the original collections. As well as making it easier to handle the collection data, you may find the new type allows you to encapsulate other operations more sensibly.

Update: what about performance?

As some readers have noted, this transformation can have an impact on performance. Originally, all the scores were kept close together in memory, all the health etc. If you perform a bulk operation on the scores (finding the average score, for example) that locality of reference can have a significant impact. In some cases that may be enough justification to use the parallel collections instead… but this should be a conscious decision, having weighed up the pros and cons and measured the performance impact. Even then, I’d be tempted to encapsulate that PlayerCollection in a separate type, allowing it to implement IEnumerable<Player> where useful. (If you wanted the Player to be mutable, you’d need it to be aware of PlayerCollection itself.)

In almost all these anti-patterns, there will be cases where they’re the lesser of two evils – but novice developers need to be aware of them as anti-patterns to start with. As ever with performance trade-offs, I believe in first deciding on concrete performance goals, then implementing the code in the simplest possible way that meets the non-performance goals, measuring against the performance goals, and tweaking if necessary to achieve them, relying heavily on measurement.

Diagnosing issues with reversible data transformations

I see a lot of problems which look somewhat different at first glance, but all have the same cause:

  • Text is losing "special characters" when I transfer it from one computer to another
  • Decryption ends up with garbage
  • Compressed data can’t be decompressed
  • I can transfer text but not binary data

These are all cases of transforming and (usually) transferring data, and then performing the reverse transformation. Often there are multiple transformations involved, and they need to be carefully reversed in the appropriate order. For example:

  1. Convert text to binary using UTF-8
  2. Compress
  3. Encrypt
  4. Base64-encode
  5. Transfer (e.g. as text in XML)
  6. Base64-decode
  7. Decrypt
  8. Decompress
  9. Convert binary back to text using UTF-8

The actual details of each question can be different, but the way I’d diagnose them is the same in each case. That’s what this post is about – partly so that I can just link to it when such questions arise. Although I’ve numbered the broad steps, it’s one of those constant iteration situations – you may well need to tweak the logging before you can usefully reduce the problem, and so on.

1. Reduce the problem as far as possible

This is just my normal advice for almost any problem, but it’s particularly relevant in this kind of question.

  • Start by assembling a complete program demonstrating nothing but the transformations. Using a single program which goes in both directions is simpler than producing two programs, one in each direction.
  • Remove pairs of transformations (e.g. encrypt/decrypt) at a time, until you’ve got the minimal set which demonstrates the problem
  • Avoid file IO if possible: hard-code short sample data which demonstrates the problem, and use in-memory streams (ByteArrayInputStream/ByteArrayOutputStream in Java; MemoryStream in .NET) for temporary results
  • If you’re performing encryption, hard-code a dummy key or generate it as part of the program.
  • Remove irrelevant 3rd party dependencies if possible (it’s simpler to reproduce an issue if I don’t need to download other libraries first)
  • Include enough logging (just console output, usually) to make it obvious where the discrepancy lies

In my experience, this is often enough to help you fix the problem for yourself – but if you don’t, you’ll be in a much better position for others to help you.

2. Make sure you’re really diagnosing the right data

It’s quite easy to get confused when comparing expected and actual (or before and after) data… particularly strings:

  • Character encoding issues can sometimes be hidden by spurious control characters being introduced invisibly
  • Fonts that can’t display all characters can make it hard to see the real data
  • Debuggers sometimes "helpfully" escape data for you
  • Variable-width fonts can make whitespace differences hard to spot

For diagnostic purposes, I find it useful to be able to log the raw UTF-16 code units which make up a string in both .NET and Java. For example, in .NET:

static void LogUtf16(string input)
{
    // Replace Console.WriteLine with your logging approach
    Console.WriteLine("Length: {0}", input.Length);
    foreach (char c in input)
    {
        Console.WriteLine("U+{0:x4}: {1}", (uint) c, c);
    }
}

Binary data has different issues, mostly in terms of displaying it in some form to start with. Our diagnosis tools are primarily textual, so you’ll need to perform some kind of conversion to a string representation in order to see it at all. If you’re really trying to diagnose this as binary data (so you’re interested in the raw bytes) do not treat it as encoded text using UTF-8 or something similar. Hex is probably the simplest representation that allows differences to be pinpointed pretty simply. Again, logging the length of the data in question is a good first step.

In both cases you might want to include a hash in your diagnostics. It doesn’t need to be a cryptographically secure hash in any way, shape or form. Any form of hash that is likely to change if the data changes is a good start. Just make sure you can trust your hashing code! (Every piece of code you write should be considered suspect – including whatever you decide to use for converting binary to hex, for example. Use trusted third parties and APIs provided with your target platform where possible. Even though adding an extra dependency for diagnostics makes it slightly harder for others to reproduce the problem, it’s better than the diagnostics themselves being suspect.)

3. Analyze a clean, full, end-to-end log

This can be tricky when you’ve got multiple systems and platforms (which is why if you can possibly reproduce it in a single program it makes life simpler) but it’s really important to look at one log for a complete run.

Make sure you’re talking about the same data end-to-end. If you’re analyzing live traffic (which should be rare; unless the problem is very intermittent, this should all be done in test environments or a developer machine) or you have a shared test environment you need to be careful that you don’t use part of the data from one test and part of the data from another test. I know this sounds trivial, but it’s a really easy mistake to make. In particular, don’t assume that the data you’ll get from one part of the process will be the same run-to-run. In many cases it should be, but if the overall system isn’t working, then you already know that one of your expectations is invalid.

Compare "supposed to be equal" parts of the data. As per the steps in the introduction, there should be pairs of equal data, moving from the "top and bottom" of the transformation chain towards the middle. Initially, you shouldn’t care about whether you view the transformation as being correct – you’re only worried about whether the output is equal to the input. If you’ve managed to preserve all the data, the function of the transformation (encryption, compression etc) becomes relevant – but if you’re losing data, anything else is secondary. This is where the hash from the bottom of step 2 is relevant: you want to be able to determine whether the data is probably right as quickly as possible. Between "length" and "hash", you should have at least some confidence, which will let you get to the most likely problem as quickly as possible.

4. Profit! (Conclusion…)

Once you’ve compared the results at each step, you should get an idea of which transformations are working and which aren’t. This may allow you to reduce the problem further, until you’ve just got a single transformation to diagnose. At that point, the problem becomes about encryption, or about compression, or about text encoding.

Depending on the situation you’re in, at this point you may be able to try multiple implementations or potentially multiple platforms to work out what’s wrong: for example, if you’re producing a zip file and then trying to decompress it, you might want to try using a regular decompression program to open your intermediate results, or decompress the results of compressing with a standard compression tool. Or if you’re trying to encrypt on Java and decrypt in C#, implement the other parts in each platform, so you can at least try to get a working "in-platform" solution – that may well be enough to find out which half has the problem.

To some extent all this blog post is about is reducing the problem as far as possible, with some ideas of how to do that. I haven’t tried to warn you much about the problems you can run into in any particular domain, but you should definitely read Marc Gravell’s excellent post on "How many ways can you mess up IO?" and my earlier post on understanding the meaning of your data is pretty relevant too.

As this is a "hints and tips" sort of post, I’ll happily modify it to include reader contributions from comments. With any luck it’ll be a useful resource for multiple Stack Overflow questions in the months and years to come…

Stack Overflow question checklist

My earlier post on how to write a good question is pretty long, and I suspect that even when I refer people to it, often they don’t bother reading it. So here’s a short list of questions to check after you’ve written a question (and to think about before you write the question):

  • Have you done some research before asking the question? 1
  • Have you explained what you’ve already tried to solve your problem?
  • Have you specified which language and platform you’re using, including version number where relevant?
  • If your question includes code, have you written it as a short but complete program? 2
  • If your question includes code, have you checked that it’s correctly formatted? 3
  • If your code doesn’t compile, have you included the exact compiler error?
  • If your question doesn’t include code, are you sure it shouldn’t?
  • If your program throws an exception, have you included the exception, with both the message and the stack trace?
  • If your program produces different results to what you expected, have you stated what you expected, why you expected it, and the actual results?
  • If your question is related to anything locale-specific (languages, time zones) have you stated the relevant information about your system (e.g. your current time zone)?
  • Have you checked that your question looks reasonable in terms of formatting?
  • Have you checked the spelling and grammar to the best of your ability? 4
  • Have you read the whole question to yourself carefully, to make sure it makes sense and contains enough information for someone coming to it without any of the context that you already know?

    If the answer to any of these questions is "no" you should take the time to fix up your question before posting. I realize this may seem like a lot of effort, but it will help you to get a useful answer as quickly as possible. Don’t forget that you’re basically asking other people to help you out of the goodness of their heart – it’s up to you to do all you can to make that as simple as possible.


    1 If you went from "something’s not working" to "asking a question" in less than 10 minutes, you probably haven’t done enough research.

    2 Ideally anyone answering the question should be able to copy your code, paste it into a text editor, compile it, run it, and observe the problem. Console applications are good for this – unless your question is directly about a user interface aspect, prefer to write a short console app. Remove anything not directly related to your question, but keep it complete enough to run.

    3 Try to avoid code which makes users scroll horizontally. You may well need to change how you split lines from how you have it in your IDE. Take the time to make it as clear as possible for those trying to help you.

    4 I realize that English isn’t the first language for many Stack Overflow users. We’re not looking for perfection – just some effort. If you know your English isn’t good, see if a colleague or friend can help you with your question before you post it.

  • Stack Overflow and personal emails

    This post is partly meant to be a general announcement, and partly meant to be something I can point people at in the future (rather than writing a short version of this on each email).

    These days, I get at least a few emails practically every day along the lines of:

    "I saw you on Stack Overflow, and would like you to answer this development question for me…"

    It’s clear that the author:

    • Is aware of Stack Overflow
    • Is aware that Stack Overflow is a site for development Q&A
    • Is aware that I answer questions on Stack Overflow

    … and yet they believe that the right way of getting me to answer a question is by emailing it to me directly. Sometimes it’s a link to a Stack Overflow question, sometimes it’s the question asked directly in email.

    In the early days of Stack Overflow, this wasn’t too bad. I’d get maybe one email like this a week. Nowadays, it’s simply too much.

    If you have a question worthy of Stack Overflow, ask it on Stack Overflow. If you’ve been banned from asking questions due to asking too many low-quality ones before, then I’m unlikely to enjoy answering your questions by email – learn what makes a good question instead, and edit your existing questions.

    If you’ve already asked the question on Stack Overflow, you should consider why you think it’s more worthy of my attention than everyone else’s questions. You should also consider what would happen if everyone who would like me to answer a question decided to email me.

    Of course in some cases it’s appropriate. If you’ve already asked a question, written it as well as you can, waited a while to see if you get any answers naturally, and if it’s in an area that you know I’m particularly experienced in (read: the C# language, basically) then that’s fine. If your question is about something from C# in Depth – a snippet which doesn’t work or some text you don’t understand, for example – then it’s entirely appropriate to mail me directly.

    Basically, ask yourself whether you think I will actually welcome the email. Is it about something you know I’m specifically interested in? Or are you just trying to get more attention to a question, somewhat like jumping a queue?

    I’m aware that it’s possible this post makes me look either like a grumpy curmudgeon or (worse) like an egocentric pseudo-celebrity. The truth is I’m just like everyone else, with very little time on my hands – time I’d like to spend as usefully and fairly as possible.

    Diagnosing weird problems – a Stack Overflow case study

    Earlier, I came across this Stack Overflow question. I solved it, tweeted it, but then thought it would serve as a useful case study into the mental processes I go through when trying to solve a problem – whether that’s on Stack Overflow, at work, or at home.

    It’s definitely worth reading the original question, but the executive summary is:

    When I compute the checksum/hash of c:\Windows\System32\Calc.exe using various tools and algorithms, those tools all give the same answer for each algorithm. When I try doing the same thing in Java, I get different results. What’s going on?

    Now to start with, I’d like to shower a bit of praise on the author:

    • The post came with a short but utterly complete program to demonstrate the problem
    • The comments in the program showed the expected values and the actual values
    • The code was mostly pretty clean (clean enough for an SO post anyway)

    In short, it had pretty much everything I ask for in a question. Yay! Additionally, the result seemed to be strange. The chances of any one of Java’s hashing algorithms being broken seem pretty slim, but four of them? Okay, now you’ve got me interested.

    Reproducing the problem

    Unless I can spot the error immediately, I usually try to reproduce the problem in a Stack Overflow post with a short but complete program. In this case, the program was already provided, so it was just a matter of copy/paste/compile/run. This one had the additional tweak that it was comparing the results of Java with the results of other tools, so I had to get hold of an MD5 sum tool first. (I chose to start with MD5 for no particular reason.) I happened to pick this one, but it didn’t really seem it would make much difference. (As it happens, that was an incorrect assumption, but hey…)

    I ran md5sums on c:\Windows\System32\calc.exe, and got the same result as the poster. Handy.

    I then ran the Java code, and again got the same result as the poster: step complete, we have a discrepancy between at least one tool (and MD5 isn’t that hard to get right) and Java.

    Looking for obvious problems

    The code has four main areas:

    • Reading a file
    • Updating digests
    • Converting bytes to hex
    • Storing and displaying results

    Of these, all of the first three have common and fairly simple error modes. For example:

    • Failure to use the return value from InputStream.read()
    • Failure to update the digests using only the relevant portion of the data buffer
    • Failure to cope with Java’s signed bytes

    The code for storing and displaying results seemed solid enough to ignore to start with, and brief inspection suggested that the first two failure modes had been avoided. While the hex code didn’t have any obvious problems either, it made sense to check it. I simply printed the result of hard-coding the "known good" CRC-32 value:

    System.out.println(toHex(new byte[] {
        (byte) 0x8D, (byte) 0x8F, (byte) 0x5F, (byte) 0x8E
      }));

    That produced the right result, so I ruled out that part of the code too. Even if it had errors in some cases, we know it’s capable of producing the right string for one of the values we know we should be returning, so it can’t be getting that value.

    Initial checks around the file

    I’m always suspicious of stream-handling code – or rather, I know how easily it can hide subtle bugs. Even though it looked okay, I thought I’d check – so I added a simple total to the code so I could see how many bytes had been hashed. I also removed all the hash algorithms other than MD5, just for simplicity:

    MessageDigest md5 = MessageDigest.getInstance("MD5");

    FileInputStream fis = new FileInputStream(file);
    byte data[] = new byte[size];
    int len = 0;
    int total = 0;
    while ((len = fis.read(data)) != -1) {
        md5.update(data, 0, len);
        total += len;
    }
    fis.close();
    System.out.println("Total bytes read: " + total);

    results.put(md5.getAlgorithm(), toHex(md5.digest()));

    It’s worth noting that I haven’t tried to fix up bits of the code which I know I would change if I were actually doing a code review:

    • The stream isn’t being closed in a finally block, so we’ll have a dangling resource (until GC) if an IOException is thrown
    • The initial value of len is never read, and can be removed

    Neither of these matters in terms of the problem at hand, and closing the file "properly" would make the sample code more complicated. (For the sake of just a short sample program, I’d be tempted to remove it entirely.)

    The result showed the number of bytes being read as the command prompt did when I ran "dir c:\Windows\System32\Calc.exe" – so again, everything looks like it’s okay.

    Desperate times call for stupid measures

    Just on a whim, I decided to copy Calc.exe to a local folder (the current directory) and retry. After all, accessing a file in a system folder might have some odd notions applied to it. It’s hard to work out what, but… there’s nothing to lose just by trying a simple test. If it can rule anything out, and you’ve got no better ideas, you might as well try even the daftest idea.

    I modified the source code to use the freshly-copied file, and it gave the same result. Hmm.

    I then reran md5sums on the copied file… and it gave the same result as Java. In other words, running "md5sums c:\Windows\System32\Calc.exe" gave one result, but "md5sums CopyOfCalc.exe" gave a different result. At this point we’ve moved from Java looking like it’s behaving weirdly to md5sums looking suspicious.

    Proving the root cause

    At this point we’re sort of done – we’ve basically proved that the Java code produces the right hash for whatever data it’s given, but we’re left with the question of what’s happening on the file system. I had a hunch that it might be something to do with x86 vs x64 code (all of this was running on a 64-bit version of Windows 7) – so how do we test that assumption?

    I don’t know if there’s a simple way of running an x86 version of the JVM, but I do know how to switch .NET code between x86 and x64 – you can do that for an assembly at build time. C# also makes the hashing and hex conversion simple, so I was able to knock up a very small app to show the problem:

    using System;
    using System.IO;
    using System.Security.Cryptography;

    class Test
    {
        static void Main()
        {
            using (var md5 = MD5.Create())
            {
                string path = "c:/Windows/System32/Calc.exe";
                var bytes = md5.ComputeHash(File.ReadAllBytes(path));
                Console.WriteLine(BitConverter.ToString(bytes));
            }
        }
    }

    (For a larger file I’d have used File.OpenRead instead, but then I’d have wanted to close the stream afterwards. Somehow it wasn’t worth correcting the existing possible handle leak in the Java code, but I didn’t want to write leaky code myself. So instead I’ve got code which reads the whole file into memory unnecessarily… <sigh>)

    You can choose the architecture to build against (usually AnyCPU, x86 or x64 – though it’s interesting to see that "arm" is also an option under .NET 4.5, for obvious reasons) either from Visual Studio or using the "/platform" command line option. This doesn’t change the IL emitted (as far as I’m aware) but it’s used for interop with native code – and in the case of executables, it also determines the version of the CLR which is used.

    Building and running in x86 mode gave the same answer as the original "assumed to be good" tools; building and running in x64 mode gave the same answer as Java.

    Explaining the root cause

    At this point we’ve proved that the file system gives different results depending on whether you access it from a 64-bit process or a 32-bit process. The final piece of the puzzle was to find some something to explain why that happens. With all the evidence about what’s happening, it was now easy to search for more information, and I found this article giving satisfactory details. Basically, there are two different copies of the system executables on a 64 bit system: x86 ones which run under the 32-bit emulator, and x64 ones. They’re actually in different directories, but when a process opens a file in \Windows\System32, the copy which matches the architecture of the process is used. It’s almost as if the \Windows\System32 directory is a symlink which changes target depending on the current process.

    A Stack Overflow comment on my answer gave a final nugget that this is called the "File System Redirector".

    Conclusion

    Debugging sessions often feel a bit like this – particularly if you’re like me, and only resort to real debugging after unit testing has failed. It’s a matter of looking under all kinds of rocks, trying anything and everything, but keeping track of everything you try. At the end of process you should be able to explain every result you’ve seen, in an ideal world. (You may not be able to give evidence of the actual chain of events, but you should be able to construct a plausible chain of events which concurs with your theory.)

    Be aware of areas which can often lead to problems, and check those first, gradually reducing the scope of "stuff you don’t understand" to a manageable level, until it disappears completely.

    Upcoming speaking engagements

    It’s just occurred to me that I’ve forgotten to mention a few of the things I’ll be up to in the near-ish future. (I’ve talked about next week’s Progressive .NET session before.) This is just a quick rundown – follow the links for more blurb and details.

    .NET Developer Network – Bristol, September 21st (evening)

    I’ll be talking about async in Bristol – possibly at a high level, possibly in detail, depending on the audience experience. This is my first time talking with this particular user group, although I’m sure there’ll be some familiar faces. Come along if you’re in the area.

    Øredev 2011 – Malmö, November 9th

    It’s a whistle-stop trip to Sweden as I’m running out of vacation days; I’m flying out on the Tuesday evening and back on the Wednesday evening, but while I’m there I’ll give two talks:

    • Async 101 (yes, more async; I wonder at what point I’ll have given as many talks about it as Mads)
    • Effective technical communication (not a particularly technical talk, but definitely specific to technical communication)

    Last year I had an absolute blast – looking forward to this year, even though I won’t have as much time for socializing.

    Stack Overflow Dev Days 2011 – London, November 14th – cancelled!

    Update: Dev Days has been cancelled. I’m still hoping to do something around this topic, and there may be small-scale meet-ups in London anyway.

    Two years ago I talked about how humanity had let the world of software engineering down. This was one of the best talks I’ve ever given, and introduced the world to Tony the Pony. Unfortunately that puts the bar relatively high for this year’s talk – at least, high by my own pretty low standards.

    In a somewhat odd topic for a Christian and a happy employee of a company with a code of conduct which starts "Don’t be evil," this year’s talk is entitled "Thinking in evil." As regular readers are no doubt aware, I love torturing the C# language and forcing the compiler to work with code which would make any right-thinking software engineer cringe. I was particularly gratified recently when Eric Lippert commented on one of my Stack Overflow answers that this was "the best abuse of C# I’ve seen in a while." I’m looking forward to talking about why I think it’s genuinely a good idea to think about nasty code like this – not to use it, but to get to know your language of choice more intimately. Like last time, I have little idea of exactly what this talk will be like, but I’m really looking forward to it.

    Writing the perfect question

    Update: now that I’ve actually posted this, I’ve added a tinyurl to it for easy reference: http://tinyurl.com/so-hints. Nice and easy to remember for comments :)

    A while ago, I wrote a blog entry on how to answer questions helpfully on sites like Stack Overflow. Recently I saw a meta question about bad questions and thought it would be worth following up with another blog post on asking questions. For the sake of convenience – and as Stack Overflow is so popular – I will assume the question is going to be asked on Stack Overflow or a similar Stack Exchange site. Most of the post doesn’t actually depend on that, but if you’re asking elsewhere you may need to tweak the advice a little.

    There are plenty of similar resources around, of course – in particular, Eric Raymond’s How to Ask Questions the Smart Way is a perennial favourite. Still, I think I can bring something to the table.

    The Golden Rule: Imagine You’re Trying To Answer The Question

    If you don’t remember anything else from this post, remember this bit. Everything else follows from here. (And yes, this does smack somewhat of Matthew 7:12.)

    Once you’ve finished writing your question, read it through. Imagine you were coming to it fresh, with no context other than what’s on the screen. Does it make sense? Is it clear what’s being asked? Is it easy to read and understand? Are there any obvious areas you’d need to ask about before providing an answer? You can usually do this pretty well however stuck you are on the actual question. Just apply common sense. If there’s anything wrong with the question when you’re reading it, obviously that will be a problem for whoever’s actually trying to answer it. So fix the problems. Improve the question until you can read it and think, "If I only knew the answer to the question, it would be a pleasure to provide that answer." At that point, post and wait for the answers to come rolling in.

    Obviously this is somewhat easier to do if you have a certain amount of experience answering questions, particularly on the forum where you’re about to post. So, what should you be looking out for?

    Question title

    When a reader first sees your question, they’re likely to be scrolling down a list of snippets. The most eye-catching part of the snippet will be the title – so use that text wisely. While you can include language or platform information, you should only do so naturally – not as a sort of "header". For example, this is bad:

    Java: Why are bytes signed?

    But this is okay:

    Why are bytes signed in Java?

    Of course, you should also include this information in tags, as it will help people who pay particular attention to specific tags.

    Ideally, a question title should be a question – but frankly that’s not always feasible. I would recommend favouring a short, descriptive title which captures the theme of the question without actually being a question instead of really trying to crowbar it into the form of a question when it really doesn’t want to be. That’s not an excuse for laziness though – it’s usually possible to come up with a good title which is genuinely a question.

    It’s important that the question title is specific though, and has at least some meaning with no other information. A question such as "Why doesn’t this work?" makes absolutely no sense without the rest of the question. Likewise a "question" title of "Please help" is unlikely to do well.

    Context

    In most cases, anyone answering the question will need to know what language and platform you’re using. The basics should usually be communicated through tags, but it may very well be worth providing more information:

    • Language version (e.g. C# 4)
    • Platform version (e.g. .NET 3.5; note that this isn’t always implicit from the language version, or vice versa)
    • Operating system, if it could be relevant (e.g. particular permissions issues)
    • Any other relevant software (e.g. database type and version, the IDE you’re using, web server you’re connecting to)
    • Any other constraints. This is particularly important. It’s really annoying to give a perfectly good answer to a question, only to be told that you’re not allowed to use feature X or Y which provide the obvious solution.
      • If you have unusual constraints, it’s worth explaining why. Not only does this answer the obvious follow-up comment, but it also gives more information about what other solutions may not be applicable.

    Describe what you’ve already tried and the results of any research. (You have searched for a solution to your problem before asking it, haven’t you? Stack Overflow isn’t meant to replace basic search skills.) Often there will be other people in a similar situation, but the answers didn’t quite match your situation. Just like the above point about unusual constraints, it saves time if you can point out differences between your situation and other common ones. It’s even worth referring to other related questions explicitly – particularly if they’re on the same forum. Aside from anything else, this shows a certain amount of "due diligence" – people are generally more willing to help you if can show you’ve already put some effort in.

    You should absolutely make sure that you tag the question appropriately. If you’re not sure which exact tags are appropriate, see what gets auto-suggested and look at samples for each one. If that sounds like a lot of work, just remember how much time you may be able to save yourself in the long run. It gets easier over time, of course.

    Problem statement

    Make sure it’s obvious what you’re trying to get out of the question. Too many "questions" are actually just statements: when I do X, something goes wrong.

    Well, what did you expect it to do? What are you trying to accomplish? What have you already tried? What happened in those attempts? Be detailed: in particular, if something didn’t work, don’t just state that: tell us how it failed. If it threw an exception, what was the exception? (Don’t just give the type – give the error message and say which line threw it. If there’s a nested exception, post that too.)

    If at all possible, write a sort of "executive summary" at the start of your question, followed by a more detailed description. Remember that on the list of questions, the first few sentences will appear as a snippet. If you can get a sense of the question across in that snippet, you’re more likely to attract views from people who can answer the question.

    One trap that many posters fall into is to ask how to achieve some "small" aim, but never say what the larger aim is. Often the smaller aim is either impossible or rarely a good idea – instead, a different approach is needed. Again, if you provide more context when writing your problem statement, we can suggest better designs. It’s fine to specify how you’re currently trying to solve your bigger problem, of course – that’s likely to be necessary detail – but include the bigger goal too.

    Sample code and data

    I may be biased on this one. I’m a huge believer in sample code, both for questions and answers… and I probably use it in an unconventional way. I usually paste it into a text editor, and try to compile it from the command line. If that’s not likely to work (and the problem isn’t obvious by inspection), I’m unlikely to bother too much with it. Firing up Eclipse or Visual Studio and finding an existing project I don’t care about or starting a new one is going to take much more time.

    That means if you want me to look at code, it should:

    • Be standalone. Don’t try to talk to a database unless you really have to. (Obviously for database questions, that’s kinda hard :) If you use sample XML, provide a short but complete XML file for us to reproduce the issue with. (And the same for other file types, obviously.)
    • Be complete. If there are missing imports or using directives, that’s really annoying
    • Actually compile (unless the compilation error is the reason for the question). Don’t give me code which is "something like" the real code but which clearly isn’t the real code, and may well not exhibit the same symptoms by the time I’ve fixed it so that it compiles.
    • Ideally not bring up a UI. Unless your code is about a UI issue, don’t bring one up. Console apps are simpler, and simplicity is a huge benefit when trying to hunt down a problem.
    • Demonstrate the problem. You should be able to say, "I expected the result to be X, it’s actually Y." (You should actually say that too, so that we can check that we get the same results.)
    • Be as short as possible. If I have to wade through hundreds of lines of code to find the problem, I’m doing work that you should be doing. Often if you work hard to reduce the problem to a short but complete program, you’ll find the issue yourself. You can absolutely do this without knowing what the problem is; you should be looking to the community for their expertise, not their willingness to spend time on your problem doing the work that you can do yourself.

    Yes, this is a relatively onerous list. It doesn’t all apply to every problem, but it does apply in a great many situations. While I get put off by reams of irrelevant, badly formatted code, some of which clearly won’t compile, the inverse is true as well: if I can tell by looking at the question that the code can go through a copy/paste/compile/run cycle really quickly, I’m much more likely to pay the question significant attention.

    In data-oriented questions, it’s very often helpful to give some sample data. Cut out anything irrelevant (if your real table has 50 columns, you only need to include relevant ones) but make sure that you give enough sample input for it to be meaningful. For example, if you’re trying to group some data by a PersonID column, it’s pretty useless if there’s only one PersonID given, or if each PersonID only appears once. If you are giving examples of expected input and output, make sure it’s clear why that’s the expected output. Often I see questions which give a small number of samples, and there are various ways they could be interpreted. This is one area where it’s particularly important to reread the question from a stranger’s point of view: while a brief summary of the desired results may well make sense to someone who already knows what your app is trying to achieve, it may be gobbledygook to those trying to answer your question.

    Spelling, grammar and formatting

    I know not everyone speaks English natively. My own command of non-English languages is lamentably poor – I’m incredibly lucky that my native tongue happens to be the lingua franca of the technical world. However, if you’re trying to communicate on an English-language forum, you owe it to yourself to make an effort to write at least reasonably correct English.

    • Please use capital letters where appropriate. It really can make a big difference in the readability of text.
    • Please split your text into paragraphs. Imagine this blog post as one big paragraph – it would be almost impossible to read.
    • Please write actual words. There are undoubtedly some abbreviations which are acceptable to most readers – IMO, IIRC etc -  there’s no reason to switch into text-speak with "gr8", "bcoz", "u" and so forth. It’s unlikely that you’re actually writing your question on a phone with only a primitive keyboard; show your readers respect by writing properly. It may take you a few more seconds, but if it means you get an answer quicker, it’s surely worth the extra effort.
    • Most browsers have built-in spelling checkers these days, or at least have plug-ins or extensions available to check your text. Technical text often creates a lot of false positives for checkers, but if your spelling isn’t generally great, it’s worth looking at the suggestions.

    Having said all of this, you’re not trying to create a literary masterpiece. You’re trying to communicate your question as effectively as possible. If you’re faced with the choice between an unambiguous but ugly sentence, or a phrase which stirs the soul but leaves the reader confused about exactly what you mean, go for the unambiguous option every time.

    One way a huge number of questions can be improved with very little effort is simply formatting them properly. Stack Overflow’s markdown editor is very good – the preview below your input box is almost always accurate in terms of the eventual result, and you can always edit the question later if anything doesn’t quite work. The exact details of the markdown is beyond the scope of this article – Stack Overflow has a detailed guide though – if you’re new to the site, I’d recommend you at least skim through it.

    By far the most important kind of formatting is making code look like code. Within a text paragraph, simply surround the code with backticks `like this`. For blocks of code, just indent everything by four spaces. If you’re cutting and pasting code, it may already be indented (for example if you’re copying code within a class) but if not, the easiest way to indent everything is to paste it, select the whole code block, and then press Ctrl-K or the “{ }” button just above the editor.

    One of the important things about code formatting is that it means angle brackets (and some other symbols) are preserved instead of being swallowed by the markdown formatter. In some cases this can mean all the difference between a question which is easy to answer and one which doesn’t make any sense, particularly in terms of generics in Java and C# or templates in C++. For example, like this

    Why can’t I convert an expression of type List<string> to List<object>?

    makes no sense at all if the type arguments are removed:

    Why can’t I convert an expression of type List to List?

    Often experienced members of the site will recognise what’s going on and edit your question for you, but obviously it’s better if they don’t have to.

    Making a good impression

    Leaving aside the main body of the question, there are a few simple ways to get the community "on your side" and therefore more likely to give you a useful answer quickly.

    • Register as a user and give yourself a meaningful name. It doesn’t have to be your real name, but frankly names like "Top Coder" or "Coding Guru" look pretty silly when you’re asking a question which others find simple. That’s still better than leaving yourself as "user154232" or whatever identifier is assigned to you by default though. Aside from anything else, it shows a certain amount of commitment to the question and/or site: if you’ve bothered to give yourself a name, you’re less likely to be an "ask-and-run" questioner.
    • Keep an eye on your question. There may well be requests for clarification – and of course, answers! If you receive an answer which wasn’t quite what you were looking for, explain carefully (and politely) why it’s not suitable for your purposes. Consider going back and editing your question to make it clearer for subsequent users.
    • Don’t add your own answer unless it really is an answer. Often users add extra details in an "answer" when they should really have just edited their question. Likewise editing your question is generally a better idea than adding a long comment to an existing answer – particularly if that comment contains a block of code (which won’t work well in a comment). If you do change the question in response to an answer though, it’s worth adding a comment to the answer just to let the user know that you’ve updated it though… you may well find they quickly edit their answer to match the revised question.
    • There’s no need to include greetings and sign-offs such as "Hi everyone!" and "Thanks – hope to get an answer soon" in the question. These will often be edited out by other users, as they’re basically a distraction. Greetings at the start of a question are particularly useless as they can take up valuable space in the snippet displayed in the question list.
    • Above all, be polite. Remember that no-one is getting paid to answer your question. Users are giving up their time to help you – so please be appreciative of that. If you’re asking a homework question, explain why you’re asking for help with something that traditionally you’d have to answer all by yourself. If a user suggests that your general approach is wrong and that there’s a better way of doing things, don’t take it personally: they’re trying to help you improve your code. By all means disagree robustly, but don’t start into ad hominem arguments. (This advice applies to answerers as well, of course.)
    • (Somewhat specific to Stack Overflow.) If an answer is particularly helpful or solves your problem, accept it by clicking on the tick mark by it. After you’ve asked a certain number of questions, an accept rate will be shown by your username in the question. Some users get annoyed with those with low accept rates (i.e. people who ask a lot of questions but rarely accept answers). Personally I’m not overly bothered by accept rates, but accepting an answer is still a generally good thing to do if it solved your problem. I certainly wouldn’t suggest accepting answers just to get your accept rate, if they didn’t help you.

    Conclusion and feedback

    Stack Overflow is an amazing resource (along with other Q&A sites, of course). The idea that you can get a good answer to a wide range of questions within minutes is pretty staggering… but there’s an obvious correlation between the quality of a question and the likelihood that you’ll get quick, helpful answers. Put that extra bit of effort in yourself, and it will probably pay for itself very quickly.

    I’m hoping to keep this blog post up to date with suggestions received – if I’ve missed out anything, over- or under-emphasized a specific point, or generally gone off track, let me know either in the comments here or mail me (skeet@pobox.com). If this document ends up elsewhere, then that copy may end up being the "canonical" one which is edited over time – in which case I’ll indicate that here.

    "Magic" null argument testing

    Warning: here be dragons. I don’t think this is the right way to check for null arguments, but it was an intriguing idea.

    Today on Stack Overflow, I answered a question about checking null arguments. The questioner was already using an extension similar to my own one in MiscUtil, allowing code like this:

    public void DoSomething(string name)
    {
        name.ThrowIfNull("name");

        // Normal code here
    }

    That’s all very well, but it’s annoying to have to repeat the name part. Now in an ideal world, I’d say it would be nice to add an attribute to the parameter and have the check performed automatically (and when PostSharp works with .NET 4.0, I’m going to give that a go, mixing Code Contracts and AOP…) – but for the moment, how far can we go with extension methods?

    I stand by my answer from that question – the code above is the simplest way to achieve the goal for the moment… but another answer raised the interesting prospect of combining anonymous types, extension methods, generics, reflection and manually-created expression trees. Now that’s a recipe for hideous code… but it actually works.

    The idea is to allow code like this:

    public void DoSomething(string name, string canBeNull, int foo, Stream input)
    {
        new { name, input }.CheckNotNull();

        // Normal code here
    }

    That should check name and input, in that order, and throw an appropriate ArgumentNullException – including parameter name – if one of them is null. It uses the fact that projection initializers in anonymous types use the primary expression’s name as the property name in the generated type, and the value of that expression ends up in the instance. Therefore, given an instance of the anonymous type initializer like the above, we have both the name and value despite having only typed it in once.

    Now obviously this could be done with normal reflection – but that we be slow as heck. No, we want to effectively find the properties once, and generate strongly typed delegates to perform the property access. That sounds like a job for Delegate.CreateDelegate, but it’s not quite that simple… to create the delegate, we’d need to know (at compile time) what the property type is. We could do that with another generic type, but we can do better than that. All we really need to know about the value is whether or not it’s null. So given a "container" type T, we’d like a bunch of delegates, one for each property, returning whether that property is null for a specified instance – i.e. a Func<T, bool>. And how do we build delegates at execution time with custom logic? We use expression trees…

    I’ve now implemented this, along with a brief set of unit tests. The irony is that the tests took longer than the implementation (which isn’t very unusual) – and so did writing it up in this blog post. I’m not saying that it couldn’t be improved (and indeed in .NET 4.0 I could probably make the delegate throw the relevant exception itself) but it works! I haven’t benchmarked it, but I’d expect it to be nearly as fast as manual tests – insignificant in methods that do real work. (The same wouldn’t be true using reflection every time, of course.)

    The full project including test cases is now available, but here’s the (almost completely uncommented) "production" code.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Reflection;
    using System.Linq.Expressions;

    public static class Extensions
    {
        public static void CheckNotNull<T>(this T container) where T : class
        {
            if (container == null)
            {
                throw new ArgumentNullException("container");
            }
            NullChecker<T>.Check(container);
        }

        private static class NullChecker<T> where T : class
        {
            private static readonly List<Func<T, bool>> checkers;
            private static readonly List<string> names;

            static NullChecker()
            {
                checkers = new List<Func<T, bool>>();
                names = new List<string>();
                // We can’t rely on the order of the properties, but we
                // can rely on the order of the constructor parameters
                // in an anonymous type – and that there’ll only be
                // one constructor.
                foreach (string name in typeof(T).GetConstructors()[0]
                                                 .GetParameters()
                                                 .Select(p => p.Name))
                {
                    names.Add(name);
                    PropertyInfo property = typeof(T).GetProperty(name);
                    // I’ve omitted a lot of error checking, but here’s
                    // at least one bit…
                    if (property.PropertyType.IsValueType)
                    {
                        throw new ArgumentException
                            ("Property " + property + " is a value type");
                    }
                    ParameterExpression param = Expression.Parameter(typeof(T), "container");
                    Expression propertyAccess = Expression.Property(param, property);
                    Expression nullValue = Expression.Constant(null, property.PropertyType);
                    Expression equality = Expression.Equal(propertyAccess, nullValue);
                    var lambda = Expression.Lambda<Func<T, bool>>(equality, param);
                    checkers.Add(lambda.Compile());
                }
            }

            internal static void Check(T item)
            {
                for (int i = 0; i < checkers.Count; i++)
                {
                    if (checkers[i](item))
                    {
                        throw new ArgumentNullException(names[i]);
                    }
                }
            }
        }
    }

    Oh, and just as a miracle – the expression tree worked first time. I’m no Marc Gravell, but I’m clearly improving :)

    Update: Marc Gravell pointed out that the order of the results of Type.GetProperties isn’t guaranteed – something I should have remembered myself. However, the order of the constructor parameters will be the same as in the anonymous type initialization expression, so I’ve updated the code above to reflect that. Marc also showed how it could almost all be put into a single expression tree which returns either null (for no error) or the name of the "failing" parameter. Very clever :)

    Just how spiky is your traffic?

    No, this isn’t the post about dynamic languages I promise. That will come soon. This is just a quick interlude. This afternoon, while answering a question on Stack Overflow1 about the difference between using an array and a Dictionary<string, string> (where each string was actually the string representation of an integer) I posted the usual spiel about preferring readable code to micro-optimisation. The response in a comment – about the performance aspect – was:

    Well that’s not so easily said for a .com where performance on a site that receives about 1 million hits a month relies on every little ounce of efficiency gains you can give it.

    A million hits a month, eh? That sounds quite impressive, until you actually break it down. Let’s take a month of 30 days – that has 30 * 24 * 60 * 60 = 2,592,000 seconds2. In other words, a million hits a month is less than one hit every two seconds. Not so impressive. At Google we tend to measure traffic in QPS (queries per second, even if they’re not really queries – the search terminology becomes pervasive) so this is around 0.39 QPS. Astonished that someone would make such a claim in favour of micro-optimisation at that traffic level, I tweeted about it. Several of the replies were along the lines of "yeah, but traffic’s not evenly distributed." That’s entirely true. Let’s see how high we can make the traffic without going absurd though.

    Let’s suppose this is a site which is only relevant on weekdays – that cuts us down to 20 days in the month. Now let’s suppose it’s only relevant for one hour per day – it’s something people look at when they get to work, and most of the users are in one time zone. That’s a pretty massive way of spiking. We’ve gone down from 30 full days of traffic to 20 hours – or 20 * 60 * 60 = 72000 seconds, giving 14 QPS. Heck, let’s say the peak of the spike is double that – a whopping 28 QPS.

    Three points about this:

    • 28 QPS is still not a huge amount of traffic.
    • If you’re really interested in handling peak traffic of ~28 QPS without latency becoming huge, it’s worth quoting that figure rather than "a million hits a month" because the latter is somewhat irrelevant, and causes us to make wild (and probably wildly inaccurate) guesses about your load distribution.
    • If you’re going to bring the phrase "a .com" into the picture, attempting to make it sound particularly important, you really shouldn’t be thinking about hosting your web site on one server – so the QPS gets diluted again.
    • Even at 28 QPS, the sort of difference that would be made here is tiny. A quick microbenchmark (with all the associated caveats) showed that on my laptop (hardly a server-class machine) I could build the dictionary and index into it 3 times 2.8 million times in about 5 seconds. If every request needed to do that 100 times, then the cost of doing it 28 requests per second on my laptop would still only be 0.5% of that second – not a really significant benefit, despite the hugely exaggerated estimates of how often we needed to do that.

    There are various other ways in which it’s not a great piece of code, but the charge against premature optimization still stands. You don’t need to get every little ounce of efficiency out of your code. Chances are, if you start guessing at where you can get efficiency, you’re going to be wrong. Measure, measure, measure – profile, profile, profile. Once you’ve done all of that and proved that a change reducing clarity has a significant benefit, go for it – but until then, write the most readable code you can. Likewise work out your performance goals in a meaningful fashion before you worry too much – and hits per months isn’t a meaningful figure.

    Performance is important – too important to be guessed about instead of measured.


    1 I’m not linking to it because the Streisand effect would render this question more important than it really is. I’m sure you can find it if you really want to, but that’s not the point of the post.

    2 Anyone who wants to nitpick and talk about months which are a bit longer or shorter than that due to daylight saving time changes (despite still being 30 days) can implement that logic for me in Noda Time.

    Revisiting randomness

    Almost every Stack Overflow question which includes the words "random" and "repeated" has the same basic answer. It’s one of the most common "gotchas" in .NET, Java, and no doubt other platforms: creating a new random number generator without specifying a seed will depend on the current instant of time. The current time as measured by the computer doesn’t change very often compared with how often you can create and use a random number generator – so code which repeatedly creates a new instance of Random and uses it once will end up showing a lot of repetition.

    One common solution is to use a static field to store a single instance of Random and reuse it. That’s okay in Java (where Random is thread-safe) but it’s not so good in .NET – if you use the same instance repeatedly from .NET, you can corrupt the internal data structures.

    A long time ago, I created a StaticRandom class in MiscUtil – essentially, it was just a bunch of static methods (to mirror the instance methods found in Random) wrapping a single instance of Random and locking appropriately. This allows you to just call StaticRandom.Next(1, 7) to roll a die, for example. However, it has a couple of problems:

    • It doesn’t scale well in a multi-threaded environment. When I originally wrote it, I benchmarked an alternative approach using [ThreadStatic] and at the time, locking won (at least on my computer, which may well have only had a single core).
    • It doesn’t provide any way of getting at an instance of Random, other than by using new Random(StaticRandom.Next()).

    The latter point is mostly a problem because it encourages a style of coding where you just use StaticRandom.Next(…) any time you want a random number. This is undoubtedly convenient in some situations, but it goes against the idea of treating a source of randomness as a service or dependency. It makes it harder to get repeatability and to see what needs that dependency.

    I could have just added a method generating a new instance into the existing class, but I decided to play with a different approach – going back to per-thread instances, but this time using the ThreadLocal<T> class introduced in .NET 4.0. Here’s the resulting code:

    using System;
    using System.Threading;

    namespace RandomDemo
    {
        /// <summary>
        /// Convenience class for dealing with randomness.
        /// </summary>
        public static class ThreadLocalRandom
        {
            /// <summary>
            /// Random number generator used to generate seeds,
            /// which are then used to create new random number
            /// generators on a per-thread basis.
            /// </summary>
            private static readonly Random globalRandom = new Random();
            private static readonly object globalLock = new object();

            /// <summary>
            /// Random number generator
            /// </summary>
            private static readonly ThreadLocal<Random> threadRandom = new ThreadLocal<Random>(NewRandom);

            /// <summary>
            /// Creates a new instance of Random. The seed is derived
            /// from a global (static) instance of Random, rather
            /// than time.
            /// </summary>
            public static Random NewRandom()
            {
                lock (globalLock)
                {
                    return new Random(globalRandom.Next());
                }
            }

            /// <summary>
            /// Returns an instance of Random which can be used freely
            /// within the current thread.
            /// </summary>
            public static Random Instance { get { return threadRandom.Value; } }

            /// <summary>See <see cref="Random.Next()" /></summary>
            public static int Next()
            {
                return Instance.Next();
            }

            /// <summary>See <see cref="Random.Next(int)" /></summary>
            public static int Next(int maxValue)
            {
                return Instance.Next(maxValue);
            }

            /// <summary>See <see cref="Random.Next(int, int)" /></summary>
            public static int Next(int minValue, int maxValue)
            {
                return Instance.Next(minValue, maxValue);
            }

            /// <summary>See <see cref="Random.NextDouble()" /></summary>
            public static double NextDouble()
            {
                return Instance.NextDouble();
            }

            /// <summary>See <see cref="Random.NextBytes(byte[])" /></summary>
            public static void NextBytes(byte[] buffer)
            {
                Instance.NextBytes(buffer);
            }
        }
    }

    The user can still call the static Next(…) methods if they want, but they can also get at the thread-local instance of Random by calling ThreadLocalRandom.Instance – or easily create a new instance with ThreadLocalRandom.NewRandom(). (The fact that NewRandom uses the global instance rather than the thread-local one is an implementation detail really; it happens to be convenient from the point of view of the ThreadLocal<T> constructor. It wouldn’t be terribly hard to change this.)

    Now it’s easy to write a method which needs randomness (e.g. to shuffle a deck of cards) and give it a Random parameter, then call it using the thread-local instance:

    public void Shuffle(Random rng)
    {
        …
    }

    deck.Shuffle(ThreadLocalRandom.Instance);

    The Shuffle method is then easier to test and debug, and expresses its dependency explicitly.

    Performance

    I tested the "old" and "new" implementations in a very simple way – for varying numbers of threads, I called Next() a fixed number of times (from each thread) and timed how long it took for all the threads to finish. I’ve also tried a .NET-3.5-compatible version using ThreadStatic instead of ThreadLocal<T>, and running the original code and the ThreadStatic version on .NET 3.5 as well.

    Threads StaticRandom (4.0b2) ThreadLocalRandom (4.0b2) ThreadStaticRandom (4.0b2) StaticRandom(3.5) ThreadStaticRandom (3.5)
    1 11582 6016 10150 10373 16049
    2 24667 7214 9043 25062 17257
    3 38095 10295 14771 36827 25625
    4 49402 13435 19116 47882 34415

    A few things to take away from this:

    • The numbers were slightly erratic; somehow it was quicker to do twice the work with ThreadStaticRandom on .NET 4.0b2! This isn’t the perfect benchmarking machine; we’re interested in trends rather than absolute figures.
    • Locking hasn’t changed much in performance between framework versions
    • ThreadStatic has improved massively between .NET 3.5 and 4.0
    • Even on 3.5, ThreadStatic wins over a global lock as soon as there’s contention

    The only downside to the ThreadLocal<T> version is that it requires .NET 4.0 – but the ThreadStatic version works pretty well too.

    It’s worth bearing in mind that of course this is testing the highly unusual situation where there’s a lot of contention in the global lock version. The performance difference in the single-threaded version where the lock is always uncontended is still present, but very small.

    Update

    After reading the comments and thinking further, I would indeed get rid of the static methods elsewhere. Also, for the purposes of dependency injection, I agree that it’s a good idea to have a factory interface where that’s not overkill. The factory implementation could use either the ThreadLocal or ThreadStatic implementations, or effectively use the global lock version (by having its own instance of Random and a lock). In many cases I’d regard that as overkill, however.

    One other interesting option would be to create a thread-safe instance of Random to start with, which delegated to thread-local "normal" implementations. That would be very useful from a DI standpoint.