Coding for Branch Prediction

June 27th, 2013 · .net, c#, development

Modern CPUs don’t work the way most programmers think. We’re taught a model in which the processor has an instruction pointer that increments through our code, jumping to the right place when we call a function or hit a loop or conditional, and a line executes when the instruction pointer reaches it. That used to be accurate, but today things are a lot more complicated. Chip manufacturers like Intel and AMD have learned that they can optimize this process to get better performance than the simplistic model would allow, and so we have things like out-of-order execution, branch prediction, and floating point bugs.

The good news is that the chip manufacturers work very hard to make sure that the results, at least, are always what a programmer expects. Not only that, but performance is better on average, too. The trick is that pesky average: in certain circumstances, performance might actually be worse. And even when it’s better, it might not be as much better as it could be.

Take loops, for example. Say you have some data in a collection, and you want to perform an operation on every element in the collection. But maybe the value of each element changes what operation you want to perform. Let’s start with a collection and some Dictionaries to sort the items into, like this:

// one million sequential ints, and pre-sized dictionaries to sort them into
var items = Enumerable.Range(0, 1000000);
var ints = new Dictionary<int, string>(1000000);
var doubles = new Dictionary<double, string>(1000000);

The argument to the dictionary constructors is there to avoid memory reallocation during my sample runs, to get better benchmarks. I also have some code not shown that does things like warm up the JITter, to make sure I get good measurements. But let’s get into it.
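If you’re curious, here’s a minimal sketch of the kind of timing harness I mean; the Measure helper is illustrative, not the exact code behind the numbers in this post:

static long Measure(Action body)
{
    body(); // warm-up pass, so the JITter has already compiled everything we time
    GC.Collect();
    GC.WaitForPendingFinalizers(); // start each run from a clean heap
    var sw = System.Diagnostics.Stopwatch.StartNew();
    body();
    sw.Stop();
    return sw.ElapsedTicks;
}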

You might be tempted to write code something like this:

foreach (int item in items)
{
    // different code branches that still do significant work in the cpu
    if (item % 3 == 0)
    {   // force hash computation and multiplication op (both cpu-bound)
        ints[item] = (item * 2).ToString();
    }
    else
    {
        doubles[(double)item] = (item * 3).ToString();
    }
}

This is sub-optimal because of a CPU feature known as branch prediction. Branching is just a fancy way of saying there is an if block in your code: the code will follow one of two different branches, depending on the result of the conditional. Modern CPUs try to guess which branch will be taken and begin executing down that branch before the conditional is actually known. If the CPU guesses right, it’s a performance win. If it guesses wrong, it has to throw away the speculative work and refill the pipeline, which is a big performance loss. We can use this feature to improve on our original code:

// doing MORE work: we evaluate every item twice and allocate two arrays
var intItems = items.Where(i => i % 3 == 0).ToArray();
var doubleItems = items.Where(i => i % 3 != 0).ToArray();

// but the hard-to-predict branch is gone: add all the ints, then all the doubles
foreach (var item in intItems) { ints[item] = (item * 2).ToString(); }
foreach (var item in doubleItems) { doubles[(double)item] = (item * 3).ToString(); }

The result is that, most of the time, my benchmarks show this second option runs faster. A typical result was 1118652 ticks for the naive option and 1005190 for the branch-prediction-friendly option. It’s not a huge win, but it’s real, and for most programmers it’s counter-intuitive. If we had a more challenging conditional expression, or were doing more work inside the branches, the improvement would be greater. Note also that these branch outcomes are well mixed. If one branch were chosen far more often than the other, or if the outcomes tended to come in batches, the original code would win more often.
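To make that “batches” point concrete, here’s a sketch (my illustration, not something I benchmarked for this post): if we group the data so the branch outcome comes in long runs, the predictor locks onto the runs and the naive loop gets cheap again:

// Hypothetical illustration: the same naive loop, but the data is grouped by
// branch outcome first, so the predictor sees long runs of the same answer.
// (Assumes fresh ints/doubles dictionaries.)
var batched = items.OrderBy(i => i % 3 == 0 ? 0 : 1).ToArray();
foreach (int item in batched)
{
    if (item % 3 == 0) { ints[item] = (item * 2).ToString(); }
    else { doubles[(double)item] = (item * 3).ToString(); }
}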

But I’m not done yet. I can think of at least two other ways to take advantage of branch prediction. Here’s what produced the best results in my benchmarks:

var deferred = new List<int>(1000000);
foreach (var item in items)
{
    if (item % 3 == 0)
    {
        ints[item] = (item * 2).ToString();
    }
    else
    {   // don't do the work now; remember the item for a branch-free pass later
        deferred.Add(item);
    }
}
foreach (var item in deferred) { doubles[(double)item] = (item * 3).ToString(); }

This should make no sense to anyone. At first look it seems like the worst of both worlds, and the memory allocation for the list looks like it should be nasty. Yet it produced the best results. Read the code carefully and you’ll see that, compared to the other branch-prediction-friendly option, we save an entire array allocation (maybe several, thanks to the way .ToArray() works) and one complete iteration through the set.
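About that parenthetical: when .ToArray() can’t see the source’s length up front (and the result of a Where can’t), it grows an internal buffer by doubling, so a million-element result churns through several intermediate arrays. Roughly, and much simplified from the real implementation:

// Rough sketch of ToArray() over a source with no known count
// (simplified; the real internal Buffer<T> is more involved):
static T[] ToArraySketch<T>(IEnumerable<T> source)
{
    var buffer = new T[4];
    int count = 0;
    foreach (var item in source)
    {
        if (count == buffer.Length)
            Array.Resize(ref buffer, buffer.Length * 2); // a fresh allocation + copy each doubling
        buffer[count++] = item;
    }
    Array.Resize(ref buffer, count); // the final trim is yet another allocation
    return buffer;
}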

So what do we learn here? I hope your answer is not to change the way you write loops. The performance wins in these contrived tests were small at best. My first attempt to use the feature didn’t get it quite right, and the results were much worse rather than better. Moreover, when I ran the same (working) code on a different model of CPU, the naive code consistently won, because that CPU had very different branch prediction behavior. The real lesson here is, once again, that premature optimization is not worth it. What is faster on one CPU may be much worse on another, and for reasons well outside what you expect. The best way to improve performance is to profile your code, find out where it’s really slow, and focus your efforts there… and while you’re there, measure to confirm that your changes really have the effect you intend. Sure, keep branch prediction in your back pocket as something you can use… but wait until you know it’s worth it.
