Win a free copy of Visual Studio 2010 Best Practices

Win a free copy of ‘Visual Studio 2010 Best Practices’, just by commenting!

We’re giving away two ebook editions of Visual Studio 2010 Best Practices.

All you have to do to win is comment on why you think you should win a copy of the book.

I’ll pick a winner from the most creative answers in two weeks.


Thread synchronization of non-atomic invariants in .NET 4.5

Now that we’ve seen how a singular x86-x64 focus might affect how we can synchronize atomic invariants, let’s look at non-atomic invariants.

While an atomic invariant really doesn't need much in the way of guarding, non-atomic invariants often do; the rules by which the invariant is correct are often much more complex.  Ensuring an atomic invariant like an int is pretty easy: you can't set it to an invalid value; you just need to make sure the value is visible.  Non-atomic invariants involve data that can't natively be modified atomically.  The typical case is more than one variable, but it can also include intrinsic types that are not guaranteed to be modified atomically (like long and decimal).  There is also the fact that not all operations on an atomic type are performed atomically.

For example, let's say I want to deal with a latitude/longitude pair.  That pair of floating-point values is an invariant, so I need to model accesses to the pair as an atomic operation: if I write to latitude, that value shouldn't be "seen" until I also write to longitude.  The following code does not guard that invariant in a concurrent context:

latitude = 39.73;
longitude = -86.27;

If somewhere else I changed these values (for example, to change the location from Indianapolis, IN to Ottawa, ON):

   1: latitude = 45.4112;
   2: longitude = -75.6981;

Another thread reading latitude/longitude while the thread executing the above code was between lines 1 and 2 would see the new latitude paired with the old longitude: a lat/long for some place that is neither Ottawa nor Indianapolis (the two locations being written).  Making these write operations volatile does nothing to make the operation atomic or thread-safe.  For example, the following is still not thread-safe:

   1: Thread.VolatileWrite(ref latitude, 45.4112);
   2: Thread.VolatileWrite(ref longitude, -75.6981);

A thread can still read latitude or longitude after line 1 executes on another thread and before line 2 does.  Given two variables that are publicly visible, the only way to make an operation on both "atomic" is to use lock or a synchronization class like Monitor, Semaphore, Mutex, etc.  For example:

lock(latLongLock)
{
    latitude = 45.4112;
    longitude = -75.6981;
}

Considering latitude and longitude "volatile" doesn't help us at all in this situation; we have to use lock.  And once we use lock, there's no need to consider the variables volatile: no two threads can be in the same critical region at the same time, and any side-effects resulting from executing that critical region are guaranteed to be visible as soon as the lock is released (likewise, any potentially visible side-effects from other threads are guaranteed to be visible as soon as the lock is acquired).
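
To make that concrete, here's a minimal sketch of the whole invariant wrapped up in a type (the Coordinate class and its member names are my own, hypothetical): readers take the same lock as writers, so no thread can observe a half-written pair.

public class Coordinate
{
    private readonly object latLongLock = new object();
    private double latitude;
    private double longitude;

    public void Set(double newLatitude, double newLongitude)
    {
        lock (latLongLock)
        {
            // both writes happen inside the critical region, so no torn pair is visible
            latitude = newLatitude;
            longitude = newLongitude;
        }
    }

    public void Get(out double currentLatitude, out double currentLongitude)
    {
        lock (latLongLock)
        {
            // reads take the same lock; they see either the old pair or the new pair
            currentLatitude = latitude;
            currentLongitude = longitude;
        }
    }
}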

There are circumstances where loads and stores to different addresses can be reordered in relation to each other (a load can be reordered with older stores to a different memory address).  So, conceptually, given two threads on different cores/CPUs executing the following code at the same time:

x = 1;    |    y = 1;
r1 = y;   |    r2 = x;

This could result in r1 == 0 and r2 == 0 (as described in section 8.2.3.2 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A), assuming access to r1 and r2 was optimized by the compiler into register access.  The only way to avoid this is to force a memory barrier.  The use of volatile, as we've seen in the prior post, is not enough to ensure a memory fence is invoked under all circumstances.  The barrier can be forced manually through the use of Thread.MemoryBarrier, or through the use of lock.  Thread.MemoryBarrier is less well understood by a wide variety of developers, so lock is almost always what should be used prior to any micro-optimizations.  For example:

lock(lockObject)
{
  x = 1;
  r1 = y;
}

and

lock(lockObject)
{
  y = 1;
  r2 = x;
}

This basically assumes x and y are involved in a particular invariant, and that invariant needs to be guaranteed through atomic access to the pair of variables; this is done by creating critical regions of code where only one region can execute at a time across threads.
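
If you did want the manual Thread.MemoryBarrier alternative mentioned above, the full fence goes between each thread's store and its subsequent load; a sketch, reusing x, y, r1 and r2 from the example (this placement is mine, not from the original post):

// thread 1
x = 1;
Thread.MemoryBarrier(); // full fence: the store to x can't be reordered with the following load of y
r1 = y;

// thread 2
y = 1;
Thread.MemoryBarrier(); // full fence: the store to y can't be reordered with the following load of x
r2 = x;

With both fences in place, at least one of r1 and r2 is guaranteed to observe 1.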

Revisiting the volatile keyword

The first post in this series could have come off as suggesting that volatile is always a good thing.  As we've seen above, that's not true.  Let me be clear: using volatile in the way I described previously is an optimization.  It's a micro-optimization that should be used very, very carefully.  What is and isn't an atomic invariant isn't always cut and dried.  Not every operation on an atomic type is an atomic operation.

Let’s look at some of the problems of volatile:

The first, and arguably the most discussed, problem is that volatile decorates a variable, not the uses of that variable.  With non-atomic operations on an atomic variable, volatile can give you a false sense of security: you may think volatile gives you thread-safe code in all accesses to that variable, but it does not.  For example:

private volatile int counter;

private void DoSomething()
{
    //...
    counter++;
    //...
}

Although many processors have a single instruction to increment an integer, "there is no guarantee of atomic read-modify-write, such as in the case of increment or decrement" [1].  Despite counter being volatile, there's no guarantee this operation will be atomic, and thus no guarantee that it will be thread-safe.  In the general case, not every type you can use operator++ on is atomic; looking strictly at "counter++;", you can't tell whether it's thread-safe.  If counter were of type long, access to counter would no longer be atomic, and a single instruction to increment it is only possible on some processors (regardless of the lack of guarantees that such an instruction will be used).  If counter were an atomic type, you'd have to check the declaration of the variable to see whether it was volatile before deciding if it's potentially thread-safe.  To make incrementing a variable thread-safe, the Interlocked class should be used for supported types:

private int counter;

private void DoSomething()
{
    //...
    System.Threading.Interlocked.Increment(ref counter);
    //...
}

Non-atomic types like long and ulong (which are not supported by volatile) are supported by Interlocked.  For non-atomic types not supported by Interlocked, lock is recommended until you've verified that another method is "better" and works:

private decimal counter;

private readonly object lockObject = new object();

private void DoSomething()
{
    //...
    lock(lockObject)
    {
        counter++;
    }
    //...
}



That is, volatile is problematic because it can only be applied to member fields, and only to certain types of member fields.
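
A quick sketch of those restrictions (the field names here are mine):

private volatile int stopRequested;   // OK: int is one of the types volatile supports
// private volatile long counter;     // error CS0677: a volatile field cannot be of the type 'long'
// private volatile decimal total;    // error CS0677: same problem with decimal

private void Method()
{
    // volatile bool localFlag;       // won't compile: volatile can't be applied to locals
}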



The general consensus is this: because volatile doesn't decorate the operations that are potentially performed in a concurrent context, doesn't consistently lead to more efficient code in all circumstances, is circumvented when a volatile field is passed by ref, and fails with non-atomic invariants, volatile operations should be made explicit through the use of Interlocked, Thread.VolatileRead, Thread.VolatileWrite, or lock, and not through the volatile keyword.
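
That style makes the volatility visible at each call site rather than hidden on a field declaration; a minimal sketch (the member names are mine):

private int stopRequested; // deliberately not declared volatile

private void RequestStop()
{
    // the volatile semantics are explicit right where they're needed
    Thread.VolatileWrite(ref stopRequested, 1);
}

private bool IsStopRequested()
{
    return Thread.VolatileRead(ref stopRequested) != 0;
}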



Conclusion



Concurrent and multithreaded programming is not trivial.  It involves dealing with non-sequential operations through the writing of sequential code.  It's prone to error, and you really have to know the intent of your code in order to decide not only what might be used in a concurrent context but also what is thread-safe; i.e. "thread-safe" is application-specific.



Despite .NET 4.5 (i.e. Visual Studio 2012) only really supporting x86/x64 "out of the box", the potential side-effects of assuming an x86/x64 memory model just muddy the waters.  I don't think there is any benefit to writing to an x86/x64 memory model over writing to the .NET memory model.  Nothing I've shown really affects existing guidance on writing thread-safe and concurrent code, some of which is detailed in Visual Studio 2010 Best Practices.



Knowing what's going on at lower levels in any particular situation is good, but anything you do in light of any side-effects should be considered a micro-optimization and be well scrutinized.



[1] C# Language Specification § 5.5 Atomicity of variable references


Thread synchronization of atomic invariants in .NET 4.5 clarifications

In Thread synchronization of atomic invariants in .NET 4.5 I presented my observations of what the compiler does in the very narrow context of Intel x86 and Intel x64 with a particular version of .NET.  You can install SDKs that give you access to compilers for other processors.  For example, if you write something for Windows Phone or Windows Store, you'll get compilers for other processors (e.g. ARM) with memory models looser than x86 and x64.  That post contained only observations in the context of x86 and x64.

I believe more knowledge is always better, but you have to use that knowledge responsibly.  If you know you're only ever going to target x86 or x64 (and you don't know that if you use AnyCPU, even in VS 2012, because some yet-to-be-created processor might be supported in a future version or update of .NET) and you do want to micro-optimize your code, then that post might give you enough knowledge to do that.  Otherwise, take it with a grain of salt.  I'll get into a little more detail in part 2, Thread synchronization of non-atomic invariants in .NET 4.5, at a future date; it will include more specific guidance and recommendations.

In the case where I used a really awkwardly placed lock:

var lockObject = new object();
while (!complete)
{
    lock(lockObject)
    {
        toggle = !toggle;
    }
}

It's important to point out the degree to which this code depends on implicit side-effects.  For one, it assumes that the compiler treats a while loop as the equivalent of a series of sequential statements; e.g. this is effectively equivalent to:

var lockObject = new object();
if (complete) return;
lock (lockObject)
{
    toggle = !toggle;
}
if (complete) return;
lock (lockObject)
{
    toggle = !toggle;
}
//...

That is, there is an implicit volatile read (i.e. a memory fence, from the Monitor.Enter implementation detail) at the start of the lock block, and an implicit volatile write (i.e. a memory fence, from the Monitor.Exit implementation detail) at the end.
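
For reference, a lock block roughly expands to the following (a sketch of the pattern the C# 4+ compiler emits, which is where those implicit fences come from):

bool lockTaken = false;
try
{
    Monitor.Enter(lockObject, ref lockTaken); // acquire: the implicit volatile read/fence
    toggle = !toggle;
}
finally
{
    if (lockTaken)
        Monitor.Exit(lockObject);             // release: the implicit volatile write/fence
}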

In case it wasn't obvious: you should never write code like this; it's simply an example.  As I pointed out in the original post, it's confusing to anyone else reading it: lockObject can't be shared amongst threads, the lock block really isn't protecting toggle, and the code can (and likely will) get "maintained" into something that no longer works.

In the same vein, the same can be said for the original example of this code:

static void Main()
{
  bool complete = false;
  var t = new Thread (() =>
  {
    bool toggle = false;
    while (!complete)
    {
        Thread.MemoryBarrier();
        toggle = !toggle;
    }
  });
  t.Start();
  Thread.Sleep (1000);
  complete = true;
  t.Join();
}

While this code works, it's not readily apparent that the Thread.MemoryBarrier() is there so that our read of complete (and not toggle) isn't optimized into a register read.  The degree to which you can depend on the compiler continuing to do this is up to you.  The code is equally valid, and clearer, if written to use Thread.VolatileRead, except for the fact that Thread.VolatileRead does not support the Boolean type; it has to be re-written using Int32 instead.  For example:

static void Main(string[] args)
{
  int complete = 0;
  var t = new Thread (() =>
  {
    bool toggle = false;
    while (Thread.VolatileRead(ref complete) == 0)
    {
        toggle = !toggle;
    }
  });
  t.Start();
  Thread.Sleep (1000);
  complete = 1; // CORRECTION from 0
  t.Join();
}

This is clearer and shows your intent more explicitly.
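
As an aside (my addition, not part of the original post): .NET 4.5 also introduces the System.Threading.Volatile class, whose Volatile.Read and Volatile.Write overloads do support Boolean, so the flag could stay a bool:

bool complete = false;
var t = new Thread(() =>
{
  bool toggle = false;
  while (!Volatile.Read(ref complete)) // explicit volatile read; bool is supported in .NET 4.5
  {
    toggle = !toggle;
  }
});
t.Start();
Thread.Sleep(1000);
Volatile.Write(ref complete, true);
t.Join();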


Thread synchronization of atomic invariants in .NET 4.5

I’ve written before about multi-threaded programming in .NET (C#).  Spinning up threads and executing code on another thread isn’t really the hard part.  The hard part is synchronization of data between threads.

Most of what I've written about is from a processor-agnostic point of view; it's written from the historical position that .NET supports many processors with varying memory models.  The stance has generally been that you're programming to the .NET memory model and not to a particular processor's memory model.

But that's no longer entirely true.  In 2010 Microsoft basically dropped support for Itanium in both Windows Server and Visual Studio (http://blogs.technet.com/b/windowsserver/archive/2010/04/02/windows-server-2008-r2-to-phase-out-itanium.aspx).  In VS 2012 there is no "Itanium" choice in the project Build options.  As far as I can tell, Windows Server 2008 R2 is the only Windows operating system still in support that supports Itanium, and even Windows Server 2008 R2 for Itanium is not supported for .NET 4.5 (http://msdn.microsoft.com/en-us/library/8z6watww.aspx).

So, what does it mean to really only have the context of running on x86/x64?  Well, if you read the documentation and research the Intel x86 and x64 memory model, this could have an impact on how you write multi-threaded code with regard to shared-data synchronization.  The x86 and x64 memory models include guarantees like "In a multiple-processor system…Writes by a single processor are observed in the same order by all processors." but also include guarantees like "Loads May Be Reordered with Earlier Stores to Different Locations".  What this really means is that a store or a load to a single location won't be reordered with regard to a load or a store to the same location across processors.  That is, we don't need fences to ensure a store to a single memory location is "seen" by all threads, or that a load from memory loads the "most recent" value stored.  But it does mean that in order for multiple stores to multiple locations to be viewed by other threads in the same order, a fence is necessary (or the group of store operations must be invoked as an atomic action through the use of synchronization primitives like Monitor.Enter/Exit, lock, Semaphore, etc.).  (See section 8.2 Memory Ordering of the Intel Software Developer's Manual Volume 3A, found here.)  But that deals with non-atomic invariants, which I'll detail in another post.

To be clear, you could develop to just x86 or just x64 prior to .NET 4.5 and have all the issues I’m about to detail.

Prior to .NET 4.5 you really programmed to the .NET memory model.  This has changed over time since ECMA defined it around .NET 2.0, but that model was meant to be a "supermodel" to deal with the fact that .NET could be deployed to different CPUs with disparate memory models.  Most notable was the Itanium memory model.  This model is much looser than the Intel x86 memory model and allowed things like a store without a release fence and a load without an acquire fence.  This meant that a load or a store might be done only in one CPU's memory cache and wouldn't be flushed to memory until a fence.  It also meant that other CPUs (i.e. other threads) might not see the store, or might not get the "latest" value with a load.  You can explicitly cause release and acquire fences in .NET with things like Monitor.Enter/Exit (lock), the Interlocked methods, Thread.MemoryBarrier, Thread.VolatileRead/VolatileWrite, etc.  So it wasn't a big issue for .NET programmers to write code that would work on an Itanium; for the most part, if you simply guarded all your shared data with a lock, you were fine.  lock is expensive, so you could optimize things with Thread.VolatileRead/VolatileWrite if your shared data was inherently atomic (like a single int, a single Object reference, etc.), or you could use the volatile keyword (in C#).  The conventional wisdom has been to use Thread.VolatileRead/VolatileWrite rather than decorate a field with volatile, because you may not need every access to be volatile and you don't want to take the performance hit when it doesn't need to be volatile.

For example, the following (borrowed from Jeffrey Richter, but slightly modified) shows synchronizing a static member variable with Thread.VolatileRead/VolatileWrite:

public static class Program {
  private static int s_stopworker;
  public static void Main() {
    Console.WriteLine("Main: letting worker run for 5 seconds");
    Thread t = new Thread(Worker);
    t.Start();
    Thread.Sleep(5000);
    Thread.VolatileWrite(ref s_stopworker, 1);
    Console.WriteLine("Main: waiting for worker to stop");
    t.Join();
  }

  public static void Worker(object o) {
    Int32 x = 0;
    while(Thread.VolatileRead(ref s_stopworker) == 0)
    {
      x++;
    }
  }
}

 
Without the call to Thread.VolatileWrite, the processor could reorder the write of 1 to s_stopworker to after the read (assuming we're not developing to one particular processor memory model and we're including Itanium).  In terms of the compiler, without Thread.VolatileRead it could cache the value read from s_stopworker in a register.  For example, removing the Thread.VolatileRead, the compiler optimizes the comparison of s_stopworker to 0 in the while into a single register access (on x86):
 
00000000  push        ebp
00000001  mov         ebp,esp
00000003  mov         eax,dword ptr ds:[00213360h]
00000008  test        eax,eax
0000000a  jne         00000010
0000000c  test        eax,eax
0000000e  je          0000000C
00000010  pop         ebp
00000011  ret

The loop is 0000000c to 0000000e (really just testing that the eax register is 0). Using Thread.VolatileRead, we’d always get a value from a physical memory location:

00000000  push        ebp
00000001  mov         ebp,esp
00000003  lea         ecx,ds:[00193360h]
00000009  call        71070480
0000000e  test        eax,eax
00000010  jne         00000021
00000012  lea         ecx,ds:[00193360h]
00000018  call        71070480
0000001d  test        eax,eax
0000001f  je          00000012
00000021  pop         ebp
00000022  ret

The loop is now 00000012 to 0000001f, which shows Thread.VolatileRead being called each iteration (location 00000018).  But, as we've seen from the Intel documentation and guidance, we don't really need to call VolatileRead; we just don't want the compiler to optimize the memory access away into a register access.  This code works, but we take the hit of calling VolatileRead, which forces a memory fence through a call to Thread.MemoryBarrier after reading the value.  For example, the following code is equivalent:

while(s_stopworker == 0)
{
  Thread.MemoryBarrier();
  x++;
}

This works equally as well as using Thread.VolatileRead, and compiles down to:

00000000  push        ebp
00000001  mov         ebp,esp
00000003  cmp         dword ptr ds:[002A3360h],0
0000000a  jne         0000001A
0000000c  lock or     dword ptr [esp],0
00000011  cmp         dword ptr ds:[002A3360h],0
00000018  je          0000000C
0000001a  pop         ebp
0000001b  ret

The loop is now 0000000c to 00000018.  As we can see, at 0000000c we have an extra "lock or" instruction, which is what the compiler optimizes a call to Thread.MemoryBarrier into.  This instruction really just or's 0 with what esp points to (i.e. it does "nothing": zero or'ed with something else does not change the value), but the lock prefix forces a fence and is less expensive than instructions like mfence.  Based on what we know of the x86/x64 memory model, though, we're only dealing with a single memory location and we don't need that lock prefix: the inherent memory guarantees of the processor mean that our thread can see any and all writes to that memory location without this extra fence.  So, what can we do to get rid of it?  Using volatile actually results in code that doesn't generate that lock or instruction.  For example, if we change our code to make s_stopworker volatile:

public static class Program {
  private static volatile int s_stopworker;
  public static void Main() {
    Console.WriteLine("Main: letting worker run for 5 seconds");
    Thread t = new Thread(Worker);
    t.Start();
    Thread.Sleep(5000);
    s_stopworker = 1;
    Console.WriteLine("Main: waiting for worker to stop");
    t.Join();
  }

  public static void Worker(object o) {
    Int32 x = 0;
    while(s_stopworker == 0)
    {
      x++;
    }
  }
}

We tell the compiler that we don’t want accesses to s_stopworker optimized.  This then compiles down to:

00000000  push        ebp
00000001  mov         ebp,esp
00000003  cmp         dword ptr ds:[00163360h],0
0000000a  jne         00000015
0000000c  cmp         dword ptr ds:[00163360h],0
00000013  je          0000000C
00000015  pop         ebp
00000016  ret

The loop is now 0000000c to 00000013.  Notice that we're simply getting the value from memory on each iteration and comparing it to 0.  There's no lock or: one less instruction and no extra memory fence.  In many cases it doesn't matter (you might only do this once, in which case an extra few milliseconds won't hurt and avoiding it might be a premature optimization), but using lock or, compared with the register optimization, is about 992% slower when measured on my computer (or: volatile is 91% faster than using Thread.MemoryBarrier, and probably a bit faster still than using Thread.VolatileRead).  This is actually contradictory to conventional wisdom with respect to a .NET memory model that supports Itanium.  If you want to support Itanium, every access to a volatile field is tantamount to a Thread.VolatileRead or Thread.VolatileWrite, in which case, yes, in scenarios where you don't really need the field to be volatile, you take a performance hit.
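
If you want to reproduce that kind of measurement, here's a rough sketch of how a comparison could be timed (the iteration count and field names are arbitrary assumptions of mine, this is not the harness used for the numbers above, and it assumes using System.Diagnostics and System.Threading):

private static volatile int s_volatileFlag;
private static int s_plainFlag;

public static void Main()
{
    const int iterations = 100000000;
    int observed = 0;

    var stopwatch = Stopwatch.StartNew();
    for (int i = 0; i < iterations; ++i)
    {
        observed += s_volatileFlag;   // plain volatile read: a memory access, no fence
    }
    Console.WriteLine("volatile read: " + stopwatch.ElapsedMilliseconds + "ms");

    stopwatch.Restart();
    for (int i = 0; i < iterations; ++i)
    {
        observed += s_plainFlag;
        Thread.MemoryBarrier();       // full fence each iteration (the "lock or")
    }
    Console.WriteLine("MemoryBarrier: " + stopwatch.ElapsedMilliseconds + "ms");

    Console.WriteLine(observed);      // keep observed live so the loops aren't eliminated
}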

In .NET 4.5, where Itanium is out of the picture, you might be thinking "volatile all the time, then!".  But hold on a minute; let's look at another example:

 

   1: static void Main()
   2: {
   3:   bool complete = false;
   4:   var t = new Thread (() =>
   5:   {
   6:     bool toggle = false;
   7:     while (!complete)
   8:     {
   9:         Thread.MemoryBarrier();
  10:         toggle = !toggle;
  11:     }
  12:   });
  13:   t.Start();
  14:   Thread.Sleep (1000);
  15:   complete = true;
  16:   t.Join();
  17: }

This code (borrowed from Joe Albahari) will block indefinitely at the call to Thread.Join (line 16) without the call to Thread.MemoryBarrier() (at line 9). 

This code blocks indefinitely without Thread.MemoryBarrier() on both x86 and x64, but this is due to compiler optimizations, not the processor's memory model.  We can see this in the disassembly of what the JIT produces for the thread lambda (this listing is x86; the same optimization happens on x64):

00000000  push        ebp
00000001  mov         ebp,esp
00000003  movzx       eax,byte ptr [ecx+4]
00000007  test        eax,eax
00000009  jne         0000000F
0000000b  test        eax,eax
0000000d  je          0000000B
0000000f  pop         ebp
00000010  ret

Notice the loop (0000000b to 0000000d): the compiler has optimized the read of complete into a register and doesn't update that register from memory, identical to what we saw with the member field above.  If we look at the disassembly (this one from x64) when using MemoryBarrier:

00000000  movzx       eax,byte ptr [rcx+8]
00000004  test        eax,eax
00000006  jne         0000000000000020
00000008  nop         dword ptr [rax+rax+00000000h]
00000010  lock or     dword ptr [rsp],0
00000015  movzx       eax,byte ptr [rcx+8]
00000019  test        eax,eax
0000001b  je          0000000000000010
0000001d  nop         dword ptr [rax]
00000020  rep ret

We see that the loop testing complete (instructions from 00000010 to 0000001b) grabs the value from memory into eax each iteration, then tests eax until it's true (non-zero).  MemoryBarrier has been optimized into "lock or" here as well.

What we're dealing with here is a local variable, so we can't use the volatile keyword.  We could use the lock keyword to get a fence, but it couldn't be around the comparison (the while) because that would enclose the entire while block: we'd never exit the lock to get the memory fence, and the compiler would believe reads of complete aren't guarded by lock's implicit fences.  We'd have to wrap the assignment to toggle to get the acquire fence before and the release fence after, a la:

var lockObject = new object();
while (!complete)
{
    lock(lockObject)
    {
        toggle = !toggle;
    }
}

Clearly this lock block isn't really a critical section, because the lockObject instance can't be shared amongst threads.  Anyone reading this code is likely going to think "WTF?"  But we do get our fences: the compiler will not optimize the read of complete into a register, and our code will no longer block at the call to Thread.Join.  It's apparent that Thread.MemoryBarrier is the better choice in this scenario; it's more readable and doesn't appear to be poorly written code (i.e. code that depends only on side-effects).

But you still take the performance hit of "lock or".  If you want to avoid that, refactor complete from a local variable into a field and decorate it with volatile.
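
That refactoring might look something like this sketch (complete hoisted out of Main into a volatile static field):

private static volatile bool s_complete;

static void Main()
{
  var t = new Thread(() =>
  {
    bool toggle = false;
    while (!s_complete)   // volatile read each iteration; no lock-prefixed fence emitted
    {
      toggle = !toggle;
    }
  });
  t.Start();
  Thread.Sleep(1000);
  s_complete = true;      // volatile write
  t.Join();
}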

Although some of this seems like micro-optimization, it's not: you have to be careful to "synchronize" shared atomic data with respect to compiler optimizations, so you might as well pick the best way that works.

 

In the next post I’ll get into synchronizing non-atomic invariants shared amongst threads.