Creating a thread safe producer consumer queue in C++ without using locks

Yesterday, someone in the newsgroups asked a question about synchronization problems he was having. I explained that –if properly programmed– a consumer producer queue does not need synchronization even on a multiprocessor machine.


A Microsoft employee then asked me to explain exactly how this was done. Ie. To put my money where my mouth is.


The idea


A queue is a chunk of memory that can be divided into equal sized pieces. To simplify this you can think of a queue simply as a traditional 1 dimensional C style array.


You should also note that there is 1 reader, and 1 writer. Each will work in his own thread. This means that reads can happen concurrent with writes, but there will be no multiple concurrent reads or multiple concurent writes. It is possible to provide that level of concurrency without locks (I did, once) but that is significantly more complex, and not the purpose of this example.


To manage the producer and consumer part, we need to have 2 additional variables: a read index and a write index.


The read index is the index of the element right after the element that was last read by the consumer.


The write index is the index of the element right after the element that was last written by the producer.


A writer can always add elements in the queue as long as it can move the write pointer without overtaking the read pointer. An overtake would mean that the queue would overflow.


There are 3 possible cases:


  1. The read and write pointers are equal. This is only possible if the queue is in the initialized state. Any read will result in a fail, any write will succeed.
  2. The read pointer is pointing to the first element after the write pointer. This means that the queue is full. If one more element would be added, the read and write pointers would be the same, and the reader would assume that the queue was empty. Any read will succeed, and write will fail.
  3. The read and write pointers are not equal, and the previous case was not true. Any read will succeed. Any write will succeed.

The implementation


Adding elements


 


      bool PushElement(T &Element)


      {


            int nextElement = (m_Write + 1) % Size;


            if(nextElement != m_Read)


            {


                  m_Data[m_Write] = Element;


                  m_Write = nextElement;


                  return true;


            }


            else


                  return false;


      }


As long as the first element after the write pointer is not the read pointer, it is not occupied. The element can be added and the write pointer can be updated.


Otherwise the queue is full and the write operation will fail.


Removing elements


 


      bool PopElement(T &Element)


      {


            if(m_Read == m_Write)


                  return false;


 


            int nextElement = (m_Read + 1) % Size;


            Element = m_Data[m_Read];


            m_Read = nextElement;


            return true;


      }


If the read pointer is equal to the write pointer, there is nothing to read. In any other case we can read the first element and update the read pointer.


Why this is safe


The advantage of this scheme is that no locking is needed. Each of the pointers (indices really) is updated only by one thread.


This means there is no write concurrency as far as the pointers are concerned.


Furthermore – and this is important – the read and write pointers are updated only after the element concerned is read or written.


This means that if the read or write pointer changes, the elements are guaranteed to reflect that situation. The gap between the element change and the pointer change is invisible to the other threads.


So logically, this code is perfectly safe. However, one final issue must be highlighted, and that is the declaration of the pointers and the data array.


      volatile int m_Read;


      volatile int m_Write;


      static const int Size = 10;


      volatile T m_Data[Size];


If we take no precautions, the optimizer can reorder the actual memory operations on the read and write pointer.


Or it could assume that variables do not change between calls and use cached values instead of what is actually in memory. This could be especially problematic on multi CPU machines where each CPU has its own cache.


See the documentation for ‘volatile’ in MSDN for a more detailed explanation.


How this could be unsafe


There is one thing that is critical in this code, and that is that memory updates are not re-ordered. You can easily understand that if the write pointer gets updated before the actual data element is updated, there is a gap that creates a race condition.


Therefore it is critical that the platform this code runs on (CPU, chipset, memory controller) does not re-order the memory write operations. Otherwise you’d have this code that is logically thread-safe, but thread-unsafe due to CPU peculiarities. You’d have all sorts of impossible to debug problems.


Extending functionality


The design as it is works reasonably well, but here are a couple of tips to make this into powerful code:


  • Foresee an array of read pointers instead of using only one. That way you could have a broadcast queue in which each consumer has the opportunity to receive elements. You have to check each of the read pointers before writing to make sure that there is room enough in the queue.
  • Allow the constructor to take a variable sized piece of memory that was previously constructed. Make the read and write pointers the first items in that memory. This allows you to put a queue on shared memory and communicate with other processes.

Conclusion


Making a multi thread safe consumer producer queue without synchronization is perfectly possible. If you want to play with the code, it is attached to this article and licensed under the MIT license.


I whipped this code together in a very short time for the purpose of this demo. It should be bug free, but if you think there is something specifically wrong, let me know and I’ll see if there is a problem with it.

30 thoughts on “Creating a thread safe producer consumer queue in C++ without using locks”

  1. VC2005 inserts appropriate memory barriers for volatile accesses when you use a CPU that requires them. The code should be safe for all CPUs with VC2005. This is also true for Intel C++ (although volatile’s behaviour is controllable by a compiler switch there, so care is required).

    Another possible enhancement would be to drop the use of volatile, and instead insert the appropriate memory barrier operations directly (the old “volatile is neither necessary nor sufficient in general” adage). This is more explicit, and can be more efficient on some architectures, since you are explicitly specifying the exact barriers you require, which gives the optimizer more freedom.

    Another enhancement you could add would be blocking read and write methods, though that’s trickier, since you need a fast-pathed event object to avoid kernel calls where possible.

  2. i don’t get it. ” Each of the pointers (indices really) is updated only by one thread.”. Why? can you please explain a bit more?

  3. Hi Roy,

    The idea of a producer consumer queue is that there is one producer shoving items in on one end, and 1 consumer pulling out items on the other end.
    Formally this is known a a first-in first-out queue mechanism (FIFO).

    Only the producer updates the write pointer, and only the consumer updates the read pointer.
    Because of this there is no reason to synchronize access to the pointers.

    Since the update is done AFTER the data access, there is no risk of overwriting items before they are read, or reading an item before it’s completely filled in.

  4. I just want to know is it possible to
    Create a thread safe producer consumer CircularQueue in C++ without using locks …

  5. > Kumar said:
    >
    > I just want to know is it possible to
    > Create a thread safe producer consumer
    > CircularQueue in C++ without using locks …

    Yes, but the code above won’t generally work unless.
    1. The m_Read and m_Write updates are atomic
    2. PushElement: Has a memory barrier around the data copy and m_Write update
    3. PopElement: Has a memory barrier around the m_Write access in the if() clause and the data read.

  6. OK this is royally late, but to G: this will work in a multithreaded situation without any problems. EDIT: Apparently I misunderstood G. If there is more than 1 reader or 1 writer, then this code will not work as intended.

  7. I’m not sure about what Biff said, (#2,3)

    Why do we need a memory barrier around data copy and m_write for instance?

    There won’t be any reordering for those in a single thread.
    Thus we don’t need to worry about those here(in 2 threads) either.

  8. I got a question about the multithreaded case.

    >OK this is royally late, but to G: this will work in a >multithreaded situation without any problems.

    bool PopElement(T &Element)

    {

    if(m_Read == m_Write)

    return false;

    **** what if some other thread changed the value of m_Read at this point?

    int nextElement = (m_Read + 1) % Size;

    Element = m_Data[m_Read];

    m_Read = nextElement;

    return true;

    }

  9. Have fun crashing just by looking at those two lines you should obviously see that this cannot be threadsafe:

               int nextElement = (m_Write + 1) % Size;

    EDIT: Yes this is threadsafe. Size does not change over the lifetime of the object, and m_Write is only written to by the writer

               if(nextElement != m_Read)

    Unless your CPU has an atomic +1 % Size compare instruction

    It will do something like

    1. Reg = 1

    2. Reg = Reg + m_Write

    3. Reg = Reg % Size

    4. Compare Reg, m_Read

    If another thread now updates m_Write during step 3 and 4 you have a serious Problem.

    EDIT: Since the queue has only 1 writer, there is no problem. With multiple writers your point is valid, but this queue is meant for 1 writer hread and 1 reader thread.

    Same goes for the position counter updates when you want multiple writers or readers you’d need to use Interlocked functions here or a Mutex.

    EDIT: True, but again this class is only meant for single reader and single writer scenarios

    The minimal needs to get a single reader/ single writer queue  threadsafe are 2 Semaphores to handle the full/empty checks.

    EDIT: No, it is not. 32 bit reads and writes are atomic. The writer checks before each update to see if there is at least one free element. It does not matter if the reader reads more or not, because that will only make more items available.

    Similarly, If the reader checks if there is at least 1 element in the queue, the write pointer may be updated or not it doesn’t matter. Only 1 element will be read or written at a time, so if the write pointer changes, that doesn’t matter.

    For adding multiple multiple readers and writers you’ve to make the m_Read/m_Write increments atomic. For example use InterlockedIncrement or place a Mutex around the lines.

    This problem has extensivly studied btw:

    en.wikipedia.org/…/Producers-consumers_problem

    EDIT: I just read the article, and the only reason semaphores are needed is because the reader and writer share a wakeup mechanism. The buffering itself does not need semaphores or protection in the single reader single writer scenario.

    My class does not have or need a wakeup mechanism, so they don’t need semaphores either.

    For safe wakeup you need a semaphore indeed (which is why my production code has them for exactly this feature) but the queueing mechanism itself needs no protection. Which you could have figured out for yourself, had you read the article more closely

    To proof invadility of your code you could do this btw:

    int nextElement = (m_Read + 1) % Size;

    Sleep(10 + rand() % 10);

    Element = m_Data[m_Read];

    EDIT: Yes I know, but ONLY if there are multiple readers! The m_Read is only updated in PopElement, and only 1 thread is ever calling that Method, so no matter how long you sleep in there, it will never go wrong.

    Now make 1 writer thread that writes in 1 2 3 4 5 6 etc

    And make 10 reader threads you’ll for sure get duplicate numbers and some missing numbers in the output.

    Yes indeed. That’s why it’s described as a class for a single reader and single writer. Which I also explain under the heading ‘Why this is safe’

     

  10. You need to change your reply to G. G was already asking what if there are multiple readers or multiple writers (and you obviously misunderstood his question). That’s why both Allen and noobo critiqued your code (and you never clarified in your original article that it’s a single-producer single-consumer que).

  11. I just read back, and I didn’t explicitly say that it was single reader / single writer.

    I mention ‘the reader’ and ‘the writer’, and that it is safe because only 1 thread will do either reading or writing so it was implied.

    But not obvious enough unfortunately.

  12. That is correct. The queue needs to contain more than 1 element. Otherwise the system of index calculations does not work.

    Every queue that I implemented using this class was expected to contain a known maximum amount of items.
    This number was calculated by looking at its purpose, and then putting a sensible number on it, multiplied by 10 or 100 to be safe.

    Most queues only contained short items (32 -128 bytes) so the overhead was minimal.

    Even if that was insufficient, that error condition was handled and properly diagnosed so that it could be changed.

    There are 3 reasons we didn’t use dynamic allocation:
    – performance. This was the memory only needs coyping, not allocating which could fragment memory, and would also cost time
    – reliability. Once the queues exist, no error conditions can occurr anymore (no failed allocations).
    – constraints. Most of those queues were layed out in a special block of memory on a PCI card, which was mapped into the address space. That memory was replicated via an optical interface so that all computers saw the same data. Since the location of anything had to be known by everyone, dynamic allocation would make things much more complex, error prone and less performant.

  13. Very useful code, thanks. I was had exactly the same requirement when I came across it.
    One note: the actual FIFO depth (the effective buffering achievable) is Size – 1, not Size.

  14. This implementation does not work for one reason, and is not useful for much more than a thought experiment for another reason. First, it doesn’t work because of the following race condition in PushElement:

    1 int nextElement = (m_Write + 1) % Size;
    2 if(nextElement != m_Read)
    3 {
    4 m_Data[m_Write] = Element;
    5 m_Write = nextElement;
    6 return true;
    7 }

    If any producer thread gets to line 2 while any other producer thread is executing 2-5, both will have the same m_Write index, and an item will be lost in the queue.

    This is a fairly obvious issue, and also highly likely given how much time other threads have to race in and screw things up. This is a major problem.

    However, that aside, the second major problem is more related to the intended functionality. This queue provides no way for a reader thread to idle on an empty queue. Therefore, readers are doomed to either poll the queue in tight loops when it is empty (not ideal if your CPUs are busy doing other things) or they must insert arbitrarily long pauses when the queue is empty (not ideal if latency matters). This problem makes this implementation not useful in most real scenarios, in addition to not being thread-safe.

    JC

  15. PS, a similar and equally obvious race condition exists in pop:

    1 int nextElement = (m_Read + 1) % Size;
    2 Element = m_Data[m_Read];
    3 m_Read = nextElement;

    It should be easily identifiable. If any consumer thread gets to 2 while any other consumer thread is executing 1-3 then both consumer threads use the same value of m_Read, thus returning the same item twice.

    This implementation is far from thread-safe, and not for any subtle reasons having to do with memory barriers, optimizations, etc. It’s just plain broken. Locking around the lines I mentioned to make the operations atomic would provide the necessary safety.

    A second option on Windows is to modify the read/write indices using InterlockedIncrement(), which does atomic increment and returns of an integer (search for InterlockedIncrement on msdn.microsoft.com, there are many other Interlocked* functions as well), instead of attempting to do it over a few non-atomic statements and crossing your fingers that another thread doesn’t run at the same time.

    JC

  16. You are absolutely right, but

    1) I already explained that this queue is supposed to work with 1 reader and 1 writer. And yes, I know that if that isn’t true, it won’t be safe.
    2) With multiple readers / writers you can still make it threadsafe without locks, but then every reader needs its own read pointer, and each write pointer needs its own write pointer.
    3) You can add a semaphore or event object to signal once for each element that gets written in the queue and solve the problem you mentioned

  17. This code should work the same in Java as C++, or any other language. There is one thing that you need to check into, and that is automatic optimization.

    I use the keyword ‘volatile’ to make sure that the compiler doesn’t move the reads / writes to the indices.
    I don’t know the equivalent in java. You’ll have to figure that out yourself, or you could disable optimization for that specific class so that the compiler doesn’t do something silly.

  18. Another way the code could be unsafe, that you don’t mention in your article, is if a thread is interrupted during the update of the read or write pointer. Sure, you claim in another comment that “32 bit reads and writes are atomic” but, of course, that statement does not generally hold true.

  19. My comment was in the context of that reply.

    If you look at the code, you see that I specify the indices to be int. And int is defined to be the natural integer size on a given platform. On a 16 bit platform it is 16 bits. On a 32 bit platform it is 32 bits, and on a 64 bit platform it can be 32 or 64, depending on architecture.

    But in each of those cases, int will be written atomically because it is the natural integer size, and it is one of the few certainties you can have when programming on any platform.

    so even if you could find a platform that does not follow this rule (don’t think it exists) you can hardly claim that my statement is ‘generally not true’

  20. Int is also defined to be >= 16 bits, so they are seldom atomic on 8-bit platforms. And the answer is yes, C++ is used there too… :)

    There is no guarantee what-so-ever that int is always atomic. Even on 32-bit CPUs the memory subsystem may be limited to 16-bit, requiring two bus cycles for int access, possibly breaking atomicity. Unaligned accesses may also be non-atomic.

    Ok, on x86 with MSVC++ it may be a valid assumption, I don’t know. But the vast majority of computer systems in the world does not belong to that specific category, hence the ‘not generally true’. For sure, the C++ standard doesn’t give any such guarantees, and code like this that depend on implementation specific details should come with warning labels like the one you already have for hardware reordering.

    C++0x will address the concurrency issues, probably because this type of synchronization just cannot be done portably and safely in plain C++. (From what I understand, C++0x will also make your method undefined behavior?)

  21. Let me put it a different way then. On a platform that is at least 16 bit, and with an architecture containing the same ‘bittiness’ :) throughout the entire platform, this should be safe.

    I should probably have qualified this up front.
    It’s just that -when talking about computers with a keyboard and monitor- I don’t consider 8 bit controllers, or whacky 16 bit architectures with an 8 bit memory bus, or things like that

    It’s true that such things exist, it’s just not something I consider the norm. Probably because I don’t deal with those.

  22. Hi, so I wonder, where would we see 8bits nowadays? I can imagine some embedded applications or mobile phones or …? And what is the solution for those? Just locks?

    I was thinking you could actually use this pattern and still get to the multiple readers/writers by simply implementing joins and splits:

    Multiple writers:

    Writer thread A writes to QueueA
    Writer thread B writes to QueueB
    Join thread J reads QueueA and QueueB and writes to QueueX
    Reader thread X reads QueueX

    Multiple readers:
    Writer thread X writes to QueueX
    Split thread S reads QueueX and writes to QueueA and QueueB
    Reader thread A reads QueueA
    Reader thread B reads QueueB

    You keep the 1 to 1 model….

  23. Exactly. This is how the system core of the application I mentioned operates.

    The system core acts as a sort of telephone exchange. the clients send messages to the core, and indicate in the header which recipient the message should go to.

    This does have the overhead of having to copy the message a number of times, but when working with short messages (128 bytes I think) it is negligable compared to the context switching that goes on to activate / deactivate a process or thread.

    But even on a single CPU P4 I could a throughput of 40000 messages going from one process to another, being processed and acknowledged.
    On a hyper threaded dual Xeon CPU system, this number ran into the millions.

    technically, it is also possible to make a no-lock multi reader single writer.
    In that case, each reader and has their own index. The message array remains the same.
    An element is only marked ‘free’ if it was read by all readers. The reader indices may changes even if the writer is looking at it, but this is not an issue because the reader will only change the index AFTER it has read. Meaning, that worst case, the writer sees no more room, even if there is. But an element will never be overwritten.

  24. I worked at Matrox 20 years ago and we had a similar producer/consumer fifo for interface between a graphic card and the PC. There is one assumption (for intel platform) that you do not mention: the integer must be aligned on a 32 bit boundary. if you do not do it then the read or write will not be atomic and you will have problems between the producer and the consumer thread.

    if you force the indexes to be in memory locations not divisible by 4, something like:
    #pragma pack(1)
    struct fifoInfo
    {
    char something;
    int readIndex;
    int writeIndex;
    volatile T m_Data[Size]
    };
    the read & write index will not be updated atomicly. unless the latest intel architecture has change tremendously.

  25. Aieee….

    Yes you are right. I know this is true. The default compiler settings of gcc and vc make this a non issue in normal circumstances, but you are right that changing the default packing size can cause serious problems if the integers are not naturally alligned.

    I should have though to mention it. I’ll see if I can update my article. Thanks.

  26. Hi Bruno,

    thank you a lot for this useful article.

    I’d like to ask you for an expert advice :-) since I’m not sure about 1 thing. I was considering why elements in the queue need to be volatile.

    Proper instruction ordering is still ensured even if queue is not volatile: since m_Read and m_Write are volatile, and they have proper Acquire/Release semantics – the compiler will not reorder memory instructions and so flag is always set after the queue has really been updated.

    So it appears to me that the only important reason why the queue is volatile is when there are multiple processors, each one having its own cache. That way reads/writes to the queue are immediately reflected in the shared memory, am I right ?

    I’m asking because type T in my case is a class which is not under my control, and compiler insists on having operator= for that class defined as volatile.

    And one more question: If I’m right with the above presumption, then even if I do use locks to synchronize access to the queue, it still needs to be volatile to work in multi-processor enviroments. Is this correct… ?

    I thank you in advance for your reply,

    S. Angelovic

  27. Hi,

    To have atomic manipulation on basic types like integers (readIndex and writeIndex ) you can use (only under linux with the pthread library) the __sync_fetch_and_add() function to increment it for example.
    I guess you have to align your basic type.
    Theses atomic functions are made to be used with basic types, not structures.

    Nicolas.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>