ZoneOS home Inside ZONEOS-1 workstation



Interlocked Instructions

What lies beyond the InterlockedIncrement() ?

On This Page

Overview
Simple Race Condition
Effect of volatile Keyword
Cache Coherence
Memory Access Semantics


Overview

Most often all is known about interlocked instructions is that all interlocked operations are performed atomically. Is there something more they could give for programmer? According to MSDN, simple reads and writes to 32-bit values are atomic for correctly aligned variables. Are 32-bit interlocked instructions are redundant because of this? Not, they are essential part of instruction set. First, interlocked instructions combine more than one simple action. Consider InterlockedIncrement() operation: it performs both incrementing and querying of resulting value, using single instruction. Replacing it with several usual instructions will not guarantee atomicity (indivisibility) of overall operation: for example thread could be interrupted by preemptive multitasking when only one part of operation is completed, and another thread could spoil values used by first thread (race condition).


Simple Race Condition

Consider function which is called simultaneously by several threads in our program. There is nothing useful it does, but it was specially designed to be a good example of critical section (part of program which must not be executed by two or more threads at a time, or race condition will occur and program will behave differently than it was expected). In this example, program crashes because of a race condition, but in real situations there could be more silent errors (delayed crash because of a memory corruption, or just wrong results of performed calculations), which are much more difficult to find.


VOID
DoCriticalSection(
)
/*++

Routine Description:

    Code inside a critical section (could crash the program if accessed by
    more than one thread at a time). Variable 'String' is shared by all
    threads (marked as static variable), thus any thread could change
    value of 'String' observed by another thread. Variables is also marked
    as volatile to prevent values to be cached by compiler into CPU
    registers.

Arguments:

    VOID

Return Value:

    VOID

--*/
{
    static volatile PCTSTR String;
    static LONG_PTR Length;

    //
    // Null-pointer exception if called without synchronization by several
    // threads: first thread which leave section sets 'String' to NULL.
    // Second thread, just before leaving section, calculates length of
    // string pointed to by 'String', but 'String' is already NULL and
    // thread crashes with null-pointer exception.
    //

    String = _T("Hello World !");
    Sleep( 1000 );
    Length = _tcslen( String );
    String = NULL;
}

As shown in example, program could crash if operation designed as atomic operation is not actually atomic: there is always a time gap after 'String' is set to valid value, and before 'String' is set to NULL. Inserted Sleep( 1000 ) just underlines this fact. Without this Sleep( 1000 ) probability of crash is much smaller, but it is surely not zero. Generally, it is worth to say: "If inserted Sleep( 5000 ) breaks your code because of a race condition, then your code contains error".


Effect of volatile Keyword

To describe in short what volatile keyword means, consider two blocks of a program, where second block is the same as first but with volatile keyword. Gray text between lines of C code means i386/AMD64 assembler compiled from this code.


{
    BOOL flag = TRUE;

    while( flag );
repeat:
    jmp        repeat
}

{
    volatile BOOL flag = TRUE;
    mov        dword ptr [flag], 1

    while( flag );
repeat:
    mov        eax, dword ptr [flag]
    test       eax, eax
    jne        repeat
}

In first block variable 'flag' could be cached by compiler into a CPU register, because it has not volatile qualifier. Because no one will change value at a register, program will hang in an infinite loop (yes, all code below this block is unreachable code, and compiler such as Microsoft Visual C++ knows about it). Also this loop was optimized in equivalent program with the same infinite loop, but without involving variable initialization and fetching. For those who don't know i386 assembler, 'jmp label' means the same as 'goto label' in C code.

Second block have volatile qualifier and have more complex assembler output (initializing 'flag' with 'mov' instruction, in a loop fetching this flag into CPU register 'eax' with a 'mov' instruction, comparing fetched value with zero with 'test' instruction, and returning to the beginning of the loop if 'flag' was not equal to zero. 'jne' means 'goto if not equal'). This is all because volatile keyword prohibits compiler to cache variable value into CPU register, and it is fetched in all loop iterations. Such code is not always is an infinite loop, because another thread in the same program potentially could change value of variable 'flag' and first thread will exit the loop.

It is important to understand that volatile keyword is just a directive for compiler and it works only at a compile-time. For example, the fact of using interlocked operation differs from just a compiler option, since special assembler commands are produced. Thus, interlocked instructions are most like to hardware directives, and they work at a run-time.


Cache Coherence

To decrease time required for memory access different caches are used: recently accessed memory duplicated in CPU cache which is significantly faster than common memory. Future access to the same address will use data saved in cache, decreasing fetch time. The problems appear in SMP (symmetric multiprocessing) systems, where there are several processors have own cache memory: when one processor changes variable in memory region, used by several processors simultaneously, it actually changes own copy of a variable, located in cache, while shared variable still has original value. This problem could not be solved by using volatile keyword on a shared variable, since this will only guarantee that write-to-memory instruction will present in resulting program, but operations related to cache are still not specified. Of course, it is possible to disable CPU cache, mapping memory as no-cache (PAGE_NOCACHE protection flag in VirtualAlloc() Win32 API function), but along with significant slowdown this imposes some limitations: for example, interlocked instructions may raise hardware exception on no-cache memory.

For correct work of SMP systems data which is stored in cache of more than one processor should be the same in all caches. This means that CPU caches must be synchronized (kept coherent) at a hardware level. But it is important to note that cache synchronization (flow of cache coherency traffic) is made asynchronously with program execution: when one CPU changes value of a shared variable another CPU temporarily observes old value. This means CPU continues execution without waiting for cache coherency operation to be completed. Furthermore, if two variables (a then b) were changed by first CPU, another CPU could observe that b has changed earlier than a.

Interlocked instructions have considerable differences on this point. Exactly, interlocked instruction is a command to make something directly on a physical memory under a locked bus. This means that caches inconsistency does not affects programs where shared variables are accessed with only interlocked instructions (note that both processes, that who reads a variable and that writes to it, should use interlocked instructions).


Memory Access Semantics

Sometimes meaning is put into the fact that one operation is performed earlier or later than another operation. In this case it is said that operation has acquire or release semantics. If results of an operation are always observed by another processes before results of any subsequent operation, then operation has acquire semantics. When results of preceding operations are observed prior to result of an operation itself, operation has release semantics.

Tricky question: why using shared variables (using without interlocked functions) under critical section locks (such as Enter/LeaveCriticalSection() ) is always safe, and another process entered the section the latter will always see correct values previously set in the same section by first process, and will see them in correct order? The thing is that critical section locks use interlocked instructions when entering and leaving the section, and all standard interlocked operations (like InterlockedIncrement() ) have both release and acquire semantics. This means that in case shared variable was changed only in CPU cache, execution of a standard interlocked instruction makes this variable (and all cache memory of current CPU also) to be committed into physical memory before executing interlocked instruction itself, so that another processes will observe changes from interlocked operation after observing changes in another shared variables (this is release semantics). Consider following example:


a++;
InterlockedIncrement( &b );
c++;

Another processes will always see changes in a prior to changes in b, and will always see changes in c after changes in b.

As a more realistic example of required use of interlocked instructions one could mention Peterson's algorithm for two threads of critical section lock (preventing two threads from simultaneous entering into critical section). In this algorithm several shared variables are used, and they must be accessed with interlocked instructions, or otherwise algorithm will not work on SMP machines because of problems with cache coherency.

Itanium processor family (IPF) also supports separate instructions for maintaining different access semantics. For example, along with InterlockedIncrement() with both semantics there are two functions InterlockedIncrementAcquire() and InterlockedIncrementRelease() which uses only acquire or only release semantics correspondingly.







Copyright © 2006-2010 Vasily Tarasov.