Memory Barriers
Barret Rhoden

1. Overview
2. General Rules
3. Use in the Code Base
4. Memory Barriers and Locking
5. Other Stuff
1. Overview

Memory barriers exist to make sure the compiler and the CPU do what we intend.
The compiler memory barrier (cmb()) (called an optimization barrier in Linux)
prevents the compiler from reordering operations.  However, CPUs can also
reorder reads and writes, in an architecture-dependent manner.  In most places
with shared memory synchronization, you'll need some form of memory barriers.

These barriers apply to 'unrelated' reads and writes.  The compiler or the CPU
cannot detect any relationship between them, so it believes it is safe to
reorder them.  The problem arises in that we attach some meaning to them,
often in the form of signalling other cores.

CPU memory barriers only apply when talking to different cores or hardware
devices.  They do not matter when dealing with your own core (perhaps between
a uthread and vcore context, running on the same core).  cmb()s still matter,
even when the synchronizing code runs on the same core.  See Section 3 for
more details.
2. General Rules

2.1: Types of Memory Barriers

For CPU memory barriers, we have 5 types:
- rmb():  no reordering of reads with future reads
- wmb():  no reordering of writes with future writes
- wrmb(): no reordering of writes with future reads
- rwmb(): no reordering of reads with future writes
- mb():   no reordering of reads or writes with future reads or writes

All 5 of these have a cmb() built in (a "memory" clobber).

Part of the reason for the distinction between wrmb/rwmb and the full mb is
that on many machines (x86), rwmb() is a noop (for us).
These barriers are used on 'normal' reads and writes, and they do not include
streaming/SSE instructions and string ops (on x86), and they do not include
talking with hardware devices.  For memory barriers for those types of
operations, use the _f variety (force), e.g. rmb() -> rmb_f().
2.2: Atomics

Most atomic operations, such as atomic_inc(), provide some form of memory
barrier protection.  Specifically, all read-modify-write (RMW) atomic ops act
as a CPU memory barrier (like an mb()), but do *not* provide a cmb().  They
only provide a cmb() on the variables they apply to (i.e., variables in the
clobber list).

I considered making all atomics clobber "memory" (like the cmb()), but
sync_fetch_and_add() and friends do not do this by default, and it also means
that any use of atomics (even when we don't require a cmb()) then provides a
cmb(), needlessly restricting the compiler.
Also note that not all atomic operations are RMW.  atomic_set(), _init(), and
_read() do not enforce a memory barrier in the CPU.  If in doubt, look for the
LOCK in the assembly (except for xchg, which is locking even without the LOCK
prefix).  We're taking advantage of the LOCK the atomics provide to serialize
and synchronize our memory.

In a lot of places, even if we have an atomic I'll still put in the expected
mb (e.g., a rmb()), especially if it clarifies the code.  When I rely on the
atomic's LOCK, I'll make a note of it (such as in spinlock code).
Finally, note that the atomic RMWs handle the mb_f()s just like they handle
the regular memory barriers.  The LOCK prefix does quite a bit.

These rules are a bit x86 specific, so for now other architectures will need
to implement their atomics such that they provide the same effects.

2.3: Locking

If you access shared memory variables only when inside locks, then you do not
need to worry about memory barriers.  The details are sorted out in the
locking code.
3. Use in the Code Base

Figuring out how / when / if to use memory barriers is relatively easy.
- First, notice when you are using shared memory to synchronize.  Any time
  you are using a global variable or working on structs that someone else
  might look at concurrently, and you aren't using locks or another vetted
  tool (like krefs), then you need to worry about this.
- Second, determine what reads and writes you are doing.
- Third, determine who you are talking to.

If you're talking to other cores or devices, you need CPU mbs.  If not, a cmb
suffices.  Based on the types of reads and writes you are doing, just pick one
of the 5 memory barriers.
3.1: What's a Read or Write?

When writing code that synchronizes with other threads via shared memory, we
have a variety of patterns.  Most infamous is the "signal, then check if the
receiver is still listening", which is the critical part of the "check,
signal, check again" pattern.  For examples, look at things like
'notif_pending' and when we check VC_CAN_RCV_MSG in event.c.
In these examples, "write" and "read" include things such as posting events or
checking flags (which ultimately involve writes and reads).  You need to be
aware that some functions, such as TAILQ_REMOVE, are writes, even though they
are not written as *x = 3;.  Whatever your code is, you should be able to
point out what the critical variables and their interleavings are.  You should
see how a CPU reordering would break your algorithm just like a poorly timed
interrupt or other concurrent interleaving.
When looking at a function that we consider a signal/write, you need to know
what it does.  It might handle its memory barriers internally (protecting its
own memory operations).  However, it might not.  In general, I err on the side
of extra mbs, or at the very least leave a comment about what I am doing and
what sort of barriers I desire from the code.
3.2: Who Are We Talking To?

CPU memory barriers are necessary when synchronizing/talking with remote cores
or devices, but not when "talking" with your own core.  For instance, if you
issue two writes, then read them, you will see both writes (reads may not be
reordered with older writes to the same location on a single processor, and
your reads get served out of the write buffer).  Note, the read can
pass/happen before the write, but the CPU gives you the correct value that the
write gave you (store-buffer forwarding).  Other cores may not see both of
them due to reordering magic.  Put another way, since those remote processors
didn't do the initial writing, the rule about reading the same location
doesn't apply to them.
Given this, when finding spots in the code that may require a mb(), I think
about signalling a concurrent viewer on a different core.  A classic example
is when we signal to say "process an item".  Say we were on one core and
filled the struct out and then signalled; if we then started reading from that
same core, we would see our old write (you see writes you previously wrote),
but someone else on another core may see the signal before the filled-out
struct.
There is still a distinction between the compiler reordering code and the
processor reordering code.  Imagine the case of "filling a struct, then
posting the pointer to the struct".  If the compiler reorders, the pointer may
be posted before the struct is filled, and an interrupt may come in.  The
interrupt handler may look at that structure and see invalid data.  This has
nothing to do with the CPU reordering - simply the compiler messed with you.
Note this only matters if we care about a concurrent interleaving (interrupt
handler with a kthread for instance), and we ought to be aware that there is
some shared state being mucked with.
For a more complicated example, consider DONT_MIGRATE and reading vcoreid.
Logically, we want to make sure the vcoreid read doesn't come before the flag
write (since uthread code now thinks it is on a particular vcore).  This
write, then read would normally require a wrmb(), but is that overkill?
Clearly, we need the compiler to issue the writes in order, so we need a cmb()
at least.  Here's the code that the processor will get:
        orl    $0x1,0x254(%edi)      (write DONT_MIGRATE)
        mov    $0xfffffff0,%ebp      (getting ready with the TLS)
        mov    %gs:0x0(%ebp),%esi    (reading the vcoreid from TLS)
Do we need a wrmb() here?  Any remote code might see the write after the
vcoreid read, but the read is a bit different than normal ones.  We aren't
reading global memory, and we aren't trying to synchronize with another core.
All that matters is that if the thread running this code saw the vcoreid read,
then whoever reads the flag sees the write.

The 'whoever' is not concurrently running code - it is 2LS code that either
runs on the vcore due to an IPI/notification, or it is 2LS code running
remotely that responded to a preemption of that vcore.  Both of these cases
require an IPI.  AFAIK, interrupts serialize memory operations, so whatever
writes were issued before the interrupt hit memory (or cache) before we even
do things like write out the trapframe of the thread.  If this isn't true,
then the synchronization we do when writing out the trapframe (before allowing
a remote core to try and recover the preempted uthread) will handle the
DONT_MIGRATE write.
Anyway, the point is that remote code will look at it, but only when told to
look.  That "telling" is the write, which happens after the
synchronizing/serializing events of the IPI path.
4. Memory Barriers and Locking

The implementation of locks requires memory barriers (both compiler and CPU).
Regular users of locks do not need to worry about this.  Lock implementers do.

We need to consider reorderings of reads and writes from before and after the
lock/unlock write.  In these next sections, the reads and writes I talk about
are from a given thread/CPU.  Protected reads/writes are those that happen
while the lock is held.  When I say you need a wmb(), you could get by with a
cmb() and an atomic-RMW op: just so long as you have the cmb() and the
appropriate CPU mb() at a minimum.
4.1: Locking

- Don't care about our reads or writes before the lock happening after the
  lock write.
- Don't want protected writes slipping out before the lock write; need a wmb()
  after the lock write.
- Don't want protected reads slipping out before the lock write; need a wrmb()
  after the lock write.
4.2: Unlocking

- Don't want protected writes slipping out after the unlock, so we need a
  wmb() before the unlock write.
- Don't want protected reads slipping out after the unlock, so we need a
  rwmb() before the unlock write.  Arguably, there is some causality that
  makes this less meaningful (what did you do with the info?  if not a write
  that was protected, then who cares?).
- Don't care about reads or writes after the unlock happening earlier than the
  unlock write.
5. Other Stuff

Linux has a lot of work on memory barriers, far more advanced than our stuff.
Some of it doesn't make any sense.  I've asked our architects about things
like read_barrier_depends() and a few other things.  They also support some
non-Intel x86 clones that need a wmb_f() in place of a wmb() (they support
out-of-order writes).  If this pops up, we'll deal with it.

I chatted with Andrew a bit, and it turns out the following needs a barrier
on P2 under the Alpha's memory model:
        (global) int x = 0, *p = 0;

        P1:
        x = 3;
        FENCE
        p = &x;

        P2:
        while (p == NULL) ;
        assert(*p == 3);
As far as we can figure, you'd need some sort of 'value-speculating' hardware
to make this an issue in practice.  For now, we'll mark these spots in the
code if we see them, but I'm not overly concerned about it.

Also note that none of these barriers deal with things like page table walks,
page/segmentation update writes, non-temporal hints on writes, etc.  glhf with
that, future self!