memory_barriers.txt
Barret Rhoden

1. Overview
2. General Rules
3. Use in the Code Base
4. Memory Barriers and Locking
5. Other Stuff

1. Overview
====================
Memory barriers exist to make sure the compiler and the CPU do what we
intend.  The compiler memory barrier (cmb()) (called an optimization barrier
in linux) prevents the compiler from reordering operations.  However, CPUs
can also reorder reads and writes, in an architecture-dependent manner.  In
most places with shared memory synchronization, you'll need some form of
memory barriers.

These barriers apply to 'unrelated' reads and writes.  The compiler or the
CPU cannot detect any relationship between them, so it believes it is safe to
reorder them.  The problem arises when we attach some meaning to them, often
in the form of signalling other cores.

CPU memory barriers only apply when talking to different cores or hardware
devices.  They do not matter when dealing with your own core (perhaps between
a uthread and vcore context, running on the same core).  cmb()s still matter,
even when the synchronizing code runs on the same core.  See Section 3 for
more details.

2. General Rules
====================
2.1: Types of Memory Barriers
---------------------
For CPU memory barriers, we have 5 types:
- rmb()  no reordering of reads with future reads
- wmb()  no reordering of writes with future writes
- wrmb() no reordering of writes with future reads
- rwmb() no reordering of reads with future writes
- mb()   no reordering of reads or writes with future reads or writes

All 5 of these have a cmb() built in (a "memory" clobber).

Part of the reason for the distinction between wrmb/rwmb and the full mb is
that on many machines (x86), rwmb() is a noop (for us).

These barriers are used on 'normal' reads and writes, and they do not include
streaming/SSE instructions and string ops (on x86), and they do not include
talking with hardware devices.  For memory barriers for those types of
operations, use the _f variety (force), e.g. rmb() -> rmb_f().

2.2: Atomics
---------------------
Most atomic operations, such as atomic_inc(), provide some form of memory
barrier protection.  Specifically, all read-modify-write (RMW) atomic ops act
as a CPU memory barrier (like an mb()), but do *not* provide a full cmb().
They only provide a cmb() on the variables they apply to (i.e., variables in
the clobber list).

I considered making all atomics clobber "memory" (like the cmb()), but
sync_fetch_and_add() and friends do not do this by default, and it also means
that any use of atomics (even when we don't require a cmb()) then provides a
cmb().

Also note that not all atomic operations are RMW.  atomic_set(), _init(), and
_read() do not enforce a memory barrier in the CPU.  If in doubt, look for
the LOCK in the assembly (except for xchg, which is a locking function).
We're taking advantage of the LOCK the atomics provide to serialize and
synchronize our memory.

In a lot of places, even if we have an atomic I'll still put in the expected
mb (e.g., a rmb()), especially if it clarifies the code.  When I rely on the
atomic's LOCK, I'll make a note of it (such as in spinlock code).

Finally, note that the atomic RMWs handle the mb_f()s just like they handle
the regular memory barriers.  The LOCK prefix does quite a bit.  These rules
are a bit x86-specific, so for now other architectures will need to implement
their atomics such that they provide the same effects.
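As a quick illustration of the wmb()/rmb() pair, here's a minimal sketch of
one core publishing a struct for another core to consume.  The struct, flag,
and function names are made up for the example; it assumes the barrier macros
above and something like cpu_relax() for the busy wait.

	struct work_item {
		int arg;
		void (*func)(int arg);
	};

	struct work_item work;
	volatile int work_ready = 0;

	/* Core 0: fill out the item, then signal. */
	void post_work(void (*f)(int), int arg)
	{
		work.func = f;
		work.arg = arg;
		wmb();	/* 'work' writes are visible before the flag write */
		work_ready = 1;
	}

	/* Core 1: wait for the signal, then consume the item. */
	void handle_work(void)
	{
		while (!work_ready)
			cpu_relax();
		rmb();	/* don't read 'work' until after we see the flag */
		work.func(work.arg);
	}

Without the wmb(), core 1 could see work_ready == 1 while the fields of
'work' are still stale; without the rmb(), core 1 could read the fields
before its read of the flag.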
2.3: Locking
---------------------
If you access shared memory variables only when inside locks, then you do not
need to worry about memory barriers.  The details are sorted out in the
locking code.

3. Use in the Code Base
====================
Figuring out how / when / if to use memory barriers is relatively easy.
- First, notice when you are using shared memory to synchronize.  Any time
  you are using a global variable or working on structs that someone else
  might look at concurrently, and you aren't using locks or another vetted
  tool (like krefs), then you need to worry about this.
- Second, determine what reads and writes you are doing.
- Third, determine who you are talking to.  If you're talking to other cores
  or devices, you need CPU mbs.  If not, a cmb suffices.

Based on the types of reads and writes you are doing, just pick one of the 5
memory barriers.

3.1: What's a Read or Write?
---------------------
When writing code that synchronizes with other threads via shared memory, we
have a variety of patterns.  Most infamous is the "signal, then check if the
receiver is still listening", which is the critical part of the "check,
signal, check again" pattern.  For examples, look at things like
'notif_pending' and when we check VC_CAN_RCV_MSG in event.c.

In these examples, "write" and "read" include things such as posting events
or checking flags (which ultimately involve writes and reads).  You need to
be aware that some functions, such as TAILQ_REMOVE, are writes, even though
they are not written as *x = 3;.  Whatever your code is, you should be able
to point out the critical variables and their interleavings.  You should see
how a CPU reordering would break your algorithm just like a poorly timed
interrupt or other concurrent interleaving.

When looking at a function that we consider a signal/write, you need to know
what it does.  It might handle its memory barriers internally (protecting its
own memory operations).  However, it might not.  In general, I err on the
side of extra mbs, or at the very least leave a comment about what I am doing
and what sort of barriers I desire from the code.

3.2: Who Are We Talking To?
---------------------
CPU memory barriers are necessary when synchronizing/talking with remote
cores or devices, but not when "talking" with your own core.  For instance,
if you issue two writes, then read them, you will see both writes (reads may
not be reordered with older writes to the same location on a single
processor, and your reads get served out of the write buffer).  Note, the
read can pass/happen before the write, but the CPU gives you the correct
value that the write gave you (store-buffer forwarding).  Other cores may not
see both of them due to reordering magic.  Put another way, since those
remote processors didn't do the initial writing, the rule about reading the
same location doesn't apply to them.

Given this, when finding spots in the code that may require a mb(), I think
about signalling a concurrent viewer on a different core.  A classic example
is when we signal to say "process an item".  Say we were on one core, filled
out the struct, and then signalled.  If we then started reading from that
same core, we would see our own write (you see writes you previously wrote),
but someone else on another core may see the signal before the filled-out
struct.
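To make the "signal, then check" pattern from 3.1 concrete, here's a sketch
of the write-then-read case.  This is not the actual event.c code; the field
names and the fallback function are made up for illustration.

	struct receiver {
		bool msg_posted;	/* think notif_pending */
		bool listening;		/* think VC_CAN_RCV_MSG */
	};

	void signal_receiver(struct receiver *r)
	{
		r->msg_posted = TRUE;		/* the signal (a write) */
		wrmb();	/* the check must not pass the signal */
		if (!r->listening)		/* the check (a read) */
			use_fallback_path(r);	/* hypothetical fallback */
	}

If the read of 'listening' were reordered above the write of 'msg_posted', we
could see a stale "still listening" and skip the fallback, while the receiver
stopped listening before our signal ever became visible: the message would be
lost.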
There is still a distinction between the compiler reordering code and the
processor reordering code.  Imagine the case of "filling a struct, then
posting the pointer to the struct".  If the compiler reorders, the pointer
may be posted before the struct is filled, and an interrupt may come in.  The
interrupt handler may look at that structure and see invalid data.  This has
nothing to do with the CPU reordering - simply the compiler messed with you.
Note this only matters if we care about a concurrent interleaving (interrupt
handler with a kthread for instance), and we ought to be aware that there is
some shared state being mucked with.

For a more complicated example, consider DONT_MIGRATE and reading vcoreid.
Logically, we want to make sure the vcoreid read doesn't come before the flag
write (since uthread code now thinks it is on a particular vcore).  This
write, then read would normally require a wrmb(), but is that overkill?
Clearly, we need the compiler to issue the writes in order, so we need a
cmb() at least.  Here's the code that the processor will get:

	orl    $0x1,0x254(%edi)      (write DONT_MIGRATE)
	mov    $0xfffffff0,%ebp      (getting ready with the TLS)
	mov    %gs:0x0(%ebp),%esi    (reading the vcoreid from TLS)

Do we need a wrmb() here?  Any remote code might see the write after the
vcoreid read, but the read is a bit different than normal ones.  We aren't
reading global memory, and we aren't trying to synchronize with another core.
All that matters is that if the thread running this code saw the vcoreid
read, then whoever reads the flag sees the write.

The 'whoever' is not concurrently running code - it is 2LS code that either
runs on the vcore due to an IPI/notification, or it is 2LS code running
remotely that responded to a preemption of that vcore.  Both of these cases
require an IPI.  AFAIK, interrupts serialize memory operations, so whatever
writes were issued before the interrupt hit memory (or cache) before we even
do things like write out the trapframe of the thread.  If this isn't true,
then the synchronization we do when writing out the trapframe (before
allowing a remote core to try and recover the preempted uthread) will handle
the DONT_MIGRATE write.  Anyway, the point is that remote code will look at
it, but only when told to look.  That "telling" is the write, which happens
after the synchronizing/serializing events of the IPI path.

4. Memory Barriers and Locking
====================
The implementation of locks requires memory barriers (both compiler and CPU).
Regular users of locks do not need to worry about this.  Lock implementers
do.

We need to consider reorderings of reads and writes from before and after the
lock/unlock write.  In these next sections, the reads and writes I talk about
are from a given thread/CPU.  Protected reads/writes are those that happen
while the lock is held.  When I say you need a wmb(), you could get by with a
cmb() and an atomic-RMW op: just so long as you have the cmb() and the
appropriate CPU mb() at a minimum.

4.1: Locking
---------------------
- Don't care about our reads or writes before the lock happening after the
  lock write.
- Don't want protected writes slipping out before the lock write, need a
  wmb() after the lock write.
- Don't want protected reads slipping out before the lock write, need a
  wrmb() after the lock write.

4.2: Unlocking
---------------------
- Don't want protected writes slipping out after the unlock, so we need a
  wmb() before the unlock write.
- Don't want protected reads slipping out after the unlock, so we need a
  rwmb() before the unlock write.  Arguably, there is some causality that
  makes this less meaningful (what did you do with the info?  if not a write
  that was protected, then who cares?).
- Don't care about reads or writes after the unlock happening earlier than
  the unlock write.
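Here's a sketch of where those rules land in a toy spinlock.  This is not our
actual spinlock code; it assumes the barrier macros above, GCC's
__sync_lock_test_and_set() builtin, and cpu_relax().  On x86, the LOCKed RMW
in the acquire path already acts as a full mb(), so per 2.2 the explicit
barriers after it reduce to the cmb() requirement.

	void toy_spin_lock(volatile int *lock)
	{
		/* the lock write, a LOCKed RMW on x86 */
		while (__sync_lock_test_and_set(lock, 1))
			cpu_relax();
		wmb();	/* 4.1: protected writes can't slip above the lock */
		wrmb();	/* 4.1: protected reads can't slip above the lock */
	}

	void toy_spin_unlock(volatile int *lock)
	{
		wmb();	/* 4.2: protected writes can't slip below the unlock */
		rwmb();	/* 4.2: protected reads can't slip below the unlock */
		*lock = 0;	/* the unlock write, a plain store */
	}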
5. Other Stuff
====================
Linux has a lot of work on memory barriers, far more advanced than our stuff.
Some of it doesn't make any sense.  I've asked our architects about things
like read_barrier_depends() and a few other things.  They also support some
non-Intel x86 clones that need wmb_f() in place of a wmb() (they support
out-of-order writes).  If this pops up, we'll deal with it.

I chatted with Andrew a bit, and it turns out the following needs a barrier
on P2 under the Alpha's memory model (a C rendering appears at the end of
this file):

	(global) int x = 0, *p = 0;

	P1:
		x = 3;
		FENCE
		p = &x;

	P2:
		while (p == NULL)
			;
		assert(*p == 3);

As far as we can figure, you'd need some sort of 'value-speculating' hardware
to make this an issue in practice.  For now, we'll mark these spots in the
code if we see them, but I'm not overly concerned about it.

Also note that none of these barriers deal with things like page table walks,
page/segmentation update writes, non-temporal hints on writes, etc.  glhf
with that, future self!
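For reference, here's a sketch of that Alpha case in C.
read_barrier_depends() is Linux's name for the data-dependency barrier; we
don't have one, so treat it as hypothetical here (it would be a noop on
everything but Alpha).

	int x = 0;
	int *volatile p = NULL;	/* volatile: don't hoist the load of p */

	void p1(void)
	{
		x = 3;
		wmb();			/* the FENCE above */
		p = &x;
	}

	void p2(void)
	{
		int *local;

		while ((local = p) == NULL)
			cpu_relax();
		read_barrier_depends();	/* hypothetical; Alpha-only */
		assert(*local == 3);
	}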