Barret Rhoden

1. Overview

1.1 What are they?

UCQs are a tool to send messages, with payloads, from the kernel to a process
through shared memory.  The depth of the queue is unlimited, barring running
out of memory.  They should be used in closed-loop or low-rate scenarios that
require a payload (or for which building a system that can handle missing a
message would be unwieldy).

The old BCQs were designed for a shared memory segment, such as procdata.
Since the kernel is sending the messages, we can use a more customized
solution.  The problems in this area are running out of memory, having the
kernel use user-provided pointers, and synchronizing within the kernel (like a
spinlock for a TAILQ).

The basic plan for these "unbounded concurrent queues" (UCQs) is to have
linked mmap'd pages of arrays of event messages (the BCQ is a circular array
of event messages, roughly).  As the kernel over-produces, it mmaps more pages
and links them together (via a header at the beginning of the page).
Userspace munmaps when it is done with a page.  To avoid excessive
mmap/munmap, we double-buffer with two pages, one current and one spare.  We
only need to mmap new ones when the kernel gets far ahead of the user.

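A minimal sketch of the page layout this implies.  The struct names, the
payload size, and the header fields are illustrative placeholders, not the
actual Akaros definitions:

```c
#include <stdint.h>
#include <stddef.h>

#define PGSIZE 4096

/* One message slot; 'ready' is toggled by the producer once the payload is
 * fully written.  The payload is a stand-in, not the real event message. */
struct msg_container {
	uint64_t payload[4];
	uint8_t ready;
};

/* Header at the beginning of each mmap'd page.  The kernel links pages by
 * filling in next_page; consumers bump nr_consumed as they finish slots. */
struct ucq_page_header {
	uintptr_t next_page;
	uint32_t nr_consumed;
};

struct ucq_page {
	struct ucq_page_header header;
	struct msg_container msgs[];	/* fills the rest of the page */
};

/* How many messages fit on one page after the header. */
static inline size_t msgs_per_page(void)
{
	return (PGSIZE - sizeof(struct ucq_page_header))
	       / sizeof(struct msg_container);
}
```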
- When we run out of room, the kernel will implicitly mmap another page,
  solving the memory allocation issue.  The worst thing userspace can do is
leak its memory, or cause an OOM condition, which we have to deal with anyway.

- Using user-pointers isn't conceptually a problem any more, so long as the
  kernel reads the pointer in and verifies the address is in userspace (and
that it can handle page faults).  This wasn't the case a couple years ago when
I made the BCQs.  We are still careful about pointers - we only use them when
messing with which page is current, etc., and not when atomically adding
items.

- Swapping pages/buffers requires locking, but we can't put the lock in the
  UCQ structure, since userspace could muck with it.  Instead, we lock at the
process level.  And instead of grabbing the process lock, we'll grab a hash
lock (a hash table of locks, hashed on the pointer of the UCQ).  This will
happen every ~170 events or so.  Synchronization for normal ops (not buffer
swaps) is done with atomics.

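The hash lock idea can be sketched as a fixed table of locks indexed by a
hash of the UCQ's address, so nothing lock-related lives in user-writable
memory.  This is an illustrative userspace model using C11 atomics; the
kernel uses its own spinlocks and hash function:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define HASHLOCK_SZ 64

/* Table of spinlocks; a UCQ's lock is picked by hashing its address. */
static atomic_flag hashlocks[HASHLOCK_SZ];

static inline size_t hashlock_idx(const void *ucq)
{
	/* shift off alignment bits, then mod into the table */
	return ((uintptr_t)ucq >> 4) % HASHLOCK_SZ;
}

static void ucq_hash_lock(const void *ucq)
{
	while (atomic_flag_test_and_set(&hashlocks[hashlock_idx(ucq)]))
		;	/* spin until the holder finishes fixing things up */
}

static void ucq_hash_unlock(const void *ucq)
{
	atomic_flag_clear(&hashlocks[hashlock_idx(ucq)]);
}
```

Two UCQs can hash to the same lock; that only costs occasional extra
contention, which is fine since the lock is taken rarely (every ~170 events).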
Another option, instead of allocating more memory, would be to have the kernel
block kthreads until the queue empties.  I really dislike that for a couple
reasons.  It's easier to handle running out of memory than spawning too many
kthreads, especially in critical parts of the code (where we can't sleep on
memory allocation).  More importantly, it's a real pain to block on a user's
actions.  For instance, I'm not about to put a semaphore in user-writable
memory.

1.2 Why do we have them?

Enqueuing messages in BCQs could fail due to a misbehaving process, an unlucky
circumstance, or, most commonly, a full queue.  Event queues, which used BCQs
as a building block, would handle the failure as an 'overflow' of events and
signal the process with a bit.  This means the program needs to know how to
handle overflow, which becomes painful.

A specific case was syscall completion events.  When a syscall was done, the
kernel would send a message that a particular syscall was done.  Under normal
circumstances, userspace needed to know exactly which one was done.  With
BCQs, the 2LS needed to know how to handle overflow, which means it needed to
track every outstanding syscall so that it could poll to see which call
completed.  To do this in a scalable manner required each vcore to have its
own TAILQ of outstanding uthreads (blocked on syscalls).  The problem with
this is that blocked syscalls are now tied to vcores, and that vcore might not
want to yield even if it has no work to do (since it needs to receive its
message).

Ultimately, the issue is that complicated systems could benefit from not
needing to handle overflow.  For certain event queue usages, the unbounded
nature of UCQs makes the system far simpler.  When we build other systems in
userspace, such as using ev_qs for syscalls and uthread blocking, we can
leverage the UCQs.

Note that the main downfall of UCQs is that a process can die if the kernel
runs out of memory while trying to send it messages.  If the kernel sends
messages faster than the process can handle them (for whatever reason,
including being swapped out), eventually we'll run out of memory.  UCQs need
to be used carefully, such that the process isn't requesting an *unbounded*
amount of messages at one time.

The main benefit of UCQs in this situation is that they can handle spikes of
messages and they remove the need to handle overflow.  Using BCQs, we'd need
to handle overflow even if it was unlikely, and the impact of this extended
beyond actually handling overflow.  Check out the old overflow handling code
(and maybe some notes in the Documentation) for details about how we needed to
do work for every syscall, in the off chance we had to handle overflow.

2. How do they work?

2.1 Producer (kernel)

Producers atomically fight for slots in prod_idx, which encodes both the page
and the msg number within the page.  If the slot is good, we just write our
message.  If it isn't, things get interesting.  We need to synchronize, so we
lock externally to the UCQ.  This is where the hash lock comes in.  A bad slot
is one in which there is no page, or the message slot is beyond the end of the
array of msgs in the page.

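Since pages are page-aligned, the low 12 bits of a page address are free to
hold the slot number, which is what lets a single fetch-and-add claim a slot
and a single load read both fields.  A sketch of the encoding (the helper
names are made up; the capacity is assumed from the ~170 figure above):

```c
#include <stdint.h>
#include <stdbool.h>

#define PGSIZE 4096
#define NR_MSG_PER_PAGE 170	/* assumed, per the ~170 figure above */

/* prod_idx packs a page-aligned pointer with the next slot number in the
 * low 12 bits, so one atomic op covers both. */
static inline uintptr_t make_idx(uintptr_t page, unsigned slot)
{
	return page | slot;
}

static inline uintptr_t idx_page(uintptr_t idx)
{
	return idx & ~(uintptr_t)(PGSIZE - 1);
}

static inline unsigned idx_slot(uintptr_t idx)
{
	return idx & (PGSIZE - 1);
}

/* A bad slot: no page at all, or the slot number ran off the msg array. */
static inline bool slot_is_good(uintptr_t idx)
{
	return idx_page(idx) != 0 && idx_slot(idx) < NR_MSG_PER_PAGE;
}
```

Note that with ~170 slots in 4096 low-bit values, roughly 3900 further
fetch-and-adds would carry into the page bits, which is the overflow window
discussed below.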
Whoever grabs the lock first will be responsible for getting a new page and
resetting the counter (which tracks which slot is next).  All future lockers
can tell that the work is done by examining the counter and trying to get a
slot.  If they got a good one, everything is okay and it is just like they got
a good slot in the first place (with the exception of a lock/unlock pair).

One minor problem is that with sufficient producers and a delayed fixer (who
holds the lock), we could overflow the prod_idx (around 3900 more
fetch-and-adds would carry the slot number into the page part of the index).
To avoid this, the first producer to detect the slot/index is no good will set
an overflow flag.  All future producers will check this flag before attempting
to get a slot, and if we're overflowing, they will jump to the "grab lock"
phase.  This limits the window of vulnerability.  In theory, you could have
3900 producers at exactly the same time all fetch-and-add before any of the
others have a chance to set the overflow flag.  Not only does this require
3900 cores, but they all have to be writing to the exact same UCQ.  If this
happens, we have a warning that will go off and I'll fix it.

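A single-threaded model of the producer's slot grab, with the overflow flag
checked before the fetch-and-add.  The helper names are invented, the index
is reduced to a plain slot counter, and the locked page swap is elided to a
simple reset:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_MSG_PER_PAGE 170	/* assumed capacity of one page */

static atomic_uintptr_t prod_idx;	/* slot number only, for simplicity */
static atomic_bool prod_overflow;

static bool slot_is_good(uintptr_t slot)
{
	return slot < NR_MSG_PER_PAGE;
}

/* Claim a slot; on a bad slot, raise the overflow flag so later producers
 * skip the fetch-and-add, then do the (elided) locked fix-up. */
static uintptr_t get_slot(void)
{
	if (!atomic_load(&prod_overflow)) {
		uintptr_t my_slot = atomic_fetch_add(&prod_idx, 1);
		if (slot_is_good(my_slot))
			return my_slot;
		/* first to see a bad slot warns everyone else off */
		atomic_store(&prod_overflow, true);
	}
	/* "grab lock" phase: with the hash lock held (elided), install a
	 * fresh page, set the counter to 1 to reserve slot 0 for ourselves,
	 * and clear the overflow flag. */
	atomic_store(&prod_idx, 1);
	atomic_store(&prod_overflow, false);
	return 0;
}
```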
So the last part to deal with is getting a new page and linking it with the
old one.  We simply atomic_swap on the spare_pg.  If there is a spare already
mmapped, we use it.  If there isn't, we need to mmap a new page.  Either way,
we point the old page's header at the new page.  We set the index to a good
value, then unlock (and clear the overflow flag).

When we set the counter, we set it to 1 instead of 0, thereby reserving slot 0
for ourselves.  This prevents a DoS from the program.  The user could muck
with the prod_idx endlessly, in a way that seems benign.  To prevent this, we
make sure that any time we grab the hashlock, we get a good slot.  Whenever we
fail to get a slot we lock, and whenever we lock we get a good slot.

Side note: another related DoS involves using CAS.  You can't CAS with
userspace memory unless you're willing to fail.  Otherwise, you could be
DoS'd.

When we have a slot, we write in the ev_msg, and then toggle the 'ready' flag
in the message container.  This is in case the consumer is trying to
concurrently read our message while we are writing.

And finally, every time the kernel reads or writes to something, we need to
verify the address (which is user-supplied).  Eventually, the kernel can
handle a page fault on a user address, but we can't tolerate KVAs in UCQs.
Note that old versions allowed UCQs/BCQs to be at KVAs (procdata), but that
behavior will probably go away soon (error prone, we now handle user pointers
anyway, and we usually load 'current' when dealing with a proc).

2.2 Consumer

Consumers CAS (compare-and-swap) to claim a slot.  We need to use CAS, since
we don't want to advance unless the queue is not empty.  If they detect the
slot is bad, they fight among themselves for a lock, which is embedded in the
UCQ, and the winner sets the counter to track the next page, frees extra
pages, etc.

Emptiness is defined as prod_idx == cons_idx.  Both the producer's and the
consumer's *next* slot is the same, so there are no items to be consumed.  If
the prod_idx is bad, the consumer needs to wait until there is a good next
page (the kernel is in the process of finding and filling a good slot).  If
the cons_idx is bad, it means we need to go to the next page (lock first).

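The consumer's claim can be sketched as a CAS loop that only advances
cons_idx when it differs from prod_idx.  Indices are reduced to plain slot
counters here, and the names are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static atomic_uintptr_t prod_idx;	/* next slot the kernel will fill */
static atomic_uintptr_t cons_idx;	/* next slot a consumer will take */

/* Claim the next slot, or return false if the queue is empty.  CAS (rather
 * than fetch-and-add) ensures cons_idx never advances past prod_idx. */
static bool claim_slot(uintptr_t *slot)
{
	uintptr_t my_idx = atomic_load(&cons_idx);

	for (;;) {
		if (my_idx == atomic_load(&prod_idx))
			return false;	/* next cons slot == next prod slot */
		if (atomic_compare_exchange_strong(&cons_idx, &my_idx,
						   my_idx + 1)) {
			*slot = my_idx;
			return true;
		}
		/* lost the race; my_idx was reloaded by the CAS, try again */
	}
}
```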
When locking, the consumer does the same trick as the producer: try to get a
slot again, in case someone else got the lock before you and fixed things up.

So once we have the next page (it has been posted by the kernel), we can set
up the cons_idx so consumers can grab slots.  We don't need to reserve a slot
for ourselves (no DoS risk).

After setting up the counter, our next job is to *free* the old page of
consumed items.  The trick here is that there may be consumers still using it.
We statically know how many items there are on the page, so we have all
consumers increment another counter when they are done with the slot.  When
that counter is the max, we know everyone is done with the page and it can be
*freed*.  That counter is like an inverted reference count.  I don't care when
it is 0, I care when it is the max.  Alternatively, we could have initialized
it to MAX and decremented, but this way felt more natural.  When the page is
done, we don't actually free it.  We'll atomic_swap it with the spare_pg,
making it the spare.  If there already was a spare, then we have too many
pages and need to munmap the old spare.

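The inverted reference count might look like this: each consumer bumps the
page's counter when done with its slot, and whoever brings it to the max
knows the page can be retired (swapped in as the spare, or munmapped if a
spare already exists).  Names and the per-page capacity are assumptions:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_MSG_PER_PAGE 170	/* assumed number of slots per page */

struct ucq_page_header {
	atomic_uint nr_consumed;	/* counts finished slots, up to the max */
};

/* Called once per consumed slot; returns true for exactly one caller, the
 * one who finishes the page and must retire it. */
static bool put_slot(struct ucq_page_header *h)
{
	return atomic_fetch_add(&h->nr_consumed, 1) + 1 == NR_MSG_PER_PAGE;
}
```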
So we finally get our good slot, and we spin until the kernel has loaded it
with a message.  Then we just copy it out, increment the 'number consumed'
counter, and we're done.

3. Misc Other Notes:

I think most of these things were discussed inline.  There are a few random
things worth mentioning, some of which are obvious in hindsight:
- The kernel can't CAS and also not fail, o/w it can be DoSed.

- Without CAS, the kernel can't atomically change the page and the index
  unless they are the same long / modifiable in the same atomic statement.
This is why the page and the counter/slot number are put together in prod_idx.

- Kernel writers/producers need to stop/delay while another producer fixes
  things up.  The only acceptable place to delay/spin is on the proc's
hashlock.  Any other scheme will probably fail, even if all you want to do is
swap the spare page and reset the counter.  It doesn't work for the same
reason we have the current page and the index in one memory location.

- The prod_idx and cons_idx are both protected by the lock and atomically
  incremented.  The lock protects who resets them, with the understanding that
it should only be done when everyone agrees the counter is bad (too large),
but the normal atomics happen lock-free.

- Userspace should mmap a huge chunk of memory and then pass in page addresses
  to the ucq_init() function.  I made the init function so that this would be
really easy.