+5. current_tf
+===========================
+current_tf is a per-core macro that returns a struct trapframe * pointing to
+the user context (saved on the kernel stack) that was running on the given
+core when an interrupt or trap happened. Saving the reference to the TF helps
+simplify code that needs to do something with the TF (like save it and pop
+another TF). This way, we don't need to pass the context all over the place,
+especially through code that might not care.
+
+current_tf should go along with current. It's the current_tf of the current
+process. Without 'current', it has no meaning.
+
+It does not point to kernel trapframes, which is important when we receive an
+interrupt in the kernel. At one point, we were (hypothetically) clobbering the
+reference to the user trapframe and were unable to recover. We can get away
+with not tracking kernel contexts because the kernel always returns to its
+previous context from a nested handler (via iret on x86).
+
+In the future, we may need to save kernel contexts and may not always return
+via iret. At that point, if the code path is deep enough that we don't want
+to carry the TF pointer, we may revisit this. Until then, current_tf is just
+for userspace contexts, and is simply stored in per_cpu_info.
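+
+As a rough sketch of how this fits together (the struct layout and field
+names below are assumptions for illustration, not the actual definitions):
+
+    /* Hypothetical sketch: current_tf as a per-core field.  The names
+     * cur_tf, cur_proc, and the per_cpu_info layout are assumptions. */
+    struct per_cpu_info {
+        struct proc *cur_proc;          /* 'current' */
+        struct trapframe *cur_tf;       /* user context, 0 if none */
+        /* ... other per-core state ... */
+    };
+    extern struct per_cpu_info per_cpu_info[];
+
+    #define current_tf (per_cpu_info[core_id()].cur_tf)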
+
+6. Locking!
+===========================
+6.1: proc_lock
+---------------------------
+Currently, all locking is done on the proc_lock. Its main goal is to protect
+the vcore mapping (vcore->pcore and vice versa). As of Apr 2010, it's also used
+to protect changes to the address space and the refcnt. Eventually the refcnt
+will be handled with atomics, and the address space will have its own MM lock.
+
+We grab the proc_lock all over the place, but we try to avoid it wherever
+possible - especially in kernel messages or other places that will be executed
+in parallel. One place where we grab it but would like not to is proc_yield().
+We don't always need to grab the proc lock. Here are some examples:
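+
+For reference, a typical locked path looks something like the sketch below
+(map_vcore and the procinfo fields are assumptions about the interface):
+
+    /* Sketch: protecting a vcore<->pcore mapping change with the
+     * proc_lock.  Names are illustrative, not the real code. */
+    void map_vcore(struct proc *p, uint32_t vcoreid, uint32_t pcoreid)
+    {
+        spin_lock(&p->proc_lock);
+        p->procinfo->vcoremap[vcoreid].pcoreid = pcoreid;
+        p->procinfo->pcoremap[pcoreid].vcoreid = vcoreid;
+        spin_unlock(&p->proc_lock);
+    }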
+
+6.1.1: Lockless Notifications:
+-------------
+We don't lock when sending a notification. We want the proc_lock to not be an
+irqsave lock (discussed below), and since we might want to send a notification
+from interrupt context, we can't grab a regular (non-irqsave) lock there.
+
+This is okay, since the proc_lock is only protecting the vcore mapping. The
+worst case is that we accidentally send the notification to the wrong pcore.
+The __notif handler checks to make sure it is the right process, and all _M
+processes should be able to handle spurious notifications. This assumes they
+are still _M.
+
+If we send it to the wrong pcore, there is a danger of losing the notif, since
+it didn't go to the correct vcore. That could happen anyway (e.g., if the
+vcore is unmapped or in the process of mapping). The notif_pending flag will
+be caught when the vcore next starts up, since that flag was set before we
+read the vcoremap.
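+
+A sketch of the lockless send and the guard on the receive side (the flag
+location, send_kernel_message(), and __notify() are assumptions here):
+
+    /* Sketch: send a notification without the proc_lock.  The vcoremap
+     * read may be stale, so the notif can land on the wrong pcore. */
+    void send_notif(struct proc *p, uint32_t vcoreid)
+    {
+        /* set the flag first, so a vcore starting up will see it */
+        p->procdata->vcore_preempt_data[vcoreid].notif_pending = TRUE;
+        wmb();  /* flag must be globally visible before the map read */
+        uint32_t pcoreid = p->procinfo->vcoremap[vcoreid].pcoreid;
+        send_kernel_message(pcoreid, __notify, (long)p, 0, 0,
+                            KMSG_IMMEDIATE);
+    }
+
+    /* Receive side: tolerate a stale mapping (spurious notif). */
+    void __notify(uint32_t srcid, long a0, long a1, long a2)
+    {
+        struct proc *p = (struct proc*)a0;
+        if (p != current)       /* wrong process: drop it */
+            return;
+        /* ... deliver the notification to the local vcore ... */
+    }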
+
+6.1.2: Local get_vcoreid():
+-------------
+It's not necessary to lock while checking the vcoremap if you are checking for
+the core you are running on (e.g. pcoreid == core_id()). This is because all
+unmappings of a vcore are done on the receive side of a routine kmsg, and that
+code cannot run concurrently with the code you are running.
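+
+Roughly (assuming a pcoremap indexed by physical core id):
+
+    /* Sketch: lockless vcoreid lookup, safe only for our own core,
+     * since the unmapping kmsg runs on this core and can't run while
+     * we are running. */
+    uint32_t get_vcoreid(struct proc *p)
+    {
+        return p->procinfo->pcoremap[core_id()].vcoreid;
+    }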
+
+6.2: irqsave
+---------------------------
+The proc_lock used to be an irqsave lock (meaning it disables interrupts and
+can be grabbed from interrupt context). We made it a regular lock for a couple
+of reasons. The immediate one was that it was causing deadlocks due to some
+other ghetto things (blocking on the frontend server, for instance). More
+generally, we don't want to disable interrupts for long periods of time, so it
+was worth doing anyway.
+
+This means that we cannot grab the proc_lock from interrupt context. This
+includes having schedule() called from an interrupt handler (like the
+timer_interrupt() handler), since it will call proc_run(). Right now, we
+actually do this, which we shouldn't, and that will eventually get fixed. The
+right answer is that the actual work of running the scheduler should be a
+routine kmsg, similar to how Linux sets a bit in the kernel that it checks on
+the way out to see if it should run the scheduler or not.
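+
+That fix would look roughly like the sketch below (resched_pending and the
+hook points are assumptions, mirroring Linux's need_resched bit):
+
+    /* Sketch: defer the real scheduling work out of IRQ context. */
+    bool resched_pending;       /* per-core flag */
+
+    void timer_interrupt(struct trapframe *tf)
+    {
+        resched_pending = TRUE; /* do NOT call schedule() here */
+    }
+
+    /* Checked on the way out of the kernel (or in a routine kmsg),
+     * where it is safe to grab the proc_lock. */
+    void check_resched(void)
+    {
+        if (resched_pending) {
+            resched_pending = FALSE;
+            schedule();
+        }
+    }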
+
+7. TLB Coherency
+===========================
+When changing or removing memory mappings, we need to do some form of a TLB
+shootdown. Normally, this will require sending an IPI (immediate kmsg) to
+every vcore of a process to flush the affected page from its TLB. Before
+allocating that page back out, we need to make sure that every TLB has been
+flushed.
+
+One reason to use a kmsg over a simple handler is that we often want to pass a
+virtual address to flush for those architectures (like x86) that can
+invalidate a specific page. Ideally, we'd use a broadcast kmsg (doesn't exist
+yet), though we already have simple broadcast IPIs.
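+
+In outline, the shootdown might look like this (the kmsg arguments and the
+__tlb_flush_page handler are assumptions):
+
+    /* Sketch: IPI every mapped vcore to invalidate one page.  Before
+     * reusing the page, we still must ensure every TLB flushed it. */
+    void tlb_shootdown(struct proc *p, uintptr_t va)
+    {
+        for (int i = 0; i < p->procinfo->num_vcores; i++) {
+            uint32_t pcoreid = p->procinfo->vcoremap[i].pcoreid;
+            send_kernel_message(pcoreid, __tlb_flush_page, (long)p,
+                                (long)va, 0, KMSG_IMMEDIATE);
+        }
+    }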
+
+7.1 Initial Stuff
+---------------------------
+One big issue is whether or not to wait for a response from the other vcores
+that they have unmapped. There are two concerns: 1) page reuse and 2) user
+semantics. We cannot give out the physical page while it may still be in a
+TLB (even to the same process; ask us about the pthread_test bug).
+
+The second case is a little more detailed. The application may not like it if
+it thinks a page is unmapped or protected and it does not generate a fault. I
+am less concerned about this, especially since even if we don't wait to hear
+from every vcore, we know that the message was delivered and the IPI sent.
+Any cores that are in userspace will have trapped and will eventually handle
+the shootdown before having a chance to execute other user code. Any delays
+in the shootdown response are due to being in the kernel with interrupts
+disabled (it is an IMMEDIATE kmsg).
+
+7.2 RCU
+---------------------------
+One approach is similar to RCU. Unmap the page, but don't put it on the free
+list. Instead, don't reallocate it until we are sure every core (possibly
+just the affected cores) has had a chance to run its kmsg handlers. This time
+is similar to the RCU grace period. Once the period is over, we can then
+truly free the page.
+
+This would require some sort of RCU-like mechanism and probably a per-core
+variable that has the timestamp of the last quiescent period. Code caring
+about when this page (or pages) can be freed would have to check on all of the
+cores (probably in a bitmask for what needs to be freed). It would make sense
+to amortize this over several RCU-like operations.
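+
+A sketch of the quiescence check (the per-core timestamps and comparison
+against the unmap time are assumptions about how this could work):
+
+    /* Sketch: RCU-style grace period for page reuse.  Each core bumps
+     * its last_quiescent stamp whenever it runs its kmsg handlers. */
+    uint64_t last_quiescent[MAX_NUM_CORES];
+    extern int num_cores;
+
+    bool page_reusable(uint64_t unmap_time)
+    {
+        for (int i = 0; i < num_cores; i++)
+            if (last_quiescent[i] < unmap_time)
+                return FALSE;   /* core i hasn't quiesced yet */
+        return TRUE;            /* every TLB had a chance to flush */
+    }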
+
+7.3 Checklist
+---------------------------
+It might not suck that much to wait for a response if you already sent an IPI,
+though it incurs some more cache misses. If you wanted to ensure all vcores
+ran the shootdown handler, you'd have them all toggle their bit in a checklist
+(unused for a while, check smp.c). The only one who waits would be the
+caller, but there still are a bunch of cache misses in the handlers. Maybe
+this isn't that big of a deal, and the RCU thing is an unnecessary
+optimization.
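+
+Usage would be roughly as below (the checklist API names and the target
+mask are assumptions based on the old checklist code):
+
+    /* Sketch: the caller waits until every target core checks in. */
+    checklist_t shootdown_cl;
+
+    void sync_shootdown(struct proc *p, uintptr_t va)
+    {
+        /* target_cores_mask: which cores we expect to respond */
+        commit_checklist_wait(&shootdown_cl, &target_cores_mask);
+        /* ... send the shootdown kmsgs, as in section 7 above ... */
+        waiton_checklist(&shootdown_cl);  /* all bits toggled back */
+    }
+
+    /* Each shootdown handler calls down_checklist(&shootdown_cl)
+     * after its invlpg, toggling its bit. */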
+
+7.4 Just Wait til a Context Switch
+---------------------------
+Another option is to not bother freeing the page until the entire process is
+descheduled. This could be a very long time, and it also messes with
+userspace's semantics. Userspace would be running code that could still
+access the old page, so in essence this is a lazy munmap/mprotect. The
+process basically has the page in purgatory: it can't be reallocated, and it
+might be accessible, but it can't be guaranteed to work.
+
+The main benefit of this is that you don't need to send the TLB shootdown IPI
+at all - so you don't interfere with the app. In return, though, the app gets
+possibly weird semantics. One aspect of these weird semantics is that the
+same virtual address could map to two different pages - that seems like a
+disaster waiting to happen. We could also block that range of the virtual
+address space from being reallocated, but that gets even more tricky.
+
+One issue with both just waiting and the RCU approach is memory pressure. If
+we actually need the page, we will need to force an unmapping, which sucks a
+little.
+
+7.5 Bulk vs Single
+---------------------------
+If there are a lot of pages being shot down, it'd be best to amortize the cost
+of the kernel messages, as well as the invlpg calls (single-page shootdowns).
+One option would be for the kmsg to take a range, and not just a single
+address. This would help with bulk munmap/mprotects. Depending on the number
+of pages, perhaps a raw tlbflush (of the entire TLB) would be worthwhile,
+instead of n single shots. Odds are, that number is arch and possibly
+workload specific.
+
+For now, the plan will be to send a range and have them individually shot
+down.
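+
+A handler for such a ranged kmsg might look like this (the threshold is a
+made-up tuning knob; invlpg(), tlbflush(), and PGSIZE are assumed helpers):
+
+    /* Sketch: shoot down a range page by page, unless the range is big
+     * enough that flushing the whole TLB is cheaper.  The threshold is
+     * arch/workload specific and invented here. */
+    #define FULL_FLUSH_THRESHOLD 64     /* pages; an assumption */
+
+    void __tlb_flush_range(uintptr_t start, uintptr_t end)
+    {
+        if ((end - start) / PGSIZE > FULL_FLUSH_THRESHOLD) {
+            tlbflush();                 /* flush the entire TLB */
+            return;
+        }
+        for (uintptr_t va = start; va < end; va += PGSIZE)
+            invlpg((void*)va);          /* single-page shootdown */
+    }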
+
+7.6 Don't do it
+---------------------------
+Either way, munmap/mprotect sucks in an MCP. I recommend not doing it, and
+doing the appropriate mmap/munmap/mprotects in _S mode. Unfortunately, even
+our crap pthread library munmaps on demand as threads are created and
+destroyed. The vcore code probably does too, in the bowels of glibc's TLS
+code, though at least that isn't on every user context switch.
+
+7.7 Local memory
+---------------------------
+Private local memory would help with this too. If each vcore has its own
+range, we won't need to send TLB shootdowns for those areas, and we won't have
+to worry about weird application semantics. The downside is we would need to
+do these mmaps in certain ranges in advance, and might not easily be able to
+do them remotely. More on this when we actually design and build it.
+
+7.8 Future Hardware Support
+---------------------------
+It would be cool and interesting if we had the ability to remotely shootdown
+TLBs. For instance, all cores with cr3 == X, shootdown range Y..Z. It's
+basically what we'll do with the kernel message and the vcoremap, but with
+magic hardware.
+
+7.9 Current Status
+---------------------------
+For now, we just send a kernel message to all vcores to do a full TLB flush,
+and we don't worry about checklists, waiting, or anything. This is due to
+being short on time and not wanting to sort out the issue with ranges.
+Eventually, we'll change it to send the kmsg with the range to the
+appropriate cores, and then maybe put the page on the end of the freelist
+(instead of the head). More to come.
+
+8. TBD