FD Taps
===========================
2015-07-27 Barret Rhoden (brho)

Contents
---------------------------
What are FD Taps?
Where are the FD Taps?


What are FD Taps?
---------------------------

Where are the FD Taps?
---------------------------
### Basics ###
In Linux, the epoll blob (I think this is the struct eventpoll) is attached to
the file.  Linux can get from a sock -> socket -> file -> eventpoll.  From the
lower levels of the networking stack, you can get all the way to the accounting
info for epoll.

In Akaros, and in Plan 9, the analogous object to the file is the chan.
However, in the networking stack, the conversation (like a struct sock) does not
keep a pointer to its chan.  Further, there is not a 1:1 correspondence between
convs and chans: there could be several chans using the same conv, similar to
using several OS files for the same underlying disk file (inode).  Although that
might be a bad idea for network connections, it'd be nice to not have FD Taps
assume anything about the underlying device.  So for Akaros, we want to have the
tap somewhere within the device.  For #I, that probably means hanging off the
conversation.  For #M (devmnt), it would be some other struct, where the tap is
translated into a 9p message.

Another aspect of this issue is that these are "FD" taps, not "file/chan" taps.
If you read through the Q&A for epoll's man page, there are a bunch of weird
conditions that result from having the tap on the file.  This is due to having
multiple FDs point to the same file.

The approach I took in Akaros was to have the tap in both the FD and within the
device (the conversation).  If we're declaring interest in an FD, the FD is a
reasonable place to track that interest.  We also need to track the tap within
the device, as mentioned above.  Now we need to sort out the registration of
taps and avoid any concurrency issues.

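As a rough picture of that split, here is a minimal C sketch of a tap that is
reachable from both the FD and the device's conversation.  Every struct, field,
and function name below is a hypothetical stand-in, not an actual Akaros
definition, and there is no locking at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the actual Akaros definitions. */
struct conv;

struct fd_tap {
	int fd;                 /* FD the user declared interest in */
	int filter;             /* requested event filter bits */
	struct conv *conv;      /* device object the tap hangs off of */
	struct fd_tap *next;    /* link in the conv's list of taps */
};

struct file_desc {
	void *chan;             /* stands in for the chan */
	struct fd_tap *tap;     /* at most one tap per FD */
};

struct conv {
	struct fd_tap *taps;    /* several FDs/chans may share one conv */
};

/* Link a tap into both places.  Returns -1 if the FD already has a tap. */
int attach_tap(struct file_desc *fd, struct conv *cv, struct fd_tap *tap)
{
	assert(fd && cv && tap);
	if (fd->tap)
		return -1;
	tap->conv = cv;
	tap->next = cv->taps;
	cv->taps = tap;
	fd->tap = tap;
	return 0;
}
```

Two FDs on the same conv each get their own tap, and the conv sees both.  The
hard part, making those two updates safe against concurrent operations, is what
the Code Issues section works through.
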
### Code Issues ###
We need to worry about a few things.  Overall, we want to register a tap on an
FD (struct file_desc), and that registration needs to go through the device.
Perhaps the device doesn't support taps, or it doesn't support the event filters
we requested.  So we need to handle registration failure.  We also need to
handle concurrent deregistrations, re-registrations, opens, and closes.

A basic approach would be to lock the FD table, make sure there's only one tap,
register the new one with the device, insert into the table, and unlock.  The
lock protects adding the tap (there can be only one, and adders race on the FD's
tap pointer), handles concurrent tap removals, ensures the FD points to a file,
and protects against FD closes.

But the problem is the FD table lock is a spinlock, and we don't want it to be
more than that.  Device registration could be a blocking call.  So we need to
come up with something else.  Part of the problem involves syncing with two
places: the FD and the conv.

At this point I thought about putting the tap in the device, and not the FD at
all.  Deregistration becomes tricky.  We want to destroy the tap when the FD
closes, or at least turn it off.  Say we do something like "after closing,
deregister the tap".  We could pass enough info to the device to make it work -
we'd probably want to pass in the FD (integer), proc*, and probably the chan.
However, once we closed, the FD is now free, and we could have something like:
        Trying to close:        User opens and taps a conv:
        close(5) (FD 5 was 1/data with a tap)
                                open(/net/tcp/1/data) (get 5 back)
                                register_fd_tap(5) (two taps on 5, might fail!)
        deregister_fd_tap(5)
        cclose (needed to keep the chan alive)
At the end, we might have no taps on 5.  Or if we opened 2/data instead of
1/data, the deregister_fd_tap call will accidentally deregister from the new FD
5 instead of the old one, and the old one will still be active!

Maybe we deregister first, then close, to avoid FD reuse problems.  Remember
that the only locking goes on in close.  Now consider:
        Trying to close:        User tries to add (another) tap:
        deregister_fd_tap(5)
                                register_fd_tap(5)
        close(5) (was 1/data with a tap)
Now we just closed with a tap still registered.  Eventually, that FD tap might
fire.  Spurious events are okay, but we could run into issues.  Say the evq in
the original tap is no longer valid.  It was buggy for the user to perform this
operation, but there are probably other issues.  And we didn't even get into
how registration works (register before putting it in the FD table?  After?
What about concurrent ops?)

We could flag the FD as 'untappable'.  But it seems that we're going to need to
sync with the FD table regardless of where the tap exists.  We might as well go
back to the original plan of having the tap hang off the FD in some manner.  It
makes the most sense, aesthetically, since the FD tap is an attribute of the FD.

One trick that would help with FD reuse is to have the device op for
register/deregister take the fd_tap pointer.  Not only can we squeeze more info
in the tap without mucking with the function signature, but the main benefit is
that so long as the FD tap is allocated, it is unique.  FD = 5 can be reused.
FD_tap = 0xffff800012345678 is unique.

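To make the FD-number versus tap-pointer point concrete, here is a small C
sketch of a device-side tap list.  The names (conv_add_tap, conv_count_fd,
conv_del_tap) and the fixed-size array are assumptions for illustration only;
the real device code may look nothing like this.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TAPS 4

/* Hypothetical stand-ins for illustration only. */
struct fd_tap {
	int fd;                 /* FD number at registration time */
};

struct conv {
	struct fd_tap *taps[MAX_TAPS];
};

/* Register a tap with the conv; returns -1 if there is no free slot. */
int conv_add_tap(struct conv *cv, struct fd_tap *tap)
{
	for (int i = 0; i < MAX_TAPS; i++) {
		if (!cv->taps[i]) {
			cv->taps[i] = tap;
			return 0;
		}
	}
	return -1;
}

/* How many registered taps would a bare FD number match? */
int conv_count_fd(const struct conv *cv, int fd)
{
	int nr = 0;

	for (int i = 0; i < MAX_TAPS; i++) {
		if (cv->taps[i] && cv->taps[i]->fd == fd)
			nr++;
	}
	return nr;
}

/* Deregister by tap pointer: removes exactly that tap, or returns -1. */
int conv_del_tap(struct conv *cv, const struct fd_tap *tap)
{
	assert(tap);
	for (int i = 0; i < MAX_TAPS; i++) {
		if (cv->taps[i] == tap) {
			cv->taps[i] = NULL;
			return 0;
		}
	}
	return -1;
}
```

If FD 5 is closed and reused before the old tap is deregistered, the conv can
briefly hold two taps that both say fd = 5, so the integer alone is ambiguous.
The pointer still names exactly one of them.
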
However, simply adding the tap pointer to register() isn't enough.  Say we did
the basic "lock the FD table, (basic checks), attach the pointer, unlock, call
device register, then free it if register fails", and a dereg locks the table,
yanks it out, then calls device dereg, then frees.  We still have some issues:

- What if a deregister occurs while we are still trying to register, and the
  register fails?  Who actually frees the FD tap?  We can't completely free it
  while the other op is in progress.  That sounds like a job for a kref on the
  FD tap.

- What if we add the tap, go to register, and it fails, and then a concurrent
  close tries to deregister it?  Now we have concurrent deregisters.  We can
  deal with this by having the device op accept spurious deregisters, but
  that's ugly (and unnecessary, see below).

- What if a legit deregister occurs while we are registering and eventually will
  succeed?  Say:
                                                sys_register_fd_tap(0xf00)
                                                adds to fdset, unlocks
        close(5)
        yanks 0xf00 from the fd
        deregister tap 0xf00 (fails, spurious)
                                                register tap(chan, 0xf00)
        free 0xf00?
The deregister fails, since the tap was never there (remember we said it could
have spurious deregister calls).  Then register happens.  But the FD is closed!
And then who is freeing the tap?  Hopefully we don't free it while the device
still has a pointer...

The issue here is the assumption that the tap would have been registered.  Since
we unlock the FD table, we can violate those assumptions.  We want to guarantee
the order of register/deregister operations, such that register happens before
deregister.

It turns out that the kref can do this too!  The trick is to use the release
operation to do the deregistration.  That ensures that so long as a reference is
held, we won't call deregister *and* that deregister will happen exactly once.
close() simply becomes "lock the FDT, remove the tap, unlock, decref": extremely
simple.  Note that decref could trigger the release method which could then
sleep (since it calls into a device), so we decref outside the lock.  register()
ups the refcnt by two, one for itself to keep the tap alive (and prevent a
concurrent dereg) and one for the pointer in the FD table.

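The kref scheme above can be sketched in userspace C roughly as follows.  This
is only an illustration under assumptions: the kref, the spinlock helpers, the
16-entry table, and the dev_register/dev_deregister stand-ins are all made up
here and are not Akaros's real API.  It also assumes registration succeeds, and
a real release method would free the tap as well.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; names do not match Akaros's real API. */
struct kref {
	atomic_int refcnt;
	void (*release)(struct kref *);
};

struct fd_tap {
	struct kref kref;       /* first member: a kref* is also an fd_tap* */
	int fd;
	int registered;         /* device-side state, for the demo */
	int dereg_calls;        /* times the device saw a deregister */
};

static atomic_flag fdt_lock = ATOMIC_FLAG_INIT; /* FD table spinlock */
struct fd_tap *fdt_taps[16];                    /* one tap slot per FD */

static void fdt_spin_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&fdt_lock,
	                                         memory_order_acquire))
		;
}

static void fdt_spin_unlock(void)
{
	atomic_flag_clear_explicit(&fdt_lock, memory_order_release);
}

static void kref_put(struct kref *k)
{
	/* fetch_sub returns the old value: 1 means we dropped the last ref */
	if (atomic_fetch_sub(&k->refcnt, 1) == 1)
		k->release(k);
}

/* Device op stand-ins; the real ones could block. */
static int dev_register(struct fd_tap *tap)
{
	tap->registered = 1;
	return 0;
}

static void dev_deregister(struct fd_tap *tap)
{
	tap->registered = 0;
	tap->dereg_calls++;
}

/* Runs exactly once, when the last reference is dropped. */
static void tap_release(struct kref *k)
{
	dev_deregister((struct fd_tap *)k);
}

int register_fd_tap(struct fd_tap *tap)
{
	int ret;

	assert(tap->fd >= 0 && tap->fd < 16);
	atomic_init(&tap->kref.refcnt, 2);  /* ours + the FD table's */
	tap->kref.release = tap_release;
	fdt_spin_lock();
	if (fdt_taps[tap->fd]) {            /* only one tap per FD */
		fdt_spin_unlock();
		return -1;
	}
	fdt_taps[tap->fd] = tap;
	fdt_spin_unlock();
	ret = dev_register(tap);            /* may block: lock not held */
	kref_put(&tap->kref);               /* drop our own ref */
	return ret;
}

void close_fd_tap(int fd)
{
	struct fd_tap *tap;

	fdt_spin_lock();
	tap = fdt_taps[fd];
	fdt_taps[fd] = NULL;
	fdt_spin_unlock();
	if (tap)
		kref_put(&tap->kref);       /* release may sleep: no lock */
}
```

Because register holds its own reference across the device call, a close that
races in can only drop the table's reference.  The deregister then runs from
whichever decref happens last, which is always after the register call returns.
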
Note that as soon as we unlock, our tap could be decref'd and a completely new
tap could be added and registered for that FD.  That means the following can
happen:
        lock FDT
        add tap 0xf00 to FD 5
        unlock FDT
                                                lock FDT
                                                remove tap from FD 5
                                                unlock FDT
                                                decref 0xf00
                                                (new syscall)
                                                lock FDT
                                                add tap 0xbar to FD 5
                                                unlock FDT
                                                register tap 0xbar for FD 5
        register tap 0xf00 for FD 5
        decref and trigger a deregister of 0xf00

In this case the device could see two separate taps (0xf00 and 0xbar) for the
same FD (5).  It just so happens that one of them will deregister soon.  It is
also possible for an event to fire between the left column's register and
decref, at which point two events would be created (possibly with the same evq
and event id).

The final case to consider is when registration fails.  To keep things simple
for the device, we can make sure that we only deregister a tap if our register
succeeded.  To do this nicely with krefs, we can simply change the release
method, based on whether or not registration succeeds.

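One way the "change the release method" idea might look is sketched below in C.
Everything here is an assumption for illustration: the two release functions,
the single fd_slot standing in for the FD's tap pointer, and the dev_ok and
dev_fail device stubs.  The FD table lock is left out to keep it short, and
"freed" is just a flag instead of a real free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; the FD table lock is elided for brevity. */
struct kref {
	atomic_int refcnt;
	void (*release)(struct kref *);
};

struct fd_tap {
	struct kref kref;       /* first member: a kref* is also an fd_tap* */
	int dereg_calls;        /* times the device saw a deregister */
	int freed;              /* stands in for actually freeing the tap */
};

struct fd_tap *fd_slot;         /* the FD's single tap pointer */

static void kref_put(struct kref *k)
{
	if (atomic_fetch_sub(&k->refcnt, 1) == 1)
		k->release(k);
}

/* Release for a tap the device never accepted: just free it. */
static void tap_free_only(struct kref *k)
{
	((struct fd_tap *)k)->freed = 1;
}

/* Release for a registered tap: deregister exactly once, then free. */
static void tap_dereg_and_free(struct kref *k)
{
	((struct fd_tap *)k)->dereg_calls++;
	((struct fd_tap *)k)->freed = 1;
}

/* Device op stand-ins: one accepts the tap, one rejects it. */
int dev_ok(struct fd_tap *tap) { (void)tap; return 0; }
int dev_fail(struct fd_tap *tap) { (void)tap; return -1; }

int register_tap(struct fd_tap *tap, int (*dev_register)(struct fd_tap *))
{
	int ret;

	atomic_init(&tap->kref.refcnt, 2);  /* ours + the FD slot's */
	tap->kref.release = tap_free_only;  /* nothing registered yet */
	fd_slot = tap;
	ret = dev_register(tap);
	if (ret == 0) {
		/* We still hold our ref, so release cannot have run yet. */
		tap->kref.release = tap_dereg_and_free;
	} else if (fd_slot == tap) {
		/* Undo the FD slot, unless a close already did. */
		fd_slot = NULL;
		kref_put(&tap->kref);
	}
	kref_put(&tap->kref);               /* drop our own ref */
	return ret;
}

void close_tap(void)
{
	struct fd_tap *tap = fd_slot;

	fd_slot = NULL;
	if (tap)
		kref_put(&tap->kref);
}
```

Swapping the release pointer is safe in this sketch only because register still
holds its own reference at that point, so no other decref can be the last one.
With that, the device sees a deregister if and only if its register succeeded.
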