Unorganized Notes on the Virtual File System
-------------------------------------
Our VFS is built similarly to Linux's, mostly because I'd like to have somewhat
easy integration with ext2 (or at the very least have our own FS that can
integrate into Linux easily).

There are four main objects in a filesystem: superblock, inode, dentry, and a
file.
        - Superblock: Specific instance of a mounted filesystem.  All
          synchronization is done with the one spinlock.
        - Inode: represents a specific file
        - Dentry: in-memory object, corresponding to an element of a path.
          E.g. /, usr, bin, and vim are all dentries.  All have inodes.  Vim
          happens to be a file instead of a directory.
        - File: represents a file opened by a process.

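As a rough sketch in C (field names here are illustrative, not the actual
Akaros definitions), the relationships look like:

/* Hedged sketch of the four objects; all field names are assumptions. */
struct super_block {
	int			s_lock;		/* the one spinlock for SB sync */
	struct dentry		*s_root;	/* dentry of this FS's "/" */
};

struct inode {
	unsigned long		i_ino;		/* identifies the FS-file */
	struct super_block	*i_sb;		/* the mounted FS it lives in */
};

struct dentry {
	char			*d_name;	/* one element of a path */
	struct dentry		*d_parent;	/* walk back toward "/" */
	struct inode		*d_inode;	/* the file this name refers to */
};

struct file {
	struct dentry		*f_dentry;	/* the name used to open it */
	unsigned long		f_pos;		/* per-open state: file offset */
};
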
So far, dentries are the most complicated, so here are some notes about them.
These are probably self-consistent, but may be a bit old (or just wrong) because
I wrote them as I put the VFS together:

A dentry is just a connection between a part of a path and an inode.  We just
want to walk from dentry to dentry, asking each to find the next part in the
path.  When you hit a dentry that's not the end of the desired path, you ask its
inode for the next dentry by using its operation (lookup).

lookup() takes the inode of the directory (or 0 for the root) and a dentry
(already allocated, with the d_name and i_name filled in), and it will find the
correct inode for the given dentry, as well as filling out any FS-specific
things in the dentry (like d_ops).  It will return 0 on failure, or the dentry
pointer passed in on success.  Somehow the nameidata bit will be used too.  This
will probably change a bit...  Note that lookup() needs to read the actual
directory file and also look up the inode off the disc, which means it will
block.

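In C, the operation described above might look like this (a sketch; the exact
Akaros signature may differ):

struct inode;
struct dentry;
struct nameidata;

/* Sketch: 'dir' is the directory's inode (or 0 for the root); 'dentry' comes
 * in pre-allocated with its name filled in.  On success, the FS fills in the
 * dentry's inode and FS-specific fields (like d_ops) and returns the dentry
 * pointer it was passed; on failure it returns 0.  May block on disk IO. */
struct dentry *(*lookup)(struct inode *dir, struct dentry *dentry,
                         struct nameidata *nd);
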
When the next dentry is a mountpoint (e.g. /mnt/cdrom), when you ask mnt for
cdrom, lookup() will parse its file (a directory) and see 'cdrom' there as a
child entry.  Then it will return cdrom's dentry, like always.

But since cdrom was a mountpoint (which you can see from d_mount_point), you
need to walk through the structures a bit to get the dentry of the FS rooted at
that mountpoint to keep walking.  The VFS can handle that, so lookup() doesn't
need to worry about it.

Why are there two dentries associated with a vfsmount?  Normally, a dentry for a
directory points to an inode that contains its members.  When a FS is mounted
over it, we don't want to destroy that.  Instead, we mark it as a mountpoint,
and when following, we walk into the next FS.  The mountpoint is the dentry in
the parent mount (using the parent mount's FS superblock), and the root is the
dentry of the mounted FS, using the FS superblock of the mounted FS.

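Concretely, a vfsmount ties the two dentries together, something like this (a
sketch; the field names follow Linux's convention and are assumptions here):

struct dentry;
struct super_block;

struct vfsmount {
	struct dentry		*mnt_mountpoint; /* dentry in the PARENT mount,
						  * using the parent's SB */
	struct dentry		*mnt_root;	 /* "/" dentry of the mounted
						  * FS, using the mounted SB */
	struct super_block	*mnt_sb;	 /* SB of the mounted FS */
};
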
Sometimes lookup() will not be called at all.  We want to look for present
dentries first (ones that exist in memory already, including all of those with a
refcnt).  So we can look in the cache (currently just an SLIST, but will
eventually be a hash table).  When hitting in the cache (hashed by dentry name,
and probably the parent dir), we need to verify that the dentry's parent is our
own (handled by the VFS code).  All vfsmount dentries will be in the cache,
since they all have a refcnt (the vfsmount struct points to them).

Dentries for / and pwd and whatnot have a refcnt (pointed to by fs_struct,
vfsmounts, etc).  Anything with a pointer to it has a refcnt, so it never goes
away.  So if we have an inode in memory, its entire dentry path is cached
(just follow the parent pointers back).  Note that when a dentry's refcnt hits
0, we do *not* deallocate (like we do with many other structures).  It will just
be on the LRU list in the dcache system.  Likewise, every dentry points to its
inode, which pins that inode in memory.

Other refcnts: just about everything has a refcnt, and we need to be careful
about when we want to use them and dealloc.  Some things would be a pain in the
ass, like with the super_block refcnt.  Every dentry has a ref to the SB, but
changing that ref every time we add or remove a dentry will probably be an
unnecessary penalty if we can ensure all dentries of that FS are gone before
removing the superblock through another mechanism.  We'll see.  Mostly, this
just means we need to think about what we really want from a refcnt, and whether
or not we want the kref / process style refcnting.

Mounting:
-------------------------------------------
When you mount, you need to read in the super block and connect the relevant
data structures together.  The SB is connected to the vfsmount, which is
connected to the dentry of the mount point and the dentry of the root of the FS.
This means when mounting the FS, we need to create the dentry for "/", which
means we also need the inode, which needs to be read_inode()'d in.  Actually, we
might not need to read the inode in right away - we might be able to get away
with reading them in on-demand.

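In pseudocode, the wiring looks roughly like this (every helper name here is
hypothetical, reusing the vfsmount sketch above, just to show the connections):

/* Hedged sketch - alloc_vfsmount(), read_super(), and create_root_dentry()
 * are made-up names for illustration. */
struct vfsmount *do_mount(struct block_device *bdev, struct dentry *mnt_pt)
{
	struct vfsmount *vmnt = alloc_vfsmount();
	struct super_block *sb = read_super(bdev);	/* read in the SB */

	vmnt->mnt_sb = sb;
	vmnt->mnt_mountpoint = mnt_pt;		/* dentry in the parent FS */
	mnt_pt->d_mount_point = TRUE;
	/* Create the dentry for "/", which needs its inode read_inode()'d in
	 * (or, possibly, read in later on-demand): */
	vmnt->mnt_root = create_root_dentry(sb);
	sb->s_root = vmnt->mnt_root;
	return vmnt;
}
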
All of this means that for every mount point, the SB, vfsmount, dentry, and
inode are in memory.  Due to the way dentries link, every dentry and inode back
up to the real root are all in memory too.  Having a mount point is like having
a process working in that directory - the chain back up is pinned.

d_subdirs:
-------------------------------------------
Tracking the links between objects can be tricky.  One pain is d_subdirs.  Linux
only tracks subdirectories.  We also do this.  I think the reason they do it is
since you should be using the dcache to cache lookups, and not scan the linked
list of children of a dentry for a specific file.  Though it is handy to know
all of your *directory* children.  In KFS, we also track all children in a list.
This is to make our lookups work - instead of having an actual directory file
with name->ino mappings.

KFS and metadata pinning:
-------------------------------------------
KFS pins all metadata ("all paths pinned").  This means that from the root mnt
down to the lowest inode, all dentries and corresponding inodes are pinned in
memory from creation time.  Yeah, this means we have chunks of metadata for
files we aren't using sitting around in RAM.  We also have the *files* sitting
around in RAM too.  Not that concerned, for now.  Plus, I don't want to reparse
the CPIO backing store to figure out inode fields, directory names, etc.

Page Cache:
-------------------------------------------
Every object that has pages, like an inode or the swap (or even direct block
devices) has a page_map tracking which of its pages are currently in memory.
This is called a struct address_space in Linux, which is a confusing name.  We
don't have all of the same fields yet, and may be doing things slightly
differently, but the basics are the same.  Together, the page_maps and the
functions to manipulate them make up the Page Cache.  Every page frame that is
in a page mapping can be traced back to its page_map, via pointers in the struct
page.  Note the page_mapping is tracked twice for a file, the f_mapping and the
i_mapping.  We really only need the i_mapping, but this saves a couple cache
misses.  Might go away later.

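A minimal sketch of the page_map idea (field names are illustrative, not the
actual definitions):

struct radix_tree;
struct inode;

struct page_map {
	struct radix_tree	*pm_tree;	/* index -> struct page */
	struct inode		*pm_host;	/* object the pages belong to */
	unsigned long		pm_num_pages;	/* pages currently in memory */
};

struct page {
	struct page_map		*pg_mapping;	/* trace a frame to its map */
	unsigned long		pg_index;	/* which page of the object */
	int			pg_flags;	/* PG_LOCKED, PG_UPTODATE, ... */
};
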
As a side note, Linux's "address_space" has a great example of the power of
their linked lists.  Their struct has a private_list head.  Due to their list
implementation, they are able to have a generic list head for a list of any
type (it's a struct list_head), and don't need to declare in a generic object
(like the page_map) a specific list type (that would get used by specific
FSs).

Just because a page is in a page_map, it doesn't mean it actually has the
data from the disc in it.  It just means that there is a physical frame
dedicated/mapped to be a page_map's holder for a specific page of an object
(usually a file on disc).  readpage() is called to fill that page in with what
it is supposed to have from its backing store.

This interpretation makes the meaning of "address space" make more sense.  It's
the "address space" of the device, mapping from (file,index) -> page_frame.
Still, calling it an address space just confuses things with the virtual memory
address space.

One of the reasons pages can be in the map without up-to-date data is due to
concurrency issues and outstanding I/O requests.  When the page is first being
filled up, the mapping exists but the data hasn't been brought in yet.  Other
processes can be trying to access the same block/page/byte, and they need to
block, but without scheduling the operation a second time.

So here's how a typical page fault (__handle_page_fault(), either on demand or
populated) works on an mmap'd file at a high level (a code sketch follows).
1. PF is on a virt address -> translated by the vm_region to a file/offset (page).
1b. Weird permission issues?  See below!
2. Look it up in the page_map.
3. If the page is already there, and up-to-date, then great.  Skip to 6.  If
there is one, but it's not up to date, skip to 5.
4. If there is no page, get a free page, tie it (bidirectionally) to the inode
in the page_map.
5. Now there is a page, but it is not up to date, so call readpage().  This will
usually block.
6. Map that page (which has the current contents) into the address space of the
calling process (with the appropriate permissions: RO (MAP_PRIVATE, CoW), or
RW (MAP_SHARED)).
Below: Now if in step 1 you had a permissions issue, such as a RW fault on a CoW
MAP_PRIVATE, you'll have to notice the type of error and the type of memory
region, then go through a separate path: get a new page, copy the contents, and
change the mapping.  Note, the page backing that mapping is not backed by the
file - it's just sitting around in the virtual memory of the process.

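Roughly, in code (a hedged sketch; every name below except readpage() is a
hypothetical stand-in):

/* Sketch of the fault flow above; helper names are stand-ins. */
static int fault_file_page(struct vm_region *vmr, uintptr_t vaddr, int prot)
{
	unsigned long idx = (vaddr - vmr->vm_base) / PGSIZE;	/* step 1 */
	struct page_map *pm = vmr->vm_file->f_mapping;
	struct page *page = pm_find_page(pm, idx);		/* step 2 */

	if (!page) {						/* step 4 */
		page = kpage_alloc();
		pm_insert_page(pm, idx, page);	/* tie page <-> inode's map */
	}
	if (!(page->pg_flags & PG_UPTODATE))			/* step 5 */
		readpage(vmr->vm_file, page);		/* usually blocks */
	/* step 6: RO for MAP_PRIVATE (CoW), RW for MAP_SHARED */
	return map_page_in_vmr(vmr, vaddr, page, prot);
}
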
Also, if we want to use the PG_DIRTY flag, we'll need to mark the regions as RO
until we write fault, at which point we dirty the page and change it to RW.

We could have multiple threads trying to fill a page in the page cache at once.
This is handled in file_load_page().  All threads check the page cache.  If two
threads try to add it to the page cache, only one will succeed, and the page
will be locked (PG_LOCKED).  The one who succeeds will readpage().  The one that
didn't will be like any other thread that is checking the page cache - it will
see a page is there, and will check if the page is up to date.  If it isn't, it
will try to lock the page so it can do the IO, with a few extra checks in case
the page had been removed or was filled in while it slept.

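A sketch of that flow (the real code is in file_load_page(); the helper names
here are assumptions):

/* Two threads can race to add the page; only one insertion succeeds, and
 * that thread does the readpage().  Names are illustrative. */
static struct page *load_page(struct page_map *pm, unsigned long index,
                              struct file *file)
{
	struct page *page;

	while (1) {
		page = pm_find_page(pm, index);
		if (!page) {
			page = kpage_alloc();
			page->pg_flags |= PG_LOCKED;	/* we plan to do IO */
			if (pm_insert_page(pm, index, page)) {
				page_free(page);	/* lost the race */
				continue;		/* look it up again */
			}
			readpage(file, page);	/* marks up-to-date, unlocks */
		}
		if (page->pg_flags & PG_UPTODATE)
			return page;
		/* Someone else's IO is in flight; block until it finishes,
		 * then recheck (the page could have been removed, etc). */
		lock_page(page);
		unlock_page(page);
	}
}
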
A big aspect of this is the use of lock_page() to block.  If the page is locked,
you block until it is unlocked.  (Implementation and other issues are still
pending.)  Even the caller of readpage will lock after submitting the IO
request.  This will cause the caller to sleep until the IO is done.  When the IO
is done, that interrupt handler/thread will mark the page as up-to-date, and
unlock the page, which will wake up any of the waiters.  The caller of
readpage() may not be the first to wake up either.  This all means the IO system
needs to know about marking pages as up-to-date and unlocking them.  This
currently (Jul10) is just sitting in KFS, but can be done later either directly
or with a callback made by whoever starts the IO.

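One plausible shape for that primitive (an assumption, not the actual
implementation; the bit-op and wait-queue names are stand-ins):

/* Sketch of lock_page()/unlock_page() using a flag bit and a wait queue. */
void lock_page(struct page *page)
{
	while (test_and_set_bit(PG_LOCKED, &page->pg_flags))
		sleep_on(&page->pg_waiters);	/* block until it's unlocked */
}

void unlock_page(struct page *page)
{
	clear_bit(PG_LOCKED, &page->pg_flags);
	wake_up_all(&page->pg_waiters);	/* wake the readpage() waiters */
}

/* The IO completion path then conceptually does:
 *	page->pg_flags |= PG_UPTODATE;
 *	unlock_page(page);
 */
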
A note on refcnting.  When a page is added to the page cache, that's a stored
reference.  When you look up a page in the page cache, you get a refcnt'd
reference back.  When you pull a page from the page cache, you also get a
refcnt'd reference back - specifically it is the ref that was in the page map.

Files with Holes
--------------------------
If a file has a hole, we'll still try to look up the page in the page cache.
When that misses, we'll create and add a page, then call readpage().
readpage() will realize there is nothing on disk/backing store for that part of
the file (since it was a hole) and just memset the page to 0.  In the future, we
can consider getting a CoW 0-page, but that's a bit premature and a bookkeeping
pain.

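Inside a specific FS, the hole case of readpage() could look like this (a
sketch; the foofs_* helpers are made up):

/* Sketch of an FS's readpage() handling a hole; names are illustrative. */
int foofs_readpage(struct file *file, struct page *page)
{
	unsigned long blk = foofs_block_for(file, page->pg_index);

	if (!blk) {
		/* A hole: nothing on the backing store, so it reads as 0s */
		memset(page2kva(page), 0, PGSIZE);
	} else {
		foofs_read_block(blk, page2kva(page));	/* will block */
	}
	page->pg_flags |= PG_UPTODATE;
	unlock_page(page);	/* wake anyone sleeping on this page */
	return 0;
}
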
This also applies to trying to write to a block beyond the EOF.  If the request
hits the page cache and readpage(), it's because it was already checked and
cleared in another part of the VFS, such as in generic_file_write().

Files That End Before a Page Boundary
--------------------------
So what if we have a file that is only 1024 bytes?  When we read it in, we'll
have a whole page added to the page cache, with the extra 3K 0'd.  When we go to
write it back, it will write back the 1K, and ignore the rest.  But what if we
extend the file later?  That page is already in the page cache, but the buffer
heads aren't tracking the new 3K.  When that page gets evicted, only the 1K will
write back.

There are two ways to deal with this: 1) When we extend the file, check and see
if it is in the middle of a page, and if so, alloc the blocks (all FS
specific!), and adjust the BHs to map the new data.  2) Preallocate the
remaining blocks and stitch up the BH mapping at readpage (which is in the
specific FS).  This way, we reserve the blocks (but don't have to actually read
them in - we just mark the buffer as dirty).  When we grow a file, we don't need
to worry about any of these details.  We just make sure the page map has the
page, and not whether or not it's a half-page that needs a FS-specific solution.
I'm going with #2 for now.  Note that while the blocks are allocated, the file's
size is still 1024B.

Kref, Dentries, Inodes, and Files (or "I don't see why it's like X, but...")
--------------------------
There are multiple dentries pointing to an inode.  The dentries are (or will be)
cached too, but that is irrelevant.  The dentries pin the inodes in memory.
However, files pin inodes in memory (or they did) briefly.  After running around
in circles a bit, I asked, why doesn't the file pin the dentry instead of the
inode?  The answer: it is supposed to.  Linux does that, and I didn't because
pinning the inode made more sense at the time.

The heart of the issue is about understanding what files are and how they
relate to the rest of the VFS.  A 'file' in the OS is a structure to track an
open FS-disk-file, which is managed by the inode.  Given this, it makes sense
that a dentry (which is a name on a path) would be pinned by the corresponding
inode, and the file would pin the inode.  It doesn't, but it is believable.  In
reality, there are many names (dentries) for a given disk file, and the OS file
that you open corresponds to *one* of those names, and thus a dentry, and not to
the inode/specific file.  You need to go through the dentry to pin the inode.

In short, it's not: file -> inode -> dentry -> parent_dentry -> ...
It's file -> dentry -> parent_dentry ->
             |-> inode      |-> parent_inode
Another dentry and file (both OS and disk) can point to the same inode.  If you
don't do it this way, you can't pin up past the multi-dentry-point in the inode,
and your system doesn't really make sense.

So here is how it works: files pin dentries.  Dentries can pin other dentries,
on up the directory hierarchy.  Independently of the files, dentries pin their
inode.  There are many dentries per inode (or can be).  Since each dentry
doesn't know if it is the last dentry to decref the inode, we use a kref on
i_kref.  The inodes are storing references to the dentries, but they are the
kref "internal" / weak references.  Even if we did decref them, we wouldn't
trigger anything with it.

The moral of the story is that if you don't fully understand something, you are
not in as good of a position to recommend changes or criticize as if you did
your homework.  Not that you can't, just that you should 'do your homework.'

Musings on path_lookup()
--------------------------
Things can get tricky with path lookup, especially with ., .., and symlinks.
When doing a LOOKUP_PARENT on a . or .., we give the parent of whatever the path
would have resolved to.  So /dir1/dir2/dir3/.'s parent is dir2.
/dir1/dir2/dir3/..'s parent is dir1.  I don't think Linux does this (note the
parent lookup is for internal kernel stuff, like when you want to edit
metadata).  When you try to make a . or .. file, you should get some sort of
error anyways.  We'll see how this works out.

Symlinks can be a bit tricky.  We handle ours a bit differently too, especially
regarding PARENT lookups.  Ultimately, you can do the same things in ROS that
you can do in Linux - if you try to create a file that is a dangling symlink,
you'll correctly create the destination file.  We handle this in
link_path_walk().  It will return the PARENT of whatever you would resolve to -
instead of trying to handle this in do_file_open() (which I think Linux does).

Also, our handling of symlinks differs a bit from Linux.  Eventually, it has
become clear we're going to need to manually port ext2, and we do some things
differently in our core VFS already.  Might as well do more things differently -
like getting rid of follow_link and put_link from the FS-specific sections.  Our
FSs just need to know how to return a char* for a symname - and not do any of
the actual link following.  Or any of the other stuff they do.  We'll see if
that turns out to be an issue or not...

Unlinking and other Link Stuff
-------------------------
Unlinking is just disconnecting a dentry-inode pair from the directory tree, and
decreasing the inode's i_nlink.  Nothing else happens yet, since we need to keep
the FS-file (controlled by the dentry/inode) so long as any OS-files have it
open.  They have it opened via open or mmap - any way that there is a reference
to a file, which then pins the dentry and inode.  When the OS-files close,
eventually the dentry's refcnt hits 0.  When it does, it normally would be up
for caching, but we can check nlinks and just drop it.  When that happens, it
releases the inode, which will see its nlinks is 0.  That will trigger the
underlying FS to clear out the FS-file.

For directories, you can only have one hardlink to a directory - meaning you are
only in the directory tree in one place.  However, all of your children can get
to you by going '../'.  We'll count these as hardlinks too.  This means that
every child increments its parent-dir's nlink.  These are the on-disk links, not
to be confused with the dentry->d_parent and kref() business that goes on for
the in-memory objects.  A directory cannot be removed if nlinks > 1.  If it is
1, then you can rmdir it, which will set its nlinks to 0.  Then its inode's
storage space will get freed when it is deleted, like any other inode.  In
theory.

Inodes: drop? delete? dealloc?
--------------------------
Inodes exist both in memory and on disk, but in different manners.  When a file
(F_WHATEVER, could be DIR) exists in an FS, it'll have an inode on disk.  When
it is time to delete that file, we call _delete_inode().  When we want to free
the memory associated with an in-memory (VFS) inode, we call _dealloc_inode().

What about drop_inode?  For now, we don't use it.  We have inode_release() in
the VFS.  If we need an FS-specific one (like for ext2), or have FS-specific
work that needs to be done in inode_release(), we'll use it later.

Either way, inode_release() is called when we no longer use the in-memory inode.
If there are no hard links, it will delete the inode.  In either case, it will
free the in-memory inode (after deleting the disc version, if that happened).

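So the release path looks roughly like this (a sketch; the s_op hookup and
names are assumptions):

/* Sketch of inode_release(); triggered by the inode's kref hitting 0. */
void inode_release(struct kref *kref)
{
	struct inode *inode = container_of(kref, struct inode, i_kref);

	if (inode->i_nlink == 0)
		inode->i_sb->s_op->delete_inode(inode);	/* kill disc version */
	/* (writeback of a still-linked dirty inode would go here) */
	inode->i_sb->s_op->dealloc_inode(inode);	/* free the memory */
}
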
Example: I want to unlink (rm) a file.  There are two cases: the file is already
open (with a dentry and the inode in memory) or the file is not.  Both cases are
handled the same way!  In either case, we eventually call do_lookup on the item
in question, getting both a dentry and its inode read in.  (We read it in for a
couple reasons: convenient to check the type, and we need to manipulate the
nlink.)  If a process has the file open, or even if it is sitting in the cache,
we will get the same inode (from the inode cache; might not be implemented yet).
When we decref the dentry and it is done, it will decref the inode.  This
dentry's final decref will be deferred until any open files are closed.  Note,
this requires a working dentry/inode-cache - otherwise we'll have multiple
copies of the same FS/disk-inode (and possibly dentry).  Anyway, when this is
done, the release function will delete the inode, then dealloc it.

Another example: we simply close a file.  When that happens, we decref the
dentry, which decrefs the inode.  It may remain cached for a bit - not a big
deal.  When it is finally removed, nlinks is positive, so the inode's in-memory
copy is written back (if it was dirty) and the structure is deallocated.

Side notes: dentry-cached inodes should be removed after their lookup in unlink.
Also, since multiple dentries point to the same inode, it's not enough to just
cache dentries - we need to be able to find inodes too so that we get the one
inode regardless of which dentry we use (which may be uncached).

Dentry and Inode Caches
--------------------------
The dentry cache caches dentry lookups - we need the parent and a hash of the
name to do a lookup.  The dcache consists of a hash table for the lookups, as
well as an extra list of entries that are unused (their kref is 0).  The dcache
also caches negative entries - entries that were wrong.  This still speeds up
future requests.  Most uses of the system just need to use dcache_get and
dcache_put.  Not all of this is implemented yet.

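Typical use, sketched (dcache_get/dcache_put are the calls named above; the
rest of the names are illustrative):

/* Sketch of a lookup consulting the dcache first. */
static struct dentry *lookup_one(struct inode *dir, struct dentry *query)
{
	struct dentry *hit = dcache_get(dir->i_sb, query); /* name + parent */

	if (hit)
		return hit;			/* no FS lookup() needed */
	if (!dir->i_op->lookup(dir, query, 0))
		return 0;		/* miss; could dcache a negative entry */
	dcache_put(dir->i_sb, query);		/* cache it for next time */
	return query;
}
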
The inode cache is similar, though we can't just have the inodes hang off the
dentry cache (more than one dentry points to the same inode).  We don't need to
worry about unused lists or anything like that - once the kref hits 0, we're
done and we can rip it out of the hash.

Both hashes hang off the superblock, with concurrent access protected by locks
in the SB.

The dentry cache is the weirdest of them all - for normal entries, its key and
value are the same thing.  The actual hashing of a dentry is done by the qstr
value, and to determine equality, we need to compare parents (compared to the
inode cache, where the only thing that matters is the i_ino).  Put another way,
we need elements of the whole dentry to get a unique key (d_parent and d_name).

As stated above, the dcache also caches negative entries.  This is to prevent a
lookup on disk.  These negative entries are in the dcache and on the LRU list
(their refcnt is 0, they are not USED).  When we dcache_get, we don't bother
with returning the actual dentry (after increffing) and then decreffing it
again.  Instead, we just return the negative result (via the query dentry,
incidentally).

Freeing of dentries happens in one of two ways: call __dentry_free() directly,
which is appropriate when you have the only copy (like in do_lookup()), or it
will get freed when the dcache gets reaped (the LRU entries are freed).  When it
is decref'd, it simply goes into a state where it is ready to be reaped, but
kept around for future lookups - most usages throughout the VFS can just decref
when they are done using it.

One complication is cached negative dentries.  These are only referenced once
(in the dcache), so they can get __dentry_free()d directly.  This gets tricky
with rmdir and unlink.  Initially, those functions marked the dentry as negative
and unused, and would let them stay in the dcache (returning negative results on
future lookups).  The problem with this is that now the dcache could have a
negative dentry that was a real, formerly used dentry - one with a refcnt that
needs to be decref'd and released.

There are two solutions: one is to change the dcache to not assume that negative
entries are unreferenced (which also means on the LRU).  The other is to just
remove the dentry from the dcache on rmdir/unlink.  It won't be negative - and
that won't matter, since it is un-lookup-able.  And it will die nicely when it
gets decref'd.  All we'll do is add a DENTRY_DYING flag, and dentry_release()
will avoid putting it on the LRU or marking it unused.  The dcache can continue
to assume that negative entries are unused/LRU/dentry_freeable/ref==0, and not
worry about calling kref_put().

X: Buffer Cache, Page Cache, Buffer Heads, WTH?
-------------------------------
X.1: Page vs Buffer Cache
--------------------
So we discussed the page cache, but as described, it does not satisfy all of
our disk caching needs.  Traditionally, there would also be a 'buffer cache.'
Buffers usually refer to memory used to hold data from the disk (or network).
I can think of a couple different ways someone could think of the buffer
cache, and to understand them, we first need to understand what we need to be
caching.

There are two main types of FS data: file data and metadata.  This metadata
is FS-specific data accessed by the VFS to get to a file's contents.  For
example, the superblock, inodes, inode indirect blocks, and to a certain
extent the directory's contents are all metadata.  There isn't an FS file (not
to be confused with an OS file) that corresponds to this data, but it needs to
be read in and cached.  Furthermore, this cache needs to be managed and
written back when dirty.  Note that file data can be broken up into two
different types: read()/write() data and mmap'd data.  When people talk about
buffer caches versus page caches, they are talking about these two different
types of file data (at least NetBSD did:
http://www.usenix.org/event/usenix2000/freenix/full_papers/silvers/silvers_html/).

A very simple buffer cache would include anything *ever* read from disk.  It
would then get copied into the page cache in PGSIZE chunks.  This would suck,
since we now have two or three copies.  We obviously want to use the page cache
for both mmap() and read/write().  It's not clear about the metadata.

Another usage of a buffer cache would be to cache only the disk blocks needed
for metadata.  I think this is what Linux did before it unified its buffer and
page caches (implied in UTLK).  The main issue with this is that you have two
different systems for managing essentially similar data - we only want to deal
with flushing, syncing, and writebacks in one subsystem, not in multiple
different subsystems.

The solution is to use the page cache to cache the metadata blocks' buffers.
We do this by having the block device be 'mapped' (but not read) in PGSIZE
chunks through its own struct page_mapping (a.k.a. struct address_space in
Linux).  This way, both files and the block device are mapped in PGSIZE chunks
via the same page_mapping structure, and will be managed by the same code.

Sort of.  We don't actually read in PGSIZE chunks for the block buffer cache.
If blocks will be in the bdev's cache, then they will be adjacent and on the
same page.  It is possible some adjacent blocks (which would be on the same
page) are not both metadata.

A more accurate way to describe what we do is that metadata blocks are copied
into a 'buffer cache' that is mapped and managed similarly to the page cache.
Pages are made of buffer heads, which hold data, and the reclaiming of pages of
memory from either the page cache or the buffer cache will use the same code -
since they are both just made of buffer pages.

X.2: Mapping Blockdev Data vs File Data
--------------------
An important distinction between file data (as managed by an inode) and the
bdev is that PGSIZE chunks of data for the bdev *must* be made of contiguous
disk blocks.  Ideally, file data is also contiguous/sequential, but that is
not always the case - hence the need for the inode's block pointers.  This
means that the chunk at the bottom of the page_mapping radix tree is a page,
and on that page there may be several buffers, holding the data of
nonsequential disk blocks - but not all pages are like this.  The bdev
pages are made up of sequential blocks due to the very nature of what we are
mapping; there's no inode abstraction in between.

There are a couple ways we could handle this.  We adopt the Linux approach of
using something called a buffer head (BH), which describes the mapping from
in-memory buffer to block device / block number.  These are slab-allocated,
and exist for each buffer of a page.  The page itself points to the first of
its BHs, all of which exist in a linked list.  The maximum number of them is
determined by PGSIZE / blocksize.  Whenever there is an up-to-date page in the
page cache (meaning, in a page_mapping), it will have a BH.

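A sketch of the BH itself (field names are illustrative):

struct buffer_head {
	struct page		*bh_page;	/* the buffer page we live on */
	struct buffer_head	*bh_next;	/* next BH on this page (the LL) */
	struct block_device	*bh_bdev;	/* the mapping: which device... */
	unsigned long		bh_blocknum;	/* ...and which block on it */
	int			bh_flags;	/* BH_DIRTY, BH_UPTODATE, ... */
};
/* At most PGSIZE / blocksize of these per page, linked off the page. */
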
Another way would be to not have BHs at all, and just figure out (at
operation-time) what the n blocks on disk are for any given page, and submit
the IO operations for those blocks.  The BHs serve as a cache of that info.
They also serve as a location to store data about the mapping, such as whether
it is dirty or not, whether the data is up to date or not (has it been read
in), etc.  The caching ought to be rather beneficial, since it means we do not
need to access the disk-inode and indirect blocks whenever we want to perform
an operation (which may be frequent - imagine the periodic writeback of an
mmap'd file undergoing writes).  The per-buffer dirty tracking will also help:
no need to write unrelated blocks if only one is edited (though this will not
help with mmap(), since we don't usually know which part of the page is
written).

X.3: Do we always need BHs?
------------------
But what about those pages made up of contiguous blocks?  We don't have a
bunch of independent, non-sequential blocks that we need to map.  Do we need a
bunch of BHs for that?  Do we need any?  It really seems excessive to break it
up into separate buffers for no reason.  At the other extreme, we could get by
without having a BH at all, though this gets back to the other issue of
caching.  What we do (or will do) is have one BH for the entire page of
contiguous blocks.  If the page is a "buffer page," in Linux terms (meaning it
has separate buffers), it will have n BHs in a linked list.  Either way, we'll
always have the mapping handy.  We wouldn't need to re-verify the contiguous
nature of the blocks anyways, since the fact that the page was up to date and
didn't need a BH would mean it was contiguous.  Further benefits to using the
one BH include: 1) we are likely to make BHs be the unit of IO *submission*,
and having one handy will simplify that code; 2) some code paths within the VFS
may get BHs as a return value, which they can then dirty.  Always having a BH
makes this easier (no need to find out if it's a buffer page, then decide how
to dirty it).

Another complication with this is certain code will want a block in the middle
of a page (very common for metadata).  That code will get the BH for the
entire page back.  It will need to determine the starting block and then the
offset of the block it wants.  Note, this usage of the BHs is finding the
buffer corresponding to a block number.  The BH is a bidirectional mapping
between buffers and blockdev:blocknum.  These calculations are a minor
complication, but easy enough to do, and will probably be worth it.  The
tradeoff is against having multiple BHs, which would mean multiple block IO
requests for writing a single page (and writing the whole page may be a common
operation).

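The arithmetic is simple.  For example (a sketch, assuming one BH covering a
page of contiguous blocks; names are illustrative):

/* Find the buffer for absolute block number 'b' within a page covered by a
 * single BH. */
static void *block_buffer(struct buffer_head *bh, unsigned long b,
                          size_t blocksize)
{
	unsigned long first = bh->bh_blocknum;	/* page's starting block */

	assert(b >= first && b < first + PGSIZE / blocksize);
	return (char*)page2kva(bh->bh_page) + (b - first) * blocksize;
}
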
X.4: What about opening a blockdev as a file?
--------------------
Basically, don't do it for now.  The way we actually do things is a buffer cache
page is "up to date", but has no BHs or data until a specific block is
requested.  This is because we don't always know if all the blocks mapped by a
page are actually metadata blocks.  If they aren't, we'll have issues where we
read in extra blocks that exist in both a file's page cache and the block
device's buffer cache.

A safe invariant is that a given block will only ever be in one cache: either a
file's page mapping or the buffer cache's page mapping.  When these blocks are
freed, we'll need to rip them out of their mapping (and possibly flush them).

There is one easy way to avoid this: don't open a bdev file if a file system is
mounted on top of it.  If you do, don't be surprised about inconsistencies.
Ideally, the FS will never leave the actual disk in an inconsistent state, but
the bdev's page cache could read things at different times and get some weird
results.  Just don't do this (for now - not like I plan to make this possible
any time soon).

Could we use the same buffers for both the blockdev-file page mapping and the
inode-file page mapping?  No - because the inode-file has the inode
indirection in between, which means a PGSIZE chunk of the file might not be
contiguous (as mentioned above).

We could have tried to avoid this bdev file problem by having the page mapping
radix trees point to a set of BHs that describes that page worth of buffers,
so that the bdev and an inode pointing to the same data will use the same
buffers and BHs.  That won't work, since the file's buffers/blocks aren't in
the same order as they are on disk within a page.  Instead, we'd need to have
the page mapping go down to the FS's blocksize granularity, not to PGSIZE, so
that we could have independent leaves point to wherever they want.  This would
push the specific block device's blocksize into the VFS (which I don't like).

But the blocksize-infecting-the-VFS alone isn't too bad.  The real issue is
what is at the end of the page mapping.  Is it a page or a buffer/BH?  We want
pages for two related reasons: higher levels of the kernel's IO systems deal
with pages:
1. mmap().  Good old mmap() requires at least a page of contiguous block
buffers, so that the physical page frame can be mapped directly into a
process's address space.
2. Fast IO.  We want to use page remapping for fast IO.  This means that IO
has to be in PGSIZE chunks - not blocksize chunks.

X.5: Managing Dirty Buffers
--------------------
Many (some/all?) of the block functions will return a BH to describe the
mapping.  One reason for doing this (mentioned above) is to allow the caller
to manage if a buffer is dirty or not.  The tracking of dirty pages is done by
the page cache.  The dirtying is done by the caller; it can simply mark the BH
dirty and forget about it.  The writeback is handled usually by the page cache
or is triggered by an FS event.  Eventually, we'll have a kernel thread that
periodically writes dirty buffers back to disk.  Processes can also do this by
calling fsync.  FSs themselves will trigger syncs of metadata.  This will come
from having dirty SBs and inodes in the VFS.

Note, the issue of whether or not we pin metadata blocks, such as the inode's
indirect blocks (or other related blocks), in the page cache is independent of
all these issues.  If they are not cached / pinned, we would just have to
block and reread the inode's structures back into memory and proceed (for
instance, we'd do this when identifying blocks for a page mapping on a file
read).  The reason we wouldn't want to pin them is to save memory.

X.6: Reference Counting BHs and Pages
--------------------
There are a lot of complications with this, and if there is a good reason,
we'll change this later.

X.6.1: Basics
-----------
So we talk about passing around BHs, both when submitting IO ops and when
returning from things like get_buffer().  However, currently, we do not kref
or reference count BHs in any way.  Instead, we kref pages.  We do this (for
now) for a couple reasons:
1) Pages are the unit of memory management in the kernel.  Higher levels of
the kernel will pin/incref the page (such as in an mmap()).
2) BHs hang off of a page, but exist only to be expressive about the
management of the buffers on the page.  It's not like how a file hangs off a
dentry, where the dentry doesn't know (or care) about the file.
3) We already refcount pages.  While we could do the same for the BHs, it is a
bit redundant.  Any place that would be kreffing the BH can just kref the
whole page.  Code that receives a BH as a return value is actually getting a
page under the covers (though it should use put_buffer() to drop its
reference).

X.6.2: How we refcnt pages in a page mapping
-----------
When a page is in the page cache, we give it one kref.  Whenever a function or
subsystem is using one of these pages (IO, using the data, etc), there needs
to be a kref.  When it is time to remove the page from the page mapping, the
code needs to remove it from the radix tree, then kref_put it.  When the final
user is done with it (ideally, there is none, but we need to handle the case)
the release function will clean up the page - including freeing its BHs.

This does suck in that we could be removing an item from the page cache while
it is being used, violating some LRU policy.  The actual decision to remove it
should use those policies (when possible); the kreffing is simply to avoid
issues from race conditions (a reader starts using a page right before it is
ripped from the mapping).

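The removal protocol, sketched (names illustrative):

/* Pull the page from the tree first, so no new lookups find it, then drop
 * the page map's stored reference.  Users that already hold a ref keep the
 * page alive; the last kref_put() runs the release function, which frees
 * the page and its BHs. */
static void pm_remove_page(struct page_map *pm, struct page *page)
{
	radix_delete(&pm->pm_tree, page->pg_index);
	kref_put(&page->pg_kref);
}
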
X.6.3: More issues with Evictions
-----------
One issue with this is that dirty pages/buffers will need to be written back.
If someone tries to read while the page is removed from the page_mapping, but
before it is written back, they could get an old version if the read happens
before the write.  This is only an issue if the page is dirty.  One option
would be to writeback the page/buffer, then later remove it from the page
cache when it is read.  There's issues with concurrent writers, though if that
happens, we probably don't really want to remove it (it was LRU).  Note this
is an issue regardless of whether or not BHs are refcounted.  Also, this is
not an issue for when we kick a dentry/inode out of the cache - there should
be no one else trying to use it (since their refcnt was 0).  This is just a
concern when the system runs low on memory and wants to reclaim memory
potentially held by caches.

Also note that writeback of pages will happen regardless of eviction plans
(fsync, every n sec, etc).

This refcounting pattern is very similar to unlinking, where you can continue
to access a file once it is removed from the directory.  The difference here
is that future requests will get the same object (the page), unlike in the FS,
where you get ENOENT.  The page mapping is a cache, and you need to get the
same old data that was in there before the eviction.

A final issue is when the VFS aggressively pins blockdev (metadata) buffers.
Ideally, we'd like to be able to expel pages/buffers even if they are
refcnt'd.  The subsystems will always want to keep stuff in RAM.  This
also/especially applies to mmap().  One solution would be to keep them in RAM,
but have the BH keep track of who is holding its reference.  Then we could
unmap the page, which would need to get read back in on its next access.  We'd
need (or ought to have) some sort of callbacks.  This will get solved later
when we deal with unmapping mmap'd files, since it is the same problem - just
with different code and actors.

X.6.4: What about buffers inside pages?
-----------
For a while, I thought about refcounting BHs/buffers.  The issue that drives
it is the buffer cache (block dev page mapping holding metadata blocks of a
FS).  We had been operating on the cache in page-sized chunks, which
erroneously was reading in blocks adjacent to metadata blocks.  This would
have been an issue when we write back pages that have dirty blocks; blocks
were erroneously in the metadata cache and would overwrite potentially
file-related blocks that were in an incoherent cache (the file/inode's page
mapping).

We broke the bdev's buffer cache up into smaller BHs, so that only metadata
blocks get read in, but we eventually will have to get rid of a metadata block
(and not the entire page from the cache) (ex: an inode is removed from the
disk - its indirect blocks need to go, and they could be next to anything on
disk).  The desire for the BH refcnt came from wanting to rip the BHs out of
the list when it was time to evict them from the cache (in case they became
file blocks later).  It isn't clear what the best way to do this is.  Probably
we'd have all users refcnt the BHs, which refcnt the pages.  However, some
users (like mmap and the file page cache) operate in page chunks - so this
would require them to incref and decref the BHs.  Triggering the page reclaim
might also be tricky.  One option would be to just rip a page from the cache,
callback to whoever has it loaded so they fault it back in later, and
sleep-block everyone who interferes with the operation.  Yikes.

Another option was to forget about the BH <-> page crap for the buffer cache,
and just have a standalone buffer cache with BHs as its unit of operation
(instead of a page), and replicate the algorithms/code for the buffer cache.
There is still a notion of BHs in a page, and page_release() / page_free()
would probably have to be a little different since its page isn't really a
PG_BUFFER (but it really is).

Instead, here's what we do: we refcnt at page granularity, since all users can
be considered users of a page.  While it's not fine-grained, it does represent
the idea that someone doesn't want the page to be freed.  It's similar to
having a dentry be the refcnt source when someone really wants the inode/file.
When we want to remove a page from the block buffer cache, all we do is mark
it as not dirty and not up-to-date.  Now, whenever the page gets evicted from
the cache, we only write back those block buffers that are dirty, so the
recently evicted block will not be written back.  If we attempt to read the
block in the future (perhaps it is reallocated as a metablock), then the BH is
still there for the mapping, but is simply not up to date.  The code already
knows how to handle this (since it could happen during a race condition), and
it will simply read the buffer back in.  This new reading is necessary, since
it is possible that the block was used for file IO in between the uses of it
as a metablock.

If there is more than one thread operating on a block buffer in the page
cache, then at the level of the cache, there is a race.  One could be marking
it as not dirty while another is dirtying it, etc.  However, no one should be
removing a buffer (aka, deallocating a block) while it is in use.  This sort
of concurrency problem should be sorted higher in the software stack (like at
the inode).

On a similar note, no one should ever remove a block's buffer from the
metadata buffer cache if it is dirty.  When those removals happen, it means
the block should be dealloced on the block device - meaning no one cares what
happens to it.  It's not meant to have data preserved.

X.7: What about Directories?  Inodes or metadata?
--------------------
Directories have inodes that have blocks scattered around the disk.  The blocks
making up the body of a directory are not sequential, unlike when you read
blocks from a bdev.  If you wanted a "page" of a directory, then you'd need to
use the i_mapping (page cache of a file) to access it, like any other file.

However, we don't want a page of a directory - at least not yet.  The main
reason we can get away with this is due to the lack of a mmap() or a desire to
page-remap for directory-contents IO.  It's all for the kernel's internal use.
At no point does anyone call generic_file_read() or _write() on it.

That being said, we treat the blocks of a directory as metadata blocks.  We do
figure out which blocks they are by walking the inode (also made of metadata
blocks), but don't bother to set up a page map for the directory itself.  We
just use the inode to figure out which metadata blocks we want, then read them
in (out of the blockdev's page cache).

Note that userspace issues reads on the directory.  This is mostly a convenience
thing (or an inconvenience thing), which just ends up being a wrapper around
readdir() (via generic_dir_read()).