processes.txt
Barret Rhoden

All things processes! This explains processes from a high level, especially focusing on the user-kernel boundary and transitions to the many-core state, which is the way in which parallel processes run. This doesn't discuss deep details of the ROS kernel's process code.

This is motivated by two things: kernel scalability and direct support for parallel applications.

Part 1: Overview
Part 2: How They Work
Part 3: Resource Requests
Part 4: Preemption and Notification
Part 5: Old Arguments (mostly for archival purposes)
Part 6: Parlab app use cases

Revision History:
2009-10-30 - Initial version
2010-03-04 - Preemption/Notification, changed to many-core processes

Part 1: World View of Processes
==================================
A process is the lowest level of control, protection, and organization in the kernel.

1.1: What's a process?
-------------------------------
Features:
- They are an executing instance of a program. A program can load multiple other chunks of code and run them (libraries), but they are written to work with each other, within the same address space, and are in essence one entity.
- They have one address space / protection domain.
- They run in Ring 3 / Usermode.
- They can interact with each other, subject to permissions enforced by the kernel.
- They can make requests from the kernel, for things like resource guarantees. They have a list of resources that are given/leased to them.

None of these are new. Here's what's new:
- They can run in a many-core mode, where their cores run at the same time, and they are aware of changes to these conditions (page faults, preemptions). They can still request more resources (cores, memory, whatever).
- Every core in a many-core process (MCP) is *not* backed by a kernel thread/kernel stack, unlike with Linux tasks.
- There are *no* per-core run-queues in the kernel that decide for themselves which kernel thread to run.
- They are not fork()/execed().
  They are created(), and then later made runnable. This allows the controlling process (parent) to do whatever it wants: pass file descriptors, give resources, whatever.

These changes are directly motivated by what current SMP operating systems lack as we move towards many-core: direct (first class) support for truly parallel processes, kernel scalability, and an ability of a process to see through classic abstractions (the virtual processor) to understand (and make requests about) the underlying state of the machine.

1.2: What's a partition?
-------------------------------
So a process can make resource requests, but some part of the system needs to decide what to grant, when to grant it, etc. This goes by several names: scheduler / resource allocator / resource manager. The scheduler simply says when you get some resources, then calls functions from lower parts of the kernel to make it happen.

This is where the partitioning of resources comes in. In the simple case (one process per partitioned block of resources), the scheduler just finds a slot and runs the process, giving it its resources.

A big distinction is that the *partitioning* of resources only makes sense from the scheduler on up in the stack (towards userspace). The lower levels of the kernel know about resources that are granted to a process. The partitioning is about the accounting of resources and an interface for adjusting their allocation. It is a method for telling the 'scheduler' how you want resources to be granted to processes.

A possible interface for this is procfs, which has a nice hierarchy. Processes can be grouped together, and resources can be granted to them. Who does this? A process can create its own directory entry (a partition), and move anyone it controls (parent of, though that's not necessary) into its partition or a sub-partition.
Likewise, a sysadmin/user can simply move PIDs around in the tree, creating partitions consisting of processes completely unaware of each other.

Now you can say things like "give 25% of the system's resources to apache and mysql". They don't need to know about each other. If you want finer-grained control, you can create subdirectories (subpartitions), and give resources on a per-process basis. This is back to the simple case of one process for one (sub)partition.

This is all influenced by Linux's cgroups (process control groups): http://www.mjmwired.net/kernel/Documentation/cgroups.txt
They group processes together, and allow subsystems to attach meaning to those groups.

Ultimately, I view partitioning as something that tells the kernel how to grant resources. It's an abstraction presented to userspace and higher levels of the kernel. The specifics still need to be worked out, but by separating them from the process abstraction, we can work it out and try a variety of approaches.

The actual granting of resources and enforcement is done by the lower levels of the kernel (or by hardware, depending on future architectural changes).

Part 2: How They Work
===============================
2.1: States
-------------------------------
PROC_CREATED
PROC_RUNNABLE_S
PROC_RUNNING_S
PROC_WAITING
PROC_DYING
PROC_DYING_ABORT
PROC_RUNNABLE_M
PROC_RUNNING_M

Difference between the _M and the _S states:
- _S : legacy process mode. There is no need for a second-level scheduler, and the code running is analogous to a user-level thread.
- RUNNING_M implies *guaranteed* core(s). You can be a single core in the RUNNING_M state. The guarantee is subject to time slicing, but when you run, you get all of your cores.
- The time slicing is at a coarser granularity for _M states. This means that when you run an _S on a core, it should be interrupted/time sliced more often, which also means the core should be classified differently for a while. Possibly even using its local APIC timer.
- A process in an _M state will be informed about changes to its state, e.g., will have a handler run in the event of a page fault.

For more details, check out kern/inc/process.h. For valid transitions between these, check out kern/src/process.c's proc_set_state().

2.2: Creation and Running
-------------------------------
Unlike the fork-exec model, processes are created, and then explicitly made runnable. In the time between creation and running, the parent (or another controlling process) can do whatever it wants with the child, such as pass specific file descriptors or map shared memory regions (which can be used to pass arguments).

New processes are not a copy-on-write version of the parent's address space. Due to our changes in the threading model, we no longer need (or want) this behavior left over from the fork-exec model.

By splitting the creation from the running and by explicitly sharing state between processes (like inherited file descriptors), we avoid a lot of concurrency and security issues.

2.3: Vcoreid vs Pcoreid
-------------------------------
The vcoreid is a virtual cpu number. Its purpose is to provide an easy way for the kernel and userspace to talk about the same core. pcoreid (physical) would also work. The vcoreid makes things a little easier, such as when a process wants to refer to one of its other cores (not the calling core). It also makes the event notification mechanisms easier to specify and maintain.

Processes that care about locality should check what their pcoreid is. This is currently done via sys_getcpuid(). The name will probably change.

2.4: Transitioning to and from states
-------------------------------
2.4.1: To go from _S to _M, a process requests cores.
--------------
A resource request that raises the number of cores from 0 to 1 or more causes a transition from _S to _M. The calling context is saved in the uthread slot (uthread_ctx) in vcore0's preemption data (in procdata).
The second level scheduler needs to be able to restart the context when vcore0 starts up. To do this, it will need to save the TLS/TCB descriptor and the floating point/silly state (if applicable) in the user-thread control block, and do whatever is needed to signal vcore0 to run the _S context when it starts up. One way would be to mark vcore0's "active thread" variable to point to the _S thread. When vcore0 starts up at _start/vcore_entry() (like all vcores), it will see a thread was running there and restart it. The kernel will migrate the _S thread's silly state (FP) to the new pcore, so that it looks like the process was simply running the _S thread and got notified. Odds are, it will want to just restart that thread, but the kernel won't assume that (hence the notification).

In general, all cores (and all subsequently allocated cores) start at the elf entry point, with the vcoreid in eax or passed in a suitable arch-specific manner. There is also a syscall to get the vcoreid, but passing it this way saves an extra trap at vcore start time.

Future proc_runs(), like from RUNNABLE_M to RUNNING_M, start all cores at the entry point, including vcore0. The saving of a _S context to vcore0's uthread_ctx only happens on the transition from _S to _M (which the process needs to be aware of for a variety of reasons). This also means that userspace needs to handle vcore0 coming up at the entry point again (and not starting the program over). This is currently done in sysdeps-ros/start.c, via the static variable init. Note there are some tricky things involving dynamically linked programs, but it all works currently.

When coming in to the entry point, whether as the result of a startcore or a notification, the kernel will set the stack pointer to whatever is requested by userspace in procdata. A process should allocate stacks of whatever size it wants for its vcores when it is in _S mode, and write these locations to procdata.
These stacks are the transition stacks (in Lithe terms) that are used as jumping-off points for future function calls. These stacks need to be used in a continuation-passing style, and each time they are used, they start from the top.

2.4.2: To go from _M to _S, a process requests 0 cores
--------------
The caller becomes the new _S context. Everyone else gets trashed (abandon_core()). Their stacks are still allocated and it is up to userspace to deal with this. In general, they will regrab their transition stacks when they come back up. Their other stacks and whatnot (like TBB threads) need to be dealt with.

When the caller next switches to _M, that context (including its stack) maintains its old vcore identity. If vcore3 causes the switch to _S mode, it ought to remain vcore3 (lots of things get broken otherwise).

As of March 2010, the code does not reflect this. Don't rely on anything in this section for the time being.

2.4.3: Requesting more cores while in _M
--------------
Any core can request more cores and adjust the resource allocation in any way. These new cores come up just like the original new cores in the transition from _S to _M: at the entry point.

2.4.4: Yielding
--------------
sys_yield()/proc_yield() will give up the calling core, and may or may not adjust the desired number of cores, subject to its parameters. Yield is performing two tasks, both of which result in giving up the core. One is for not wanting the core anymore. The other is in response to a preemption. Yield may not be called remotely (ARSC).

In _S mode, it will transition from RUNNING_S to RUNNABLE_S. The context is saved in scp_ctx.

In _M mode, this yields the calling core. A yield will *not* transition from _M to _S. The kernel will rip it out of your vcore list. A process can yield its cores in any order. The kernel will "fill in the holes of the vcoremap" for any future new cores requested (e.g., proc A has 4 vcores, yields vcore2, and then asks for another vcore.
The new one will be vcore2). When any core starts in _M mode, even after a yield, it will come back at the vcore_entry()/_start point.

Yield will normally adjust your desired amount of vcores to the amount after the calling core is taken. This is the way a process gives its cores back.

Yield can also be used to say the process is just giving up the core in response to a pending preemption, but actually wants the core and does not want resource requests to be readjusted. For example, in the event of a preemption notification, a process may yield (ought to!) so that the kernel does not need to waste effort with full preemption. This is done by passing in a bool (being_nice), which signals the kernel that it is in response to a preemption. The kernel will not readjust the amt_wanted, and if there is no preemption pending, the kernel will ignore the yield.

There may be an m_yield(), which will yield all or some of the cores of an MCP, remotely. This is discussed a bit further down. It's not clear what exactly its purpose would be.

We also haven't addressed other reasons to yield, or more specifically to wait, such as for an interrupt or an event of some sort.

2.4.5: Others
--------------
There are other transitions, mostly self-explanatory. We don't currently use any WAITING states, since we have nothing to block on yet. DYING is a state when the kernel is trying to kill your process, which can take a little while to clean up.

Part 3: Resource Requests
===============================
A process can ask for resources from the kernel. The kernel either grants these requests or not, subject to QoS guarantees, or other scheduler-related criteria.

A process requests resources, currently via sys_resource_req. The form of a request is to tell the kernel how much of a resource it wants. Currently, this is the amt_wanted. We'll also have a minimum amount wanted, which tells the scheduler not to run the process until the minimum amount of resources is available.
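
To make the wanted/minimum idea concrete, here is a minimal sketch of the bookkeeping, not the real sys_resource_req path. The struct, the amt_wanted_min field name, and toy_grant() are all hypothetical names made up for illustration; only amt_wanted and amt_granted follow the naming used in this document, and the policy shown (grant nothing until the minimum can be met) is just one plausible reading.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-resource request record.  Only amt_wanted and
 * amt_granted mirror names from this document; amt_wanted_min is an
 * assumed name for the "minimum amount wanted". */
struct resource_req {
	size_t amt_wanted;     /* how much the process asks for */
	size_t amt_wanted_min; /* do not run the process with less than this */
	size_t amt_granted;    /* how much the process currently holds */
};

/* Toy grant policy (an assumption, not the kernel's scheduler): hand out
 * up to 'avail' free units, but grant nothing if the process would still
 * be below its minimum.  Returns the number of units granted. */
size_t toy_grant(struct resource_req *r, size_t avail)
{
	assert(r != NULL);
	size_t need = r->amt_wanted > r->amt_granted ?
	              r->amt_wanted - r->amt_granted : 0;
	size_t give = need < avail ? need : avail;

	if (r->amt_granted + give < r->amt_wanted_min)
		return 0; /* minimum not reachable yet: keep waiting */
	r->amt_granted += give;
	return give;
}
```

With amt_wanted = 8 and a minimum of 4, a call with only 2 free units grants nothing, while a later call with 5 free units grants all 5.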

How the kernel actually grants resources is resource-specific. In general, there are functions like proc_give_cores() (which gives certain cores to a process) that actually do the allocation, as well as adjusting the amt_granted for that resource.

For expressing QoS guarantees, we'll probably use something like procfs (as mentioned above) to explicitly tell the scheduler/resource manager what the user/sysadmin wants. An interface like this ought to be usable both by programs as well as simple filesystem tools (cat, etc).

Guarantees exist regardless of whether or not the allocation has happened. An example of this is when a process may be guaranteed to use 8 cores, but currently only needs 2. Whenever it asks for up to 8 cores, it will get them. The exact nature of the guarantee is TBD, but there will be some sort of latency involved in the guarantee for systems that want to take advantage of idle resources (compared to simply reserving and not allowing anyone else to use them). A latency of 0 would mean a process wants it instantly, which probably means they ought to be already allocated (and billed to) that process.

Part 4: Preemption and Event Notification
===============================
Preemption and Notification are tied together. Preemption is when the kernel takes a resource (specifically, cores). There are two types: core_preempt() (one core) and gang_preempt() (all cores). Notification (discussed below) is when the kernel informs a process of an event, usually referring to the act of running a function on a core (active notification).

The rough plan for preemption is to notify beforehand, then take action if userspace doesn't yield. This is a notification a process can ignore, though it is highly recommended to at least be aware of impending core_preempt() events.

4.1: Notification Basics
-------------------------------
One of the philosophical goals of ROS is to expose information up to userspace (and allow requests based on that information).
There will be a variety of events in the system that processes will want to know about. To handle this, we'll eventually build something like the following.

All events will have a number, like an interrupt vector. Each process will have an event queue (per core, described below). On most architectures, it will be a simple producer-consumer ring buffer sitting in the "shared memory" procdata region (shared between the kernel and userspace). The kernel writes a message into the buffer with the event number and some other helpful information.

Additionally, the process may request to be actively notified of specific events. This is done by having the process write into an event vector table (like an IDT) in procdata. For each event, the process writes the vcoreid it wants to be notified on.

4.2: Notification Specifics
-------------------------------
In procdata there is an array of per-vcore data, holding some preempt/notification information and space for two trapframes: one for notification and one for preemption.

4.2.1: Overall
-----------------------------
When a notification arrives to a process under normal circumstances, the kernel places the previous running context in the notification trapframe, and returns to userspace at the program entry point (the elf entry point) on the transition stack. If a process is already handling a notification on that core, the kernel will not interrupt it. It is the process's responsibility to check for more notifications before returning to its normal work. The process must also unmask notifications (in procdata) before it returns to do normal work. Masking notifications is the signal to the kernel to not bother sending IPIs, and if an IPI is sent before notifications are masked, then the kernel will double-check this flag to make sure interrupts should have arrived.

Notification unmasking is done by clearing the notif_disabled flag (similar to turning interrupts on in hardware).
When a core starts up, this flag is on, meaning that notifications are disabled by default. It is the process's responsibility to turn on notifications for a given vcore.

4.2.2: Notif Event Details
-----------------------------
When the process runs the handler, it is actually starting up at the same location in code as it always does. To determine if it was a notification or not, simply check the queue and bitmask. This has the added benefit of allowing a process to notice notifications that it missed previously, or notifs it wanted without active notification (IPI). If we want to bypass this check by having a magic register signal, we can add that later. Additionally, the kernel will mask notifications (much like an x86 interrupt gate). It will also mask notifications when starting a core with a fresh trapframe, since the process will be executing on its transition stack. The process must check its per-core event queue to see why it was called, and deal with all of the events on the queue. In the case where the event queue overflows, the kernel will increment a counter so the process can at least be aware things are missed. At the very least, the process will see the notification marked in a bitmask.

These notification events include things such as: an IO is complete, a preemption is pending to this core, the process just returned from a preemption, there was a trap (divide by 0, page fault), and many other things. We plan to allow this list to grow at runtime (a process can request new event notification types). These messages will often need some form of a timestamp, especially ones that will expire in meaning (such as a preempt_pending).

Note that only one notification can be active at a time, including a fault. This means that if a process page faults or something while notifications are masked, the process will simply be killed. It is up to the process to make sure the appropriate pages are pinned, which it should do before entering _M mode.
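
As a rough illustration of the per-core queue described above, here is a single-threaded sketch of a fixed-size ring with an overflow counter and a fallback bitmask. Everything here is an assumption made for the example: the struct layout, the names (event_queue, post_event, get_event), and the sizes. A real kernel/user shared ring would also need memory barriers and must never trust indices that userspace can write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_EVENTS   64 /* assumed: one bit per event type in the bitmask */
#define QUEUE_SLOTS 8  /* assumed ring size; must be a power of two */

struct event_msg {
	uint16_t ev_type; /* event number, like an interrupt vector */
	uint64_t arg;     /* "other helpful information" */
};

/* Hypothetical per-vcore queue that would live in procdata. */
struct event_queue {
	struct event_msg ring[QUEUE_SLOTS];
	unsigned int prod;      /* advanced by the producer (kernel) */
	unsigned int cons;      /* advanced by the consumer (process) */
	unsigned int overflows; /* how many messages were dropped */
	uint64_t bitmask;       /* per-type "something happened" bits */
};

/* Producer side: returns true if the full message was queued.  On
 * overflow the parameters are lost, but the counter and bit remain. */
bool post_event(struct event_queue *q, uint16_t type, uint64_t arg)
{
	assert(type < NR_EVENTS);
	if (q->prod - q->cons == QUEUE_SLOTS) {
		q->overflows++;
		q->bitmask |= 1ULL << type;
		return false;
	}
	q->ring[q->prod % QUEUE_SLOTS] =
		(struct event_msg){ .ev_type = type, .arg = arg };
	q->prod++;
	return true;
}

/* Consumer side: pops the oldest message, or returns false if empty. */
bool get_event(struct event_queue *q, struct event_msg *out)
{
	if (q->cons == q->prod)
		return false;
	*out = q->ring[q->cons % QUEUE_SLOTS];
	q->cons++;
	return true;
}
```

A handler would loop on get_event() until it returns false, then scan the bitmask for anything that overflowed or never had a message.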

4.2.3: Event Overflow and Non-Messages
-----------------------------
For missed/overflowed events, and for events that do not need messages (they have no parameters and multiple notifications are irrelevant), the kernel will toggle that event's bit in a bitmask. For the events that don't want messages, we may have a flag that userspace sets, meaning they just want to know it happened. This might be too much of a pain, so we'll see. For notification events that overflowed the queue, the parameters will be lost, but hopefully the application can sort it out. Again, we'll see. A specific notif_event should not appear in both the event buffers and in the bitmask.

It does not make sense for all events to have messages. For others, it does not make sense to specify a different core on which to run the handler (e.g. page faults). The notification methods that the process expresses via procdata are suggestions to the kernel. When they don't make sense, they will be ignored. Some notifications might be unserviceable without messages. A process needs to have a fallback mechanism. For example, they can read the vcoremap to see who was lost, or they can restart a thread to cause it to page fault again.

Event overflow sucks - it leads to a bunch of complications. Ultimately, what we really want is a limitless amount of notification messages (per core), as well as a limitless amount of notification types. And we want these to be relayed to userspace without trapping into the kernel.

We could do this if we had a way to dynamically manage memory in procdata, with a distrusted process on one side of the relationship. We could imagine growing procdata dynamically (we plan to, mostly to grow the preempt_data struct as we request more vcores), and then run some sort of heap manager / malloc. Things get very tricky since the kernel should never follow pointers that userspace can touch. Additionally, whatever memory management we use becomes a part of the kernel interface.

Even if we had that, dynamic notification *types* are tricky - they are identified by a number, not by a specific (list) element.

For now, this all seems like an unnecessary pain in the ass. We might adjust it in the future if we come up with clean, clever ways to deal with the problem, which we aren't even sure is a problem yet.

4.2.4: How to Use and Leave a Transition Stack
-----------------------------
We considered having the kernel be aware of a process's transition stacks and sizes so that it can detect if a vcore is in a notification handler based on the stack pointer in the trapframe when a trap or interrupt fires. While cool, the flag for notif_disabled is much easier and just as capable. Userspace needs to be aware of various races, and only enable notifications when it is ready to have its transition stack clobbered. This means that when switching from big user-thread to user-thread, the process should temporarily disable notifications and reenable them before starting the new thread fully. This is analogous to having a kernel that disables interrupts while in process context.

A process can fake not being on its transition stack, and can even unmap its stack. At worst, a vcore could recursively page fault (the kernel does not know it is in a handler, if it keeps enabling notifs before faulting), and that would continue until the core is forcibly preempted. This is not an issue for the kernel.

When a process wants to use its transition stack, it ought to check preempt_pending, mask notifications, jump to its transition stack, and do its work (e.g. process notifications, check for new notifications, schedule a new thread), periodically checking for a pending preemption and making sure the notification queue/list is empty before moving back to real code. Then it should jump back to a real stack, unmask notifications, and jump to the newly scheduled thread.

This can be really tricky.
When userspace is changing threads, it will need to unmask notifs as well as jump to the new thread. There is a slight race here, but it is okay. The race is that an IPI can arrive after notifs are unmasked, but before returning to the real user thread. Then the code will think the uthread_ctx represents the new user thread, even though it hasn't started (and the PC is wrong). The trick is to make sure that all state required to start the new thread, as well as future instructions, are all saved within the "stuff" that gets saved in the uthread_ctx. When these threading packages change contexts, they ought to push the PC on the stack of the new thread (then enable notifs), and then execute a return. If an IPI arrives before the "function return", then when that context gets restarted, it will run the "return" with the appropriate value on the stack still.

There is a further complication. The kernel can send an IPI that the process wanted, but the vcore did not get truly interrupted since its notifs were disabled. There is a race between checking the queue/bitmask and then enabling notifications. The way we deal with it is that the kernel posts the message/bit, then sets notif_pending. Then it sends the IPI, which may or may not be received (based on notif_disabled). (Actually, the kernel only ought to send the IPI if notif_pending was 0 (atomically) and notif_disabled is 0). When leaving the transition stack, userspace should clear the notif_pending, then check the queue, do whatever, and then try to pop the tf. When popping the tf, after enabling notifications, check notif_pending. If it is still clear, return without fear of missing a notif. If it is not clear, it needs to manually notify itself (sys_self_notify) so that it can process the notification that it missed and for which it wanted to receive an IPI. Before it does this, it needs to clear notif_pending, so the kernel will send it an IPI. These last parts are handled in pop_user_ctx().
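
The notif_pending / notif_disabled handshake above can be sketched with C11 atomics. This is only a model of the flag protocol under stated assumptions, not the code in pop_user_ctx(): the struct and both function names are hypothetical, the IPI and the sys_self_notify call are represented by return values, and posting the message/bit is assumed to have happened already.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-vcore flags, standing in for fields in procdata. */
struct vcore_flags {
	atomic_bool notif_disabled; /* true: vcore is not taking notifs */
	atomic_bool notif_pending;  /* true: kernel posted something new */
};

/* Kernel side, called after the message/bit has been posted.  Returns
 * true if an IPI should be sent: only when notif_pending was previously
 * clear (checked atomically) and notifications are enabled. */
bool kernel_post_notif(struct vcore_flags *vc)
{
	bool was_pending = atomic_exchange(&vc->notif_pending, true);
	return !was_pending && !atomic_load(&vc->notif_disabled);
}

/* User side, the last step when leaving the transition stack.  The
 * caller is assumed to have already cleared notif_pending and drained
 * its queue.  Returns true if a notif slipped in during the race window,
 * in which case the caller would notify itself (sys_self_notify in the
 * text) instead of returning to the thread. */
bool user_enable_and_check(struct vcore_flags *vc)
{
	atomic_store(&vc->notif_disabled, false); /* enable notifs */
	if (!atomic_load(&vc->notif_pending))
		return false; /* nothing missed, safe to return */
	/* Clear the flag first so the kernel will send an IPI again. */
	atomic_store(&vc->notif_pending, false);
	return true;
}
```

In this model, a notif posted while notif_disabled is set produces no IPI, and the later user_enable_and_check() call is what catches it.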

4.3: Preemption Specifics
-------------------------------
There's an issue with a preempted vcore getting restarted while a remote core tries to restart that context. They resolve this fight with a variety of VC flags (VC_UTHREAD_STEALING). Check out handle_preempt() in uthread.c.

4.4: Other trickiness
-------------------------------
Take all of these with a grain of salt - it's quite old.

4.4.1: Preemption -> deadlock
-------------------------------
One issue is that a context can be holding a lock that is necessary for the userspace scheduler to manage preempted threads, and this context can be preempted. This would deadlock the scheduler. To help a process avoid locking itself up, the kernel will toggle a preempt_pending flag in procdata for that vcore before sending the actual preemption. Whenever the scheduler is grabbing one of these critical spinlocks, it needs to check that flag first, and yield if a preemption is coming in.

Another option we may implement is for the process to be able to signal to the kernel that it is in one of these ultra-critical sections by writing a magic value to a specific register in the trapframe. If the kernel sees this, it will allow the process to run for a little longer. The issue with this is that the kernel would need to assume processes will always do this (malicious ones will) and add this extra wait time to the worst case preemption time.

Finally, a scheduler could try to use non-blocking synchronization (no spinlocks), or one of our other long-term research synchronization methods to avoid deadlock, though we realize this is a pain for userspace for now. FWIW, there are some OSs out there with only non-blocking synchronization (I think).

4.4.2: Cascading and overflow
-------------------------------
There used to be issues with cascading interrupts (when contexts are still running handlers). Imagine a pagefault, followed by preempting the handler. It doesn't make sense to run the preempt context after the page fault.
Earlier designs had issues where it was hard for a vcore to determine the order of events and to unmix preemption, notification, and faults. We deal with this by having separate slots for preemption and notification, and by treating faults as another form of notification. Faulting while handling a notification just leads to death. Perhaps there is a better way to do that.

Another thing we considered would be to have two stacks - transition for notification and an exception stack for faults. We'd also need a fault slot for the faulting trapframe. This begins to take up even more memory, and it is not clear how to handle mixed faults and notifications. If you fault while on the notification slot, then fine. But you could fault for other reasons, and then receive a notification. And then if you fault in that handler, we're back to where we started - might as well just kill them.

Another issue was overload. Consider if vcore0 is set up to receive all events. If events come in faster than it can process them, it will both nest too deep and process out of order. To handle this, we only notify once, and will not send future active notifications / interrupts until the process issues an "end of interrupt" (EOI) for that vcore. This is modelled after hardware interrupts (on x86, at least).

4.4.3: Restarting a Preempted Notification
-------------------------------
Nowadays, to restart a preempted notification, you just restart the vcore. The kernel does this either when it gives the process more cores or when userspace asks it to with a sys_change_vcore().

4.4.4: Userspace Yield Races
-------------------------------
Imagine a vcore realizes it is getting preempted soon, so it starts to yield. However, it is too slow and doesn't make it into the kernel before a preempt message takes over. When that vcore is run again, it will continue where it left off and yield its core. The desired outcome is for yield to fail, since the process doesn't really want to yield that core.
To sort this out, yield will take a parameter saying that the yield is in response to a pending preemption. If the phase is over (preempted and returned), the call will not yield and simply return to userspace.

4.4.5: Userspace m_yield
-------------------------------
There are a variety of ways to implement an m_yield (yield the entire MCP). We could have a "no niceness" yield - just immediately preempt, but there is a danger of the locking business. We could do the usual delay game, though if userspace is requesting its yield, arguably we don't need to give warning.

Another approach would be to not have an explicit m_yield call. Instead, we can provide a notify_all call, where the notification sent to every vcore is to yield. I imagine we'll have a notify_all (or rather, flags to the notify call) anyway, so we can do this for now.

The fastest way will probably be the no niceness way. One way to make this work would be for vcore0 to hold all of the low-level locks (from 4.4.1) and manually unlock them when it wakes up. Yikes!

4.5: Random Other Stuff
-------------------------------
Pre-Notification issues: how much time does userspace need to clean up and yield? How quickly does the kernel need the core back (for scheduling reasons)?

Part 5: Old Arguments about Processes vs Partitions
===============================
This is based on my interpretation of the cell (formerly what I thought was called a partition).

5.1: Program vs OS
-------------------------------
A big difference is what runs inside the object. I think trying to support OS-like functionality is a quick path to unnecessary layers and complexity, esp for the common case. This leads to discussions of physical memory management, spawning new programs, virtualizing HW, shadow page tables, exporting protection rings, etc.

This unnecessarily brings in the baggage and complexity of supporting VMs, which are a special case.
Yes, we want processes to be able to use their resources, but I'd rather approach this from the perspective of "what do they need?" than "how can we make it look like a real machine." Virtual machines are cool, and paravirtualization influenced a lot of my ideas, but they have their place and I don't think this is it.

For example, exporting direct control of physical pages is a bad idea. I wasn't clear if anyone was advocating this or not. By exposing actual machine physical frames, we lose our ability to do all sorts of things (like swapping, for all practical uses, and other VM tricks). If the cell/process thinks it is manipulating physical pages, but really isn't, we're in the VM situation of managing nested or shadow page tables, which we don't want.

For memory, we'd be better off giving an allocation of a quantity of frames, not specific frames. A process can pin up to X pages, for instance. It can also pick pages to be evicted when there's memory pressure. There are already similar ideas out there, both in POSIX and in ACPM.

Instead of mucking with faking multiple programs / entities within a cell, just make more processes. Otherwise, you'd have to export weird controls that the kernel is doing anyway (and can do better!), and have complicated middle layers.

5.2: Multiple "Things" in a "partition"
-------------------------------
In the process-world, the kernel can make a distinction between different entities that are using a block of resources. Yes, "you" can still do whatever you want with your resources. But the kernel directly supports useful controls that you want.
- Multiple protection domains are no problem. They are just multiple processes. Resource allocation is a separate topic.
- Processes can control one another, based on a rational set of rules. Even if you have just cells, we still need them to be able to control one another (it's a sysadmin thing).

"What happens in a cell, stays in a cell." What does this really mean?
If it's about resource allocation and passing of resources around, we can do
that with process groups.  If it's about the kernel not caring about what
code runs inside a protection domain, a process provides that.  If it's about
a "parent" program trying to control/kill/whatever a "child" (even if it's
within a cell, in the cell model), you *want* the kernel to be involved.  The
kernel is the one that can do protection between entities.

5.3: Other Things
-------------------------------
Let the kernel do what it's made to do, and is in the best position to do:
manage protection and low-level resources.

Both processes and partitions "have" resources.  They are at different levels
in the system.  A process actually gets to use the resources.  A partition is
a collection of resources allocated to one or more processes.

In response to this:

On 2009-09-15 at 22:33 John Kubiatowicz wrote:
> John Shalf wrote:
> >
> > Anyhow, Barret is asking that resource requirements attributes be
> > assigned on a process basis rather than partition basis.  We need
> > to justify why gang scheduling of a partition and resource
> > management should be linked.

I want a process to be aware of its specific resources, as well as the other
members of its partition.  An individual process (which is gang-scheduled in
many-core mode) has a specific list of resources.  It's just that the overall
'partition of system resources' is separate from the list of specific
resources of a process, simply because there can be many processes under the
same partition (collection of resources allocated).

>
> Simplicity!
>
> Yes, we can allow lots of options, but at the end of the day, the
> simplest model that does what we need is likely the best. I don't
> want us to hack together a frankenscheduler.

My view is also simple in the case of one address space/process per
'partition.'
Extending it to multiple address spaces is simply asking that resources be
shared between processes, but without design details that I imagine will be
brutally complicated in the Cell model.

Part 6: Use Cases
===============================
6.1: Matrix Multiply / Trusting Many-core app
-------------------------------
The process is created by something (bash, for instance).  Its parent makes
it runnable.  The process requests a bunch of cores and RAM.  The scheduler
decides to give it a certain amount of resources, which creates its partition
(aka, chunk of resources granted to its process group, of which it is the
only member).  The sysadmin can tweak this allocation via procfs.

The process runs on its cores in its many-core mode.  It is gang scheduled,
and knows how many cores there are.  When the kernel starts the process on
its extra cores, it passes control to a known spot in code (the ELF entry
point), with the virtual core id passed as a parameter.

The code runs from a single binary image, eventually with shared
object/library support.  Its view of memory is a virtual address space, but
it can also see its own page tables to see which pages are really resident
(similar to mincore() on Linux and the BSDs).

When it comes time to lose a core, or be completely preempted, the process is
notified by the OS running a handler of the process's choosing (in
userspace).  The process can choose what to do (pick a core to yield, prepare
to be preempted, etc).

To deal with memory, the process is notified when it page faults, and keeps
its core.  The process can pin pages in memory.  If there is memory pressure,
the process can tell the kernel which pages to unmap.

This is the simple case.

6.2: Browser
-------------------------------
In this case, a process wants to create multiple protection domains that
share the same pool of resources.  Or rather, that share its own allocated
resources.

The browser process is created, as above.  It creates, but does not run, its
untrusted children.
The kernel will have a variety of ways a process can "mess with" a process it
controls.  So for this untrusted child, the parent can pass (for example) a
file descriptor of what to render, and "sandbox" that process (only allow a
whitelist of syscalls, e.g. it can only read and write descriptors it has).
You can't do this easily in the cell model.

The parent can also set up a shared memory mapping / channel with the child.

For resources, the parent can put the child in a subdirectory/subpartition
and give a portion of its resources to that subpartition.  The scheduler will
ensure that both the parent and the child are run at the same time, and will
give the child process the resources specified (cores, RAM, etc).

After this setup, the parent will then make the child "runnable".  This is
why we want to separate the creation from the runnability of a process, which
we can't do with the fork/exec model.

The parent can later kill the child if it wants, reallocate the resources in
the partition (perhaps to another process rendering a more important page),
preempt that process, whatever.

6.3: SMP Virtual Machines
-------------------------------
The main issue (regardless of paravirt or full virt) is that what's running
on the cores may or may not trust one another.  One solution is to run each
VM-core in its own process (similar to Linux's KVM, which uses N tasks (part
of one process) for an N-way SMP VM).  The processes set up the appropriate
shared memory mapping between themselves early on.  Another approach would be
to allow a many-cored process to install specific address spaces on each
core, and interpose on syscalls, privileged instructions, and page faults.
This sounds very much like the Cell approach, which may be fine for a VM, but
not for the general case of a process.

Or with a paravirtualized SMP guest, you could (similar to the L4Linux way)
make any Guest OS processes actual processes in our OS.
The resource allocation to the Guest OS partition would be managed by the
parent process of the group (which would be running the Guest OS kernel).  We
still need to play tricks with syscall redirection.

For full virtualization, we'd need to make use of hardware virtualization
instructions.  Dealing with the VMEXITs, emulation, and other things is a
real pain, but it has already been done.  The long range plan was to wait
until the http://v3vee.org/ project supported Intel's instructions and
eventually incorporate that.

All of these ways involve subtle and not-so-subtle difficulties.  The
Cell-as-OS mode will have to deal with them for the common case, which seems
brutal.  And rather unnecessary.
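
Sketch: create-then-run (refers back to 6.2)
-------------------------------
As a rough illustration of why 6.2 wants creation separated from runnability,
here is a small userspace model of the flow: create the child, sandbox it,
give it part of the parent's cores, and only then make it runnable.  Every
name here (struct proc, proc_create, proc_allow_syscall, proc_give_cores,
proc_run, proc_kill) is hypothetical and made up for this sketch; none of it
is the real kernel interface.  The rule that the whitelist can only be built
before the child runs is also an assumption, not something decided above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the real kernel interface. */
enum proc_state { PROC_CREATED, PROC_RUNNABLE, PROC_DYING };

#define MAX_ALLOWED 8

struct proc {
    enum proc_state state;
    int render_fd;               /* descriptor handed over by the parent */
    int allowed[MAX_ALLOWED];    /* syscall whitelist (the "sandbox") */
    size_t nr_allowed;
    unsigned int cores;          /* cores currently held */
};

/* Create a process that exists but is not yet eligible to run. */
static void proc_create(struct proc *p, unsigned int cores)
{
    assert(p);
    p->state = PROC_CREATED;
    p->render_fd = -1;
    p->nr_allowed = 0;
    p->cores = cores;
}

/* Add one syscall number to the whitelist.  Assumption: the whitelist can
 * only be built before the process is made runnable. */
static bool proc_allow_syscall(struct proc *p, int sysc)
{
    if (p->state != PROC_CREATED || p->nr_allowed == MAX_ALLOWED)
        return false;
    p->allowed[p->nr_allowed++] = sysc;
    return true;
}

/* Move n cores from the parent's share into the child's subpartition. */
static bool proc_give_cores(struct proc *parent, struct proc *child,
                            unsigned int n)
{
    assert(parent && child);
    if (child->state == PROC_DYING || parent->cores < n)
        return false;
    parent->cores -= n;
    child->cores += n;
    return true;
}

/* The only transition out of CREATED: the parent makes the child runnable. */
static bool proc_run(struct proc *p)
{
    if (p->state != PROC_CREATED)
        return false;
    p->state = PROC_RUNNABLE;
    return true;
}

/* Would a sandboxed process be allowed to issue syscall number sysc? */
static bool proc_may_syscall(const struct proc *p, int sysc)
{
    for (size_t i = 0; i < p->nr_allowed; i++)
        if (p->allowed[i] == sysc)
            return true;
    return false;
}

/* Kill the child and hand its cores back to the parent. */
static void proc_kill(struct proc *parent, struct proc *child)
{
    parent->cores += child->cores;
    child->cores = 0;
    child->state = PROC_DYING;
}
```

A browser-like parent would call proc_create() on the child, set
child.render_fd, call proc_allow_syscall() for each permitted syscall, call
proc_give_cores() to carve out a subpartition, and finish with proc_run().
With fork/exec there is no window between "exists" and "running" in which to
do that setup from the outside, which is the point of the split.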