akaros.git
3 years agoslab: Bootstrap more kmem_caches
Barret Rhoden [Sun, 30 Oct 2016 22:23:00 +0000 (18:23 -0400)]
slab: Bootstrap more kmem_caches

This statically allocates all of the boot-strapping caches.  This is not
strictly necessary, but it could be if the hash table default size was
enough to make a kmem_cache a large slab object.  At that point, we'd need
all three bootstrap caches to allocate one.  This way, we have less
bootstrapping to worry about.  We'll have more to worry about when we start
using magazines.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoslab: Use BSD_LISTs for the bufctls
Barret Rhoden [Sun, 30 Oct 2016 22:23:00 +0000 (18:23 -0400)]
slab: Use BSD_LISTs for the bufctls

The slab allocator has a long-standing TODO: BUF.  Instead of using a
hash table to look up a large object, we just used storage in the object
itself.  This was okay, other than possible fragmentation effects, but
it meant that the slab allocator touched every object it tracked.  We'll
eventually need an option to have "NO_TOUCH" slab allocators, where they
do not touch the objects they are tracking.  To do that, we'll need a
hash table.

This commit switches the bufctl struct from a TAILQ to a BSD_LIST, which
will make the hash table entries smaller.  This also fixes a FOREACH
freeing bug (use FOREACH_SAFE).
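
The FOREACH freeing bug is the usual one with the BSD queue macros: a plain FOREACH reads the next pointer from the current element after the loop body has run, so freeing the element inside the body is a use-after-free. Below is a minimal userspace sketch of the safe pattern. It assumes <sys/queue.h> is available; `struct bufctl_demo`, `fill_list()` and `free_all()` are hypothetical names and not the Akaros definitions. The _SAFE macro is defined locally in case the host header lacks it.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Some libc headers do not ship the _SAFE variant; define it if missing. */
#ifndef LIST_FOREACH_SAFE
#define LIST_FOREACH_SAFE(var, head, field, tvar)			\
	for ((var) = LIST_FIRST(head);					\
	     (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
	     (var) = (tvar))
#endif

/* Hypothetical stand-in for a bufctl; not the Akaros struct. */
struct bufctl_demo {
	void *buf_addr;
	LIST_ENTRY(bufctl_demo) link;
};
LIST_HEAD(bufctl_list, bufctl_demo);

/* Push n dummy entries onto the list.  Returns the number added. */
static int fill_list(struct bufctl_list *head, int n)
{
	int added = 0;

	for (int i = 0; i < n; i++) {
		struct bufctl_demo *b = malloc(sizeof(*b));

		if (!b)
			break;
		b->buf_addr = NULL;
		LIST_INSERT_HEAD(head, b, link);
		added++;
	}
	return added;
}

/* Free every entry.  'next' is saved before the body runs, so free(i)
 * cannot invalidate the iteration.  Returns the number freed. */
static int free_all(struct bufctl_list *head)
{
	struct bufctl_demo *i, *next;
	int freed = 0;

	LIST_FOREACH_SAFE(i, head, link, next) {
		LIST_REMOVE(i, link);
		free(i);
		freed++;
	}
	return freed;
}
```

Replacing LIST_FOREACH_SAFE with LIST_FOREACH in free_all() would read i's link field after free(i), which is the kind of bug this commit describes.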

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoSet num_cores early in boot
Barret Rhoden [Sun, 6 Nov 2016 16:21:02 +0000 (11:21 -0500)]
Set num_cores early in boot

The memory allocator will need to know the number of cores in the
system when it is initialized.  In the future, it may also need to know
the number of NUMA domains.  Determining the number of cores is somewhat
arch-specific.  We can do it with ACPI on x86, and on any other platform
that supports it.

Our ACPI code relies on the memory allocator and does a lot more than
determine the number of cores, so we have a simple helper that just
looks at the ACPI tables, finds the XSDT, then finds the MADT, then
counts the local apics.  We'll use this as num_cores (possibly an
overestimate).  The topology code will make sure we didn't
underestimate later in boot.
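
As a rough illustration of the last step (counting local APIC entries in the MADT), here is a sketch that walks a MADT-shaped byte buffer. It is not the Akaros helper: locating the XSDT and the MADT is omitted, and `count_lapics()` and `append_entry()` are hypothetical names. The layout constants follow the ACPI specification (a 36-byte table header, then 8 bytes of MADT fields, then variable-length entries whose first two bytes are type and length). Every type-0 entry is counted, including disabled ones, which is one way the result could be an overestimate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 36-byte SDT header + 4 (local APIC address) + 4 (flags) */
#define MADT_ENTRIES_OFF 44
#define MADT_TYPE_LAPIC 0

/* Count Processor Local APIC entries in a MADT of madt_len bytes. */
static int count_lapics(const uint8_t *madt, size_t madt_len)
{
	size_t off = MADT_ENTRIES_OFF;
	int count = 0;

	while (off + 2 <= madt_len) {
		uint8_t type = madt[off];
		uint8_t len = madt[off + 1];

		if (len < 2 || off + len > madt_len)
			break;	/* malformed entry: stop instead of looping */
		if (type == MADT_TYPE_LAPIC)
			count++;
		off += len;
	}
	return count;
}

/* Demo helper: write a zeroed entry of the given type and length at
 * 'off' and return the offset just past it. */
static size_t append_entry(uint8_t *buf, size_t off, uint8_t type,
                           uint8_t len)
{
	memset(buf + off, 0, len);
	buf[off] = type;
	buf[off + 1] = len;
	return off + len;
}
```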

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoCheck booting during trace_printk()
Barret Rhoden [Sun, 6 Nov 2016 16:18:52 +0000 (11:18 -0500)]
Check booting during trace_printk()

Instead of num_cores.  This is safer, in case we set num_cores before
various per-cpu structures are set up.

The reason for this is that the memory allocator will need to know about
num_cores, and that will happen very early in the booting process.

trace_printk() will be fine if it just uses the boot object instead of
per-cpu objects during boot.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoMoving 'booting' to a header
Barret Rhoden [Sun, 6 Nov 2016 16:17:56 +0000 (11:17 -0500)]
Moving 'booting' to a header

Instead of externing it in random places.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoReplace the old page allocator with the base arena
Barret Rhoden [Fri, 28 Oct 2016 20:53:46 +0000 (16:53 -0400)]
Replace the old page allocator with the base arena

The old allocator couldn't handle higher order allocations efficiently.
As memory was used, it'd take longer and longer to find contiguous
pages.

We bootstrap the base arena and add free segments to it based on the
free memory regions of multiboot.  The kpages_arena is used for the
main pages allocator.  Right now, it's just a pass-through arena that
imports from base.  In the future, it'll have its own qcaches built in,
which will make common allocations even faster.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAdd the arena allocator
Barret Rhoden [Fri, 28 Oct 2016 20:22:26 +0000 (16:22 -0400)]
Add the arena allocator

The arena allocator is based off of the Vmem allocator:

http://www.google.com/search?q=bonwick+vmem

This will be the basis for all memory allocation.  Right now, it does
not have integrated qcaches (slabs).  That will require some work with
the slab allocator.  You can build a jumbo page allocator, using a
helper that xallocs with an alignment, which is pretty cool.
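
To make the aligned-allocation idea concrete, here is a small sketch of the core arithmetic: given one free segment, find the lowest suitably aligned address that still has room for the request. This is only an illustration under assumed names (`fit_aligned()` is hypothetical and is not the arena API); a real Vmem-style xalloc also deals with boundary tags, phase, and splitting the segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the Akaros arena API.  Finds the lowest
 * 'align'-aligned address in [seg_start, seg_start + seg_len) with room
 * for 'size' bytes.  'align' must be a power of two. */
static bool fit_aligned(uintptr_t seg_start, size_t seg_len, size_t size,
                        size_t align, uintptr_t *base_out)
{
	uintptr_t mask = (uintptr_t)align - 1;
	uintptr_t base = (seg_start + mask) & ~mask;
	size_t skipped;

	assert(align && !(align & mask));
	if (base < seg_start)
		return false;	/* rounding up wrapped around */
	skipped = base - seg_start;
	if (skipped > seg_len || seg_len - skipped < size)
		return false;	/* not enough room after alignment */
	*base_out = base;
	return true;
}
```

With size and align both set to a jumbo page size (for example 2 MiB on x86-64), a successful call yields a naturally aligned jumbo page.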

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAdd hash_helper.h for custom dynamic hash tables
Barret Rhoden [Tue, 1 Nov 2016 00:25:21 +0000 (20:25 -0400)]
Add hash_helper.h for custom dynamic hash tables

The full-fledged dynamic hashtable.c doesn't work for a lot of code that
needs more control over its hash table.  For instance, the arena
allocator needs fine-grained control over allocations and a node's list
membership.

This header is a few building-block helpers that allow you to build your
own dynamically resized hash table.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoPort hash.h
Barret Rhoden [Mon, 31 Oct 2016 23:30:31 +0000 (19:30 -0400)]
Port hash.h

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoImport hash.h from Linux
Barret Rhoden [Mon, 31 Oct 2016 23:25:09 +0000 (19:25 -0400)]
Import hash.h from Linux

Version 4.6, which was before all the arch-specific additions.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAdd a #define for all MEM_FLAGS
Barret Rhoden [Fri, 28 Oct 2016 20:21:42 +0000 (16:21 -0400)]
Add a #define for all MEM_FLAGS

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoRemove mon_gfp()
Barret Rhoden [Thu, 27 Oct 2016 00:24:47 +0000 (20:24 -0400)]
Remove mon_gfp()

Unused, and it was a hack into the old allocator.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agomlx4: Remove page_is_free() safety check
Barret Rhoden [Thu, 27 Oct 2016 00:11:28 +0000 (20:11 -0400)]
mlx4: Remove page_is_free() safety check

With the arena allocator, it won't be easy to query the state of a given
page.  We actually could do that, if we wanted, with an arena helper
that looks up the btag for a given address, but it's a pain, it won't be
fast, and it will probably not work well with NUMA.

Considering this style of page pinning needs to change anyways, we might
as well remove it.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agox86: Stop freeing the trampoline page
Barret Rhoden [Thu, 27 Oct 2016 00:07:42 +0000 (20:07 -0400)]
x86: Stop freeing the trampoline page

The arena allocator won't let us free something it never allocated.  The
pages[] based allocator didn't care, since we massaged the refcnts the
right way during page_alloc_init().

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoJump to a real kstack ASAP during boot
Barret Rhoden [Wed, 26 Oct 2016 23:40:50 +0000 (19:40 -0400)]
Jump to a real kstack ASAP during boot

We actually were using the bootstack, which was never actually given out
by a memory allocator, for a long time.  Eventually, we'd give it back,
when the kthread code thought it was a spare it needed to free.  This
would confuse the arena allocator, which never gave out the memory in
the first place.

Now, we'll switch to using a kernel stack that was given to us by
get_kstack() right away.  This helps both with the allocator as well as
with whatever safety checks we'll use for the kernel stacks (e.g. guard
pages).  It'd be brutal if we had one unlucky kernel stack that didn't
have the protections we thought all stacks had (or will have, in this
case).

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agox86: set pcpui->{ts,gdt} early
Barret Rhoden [Wed, 26 Oct 2016 22:17:32 +0000 (18:17 -0400)]
x86: set pcpui->{ts,gdt} early

This allows us to set/get the stacktop with the usual, arch-independent
helper early.  I'll need this during init, before smp_boot.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoUse a helper for resetting kernel stacks
Barret Rhoden [Wed, 26 Oct 2016 20:40:37 +0000 (16:40 -0400)]
Use a helper for resetting kernel stacks

It's another arch-specific helper, but I have another case in an
upcoming commit that will need to pass the function pointer.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoIntegrate rbtrees into Akaros
Barret Rhoden [Thu, 13 Oct 2016 15:24:29 +0000 (11:24 -0400)]
Integrate rbtrees into Akaros

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoImport rbtrees from Linux
Barret Rhoden [Thu, 13 Oct 2016 14:57:04 +0000 (10:57 -0400)]
Import rbtrees from Linux

From Linux commit 9a2172a8d52c ("MAINTAINERS: Switch to kernel.org email
address for Javi Merino")

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoMove __always_inline to compiler.h
Barret Rhoden [Thu, 13 Oct 2016 15:15:26 +0000 (11:15 -0400)]
Move __always_inline to compiler.h

So other code can use it.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoRemove get_cont_phys_pages_at()
Barret Rhoden [Thu, 18 Aug 2016 17:42:20 +0000 (13:42 -0400)]
Remove get_cont_phys_pages_at()

I think this was for some weird debugging code, or maybe the old NIX mode.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoRemove page coloring
Barret Rhoden [Thu, 18 Aug 2016 16:02:22 +0000 (12:02 -0400)]
Remove page coloring

Page coloring doesn't work with contiguous memory allocators, and it
partitions all levels of the cache hierarchy, which doesn't work well with
spatial partitioning.  For instance, if we partition the L3 into 8 colors
(the number is based on the cache properties), we might be partitioning the
L1 and L2 into two colors (again, based on cache properties).  Although we
now have cache isolation in the shared LLC, we also partition a cache that
is already per-core.

The better approach is to use some sort of hardware support, such as
Intel's Cache Allocation Technology.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoRemove page refcnts
Barret Rhoden [Wed, 17 Aug 2016 21:22:10 +0000 (17:22 -0400)]
Remove page refcnts

page_decref() is just page_free(), now.  I'll do a rename in a later
commit.  We still needed to track if it was free or not for the currently
lousy memory allocator.

There might be issues with this, but if you aren't willing to potentially
break compatibility with Linux, then you'll never get anywhere.

There are a few reasons to do reference counts.  The only one we still have is
for devices that want to pin user memory for operations.  Specifically, the
mlx4 OS-bypass stuff does this.  The problem is that the user allocs memory
and gives arbitrary addresses to the device.  Instead, we should have the
device own the memory and let the user mmap the memory.  That gets rid of
any issues with locking the page, since the memory is always 'safe.'

That model doesn't work with traditional scatter-gather.  Worst case, we
can come up with something where we lock the VMR, instead of the page.
Though I'd rather come up with more explicit block data transfer
interfaces.

Note that the mlx4 OS-bypass is extremely dangerous now.  I think it was
always leaking memory before, btw.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoProvide a shim layer for reference counted pages
Barret Rhoden [Tue, 23 Aug 2016 20:54:38 +0000 (16:54 -0400)]
Provide a shim layer for reference counted pages

Right now, all pages are reference counted.  I'd like to try to stop doing
that to make contig allocations and maybe jumbo pages easier.  Longer term,
I'd like to get away from having a page struct too, though we'll see.

Some code, specifically mlx4, wants page allocations and to do reference
counting per page.  For that code, we provide this shim.

It actually looks like there are some bugs in mlx4's allocation/freeing
code, and how they account for fragments and references for higher-order
allocations.  Linux 4.7 seems to have the same structure, though perhaps
there are different semantics there.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoRefactor map_page_at_addr
Barret Rhoden [Tue, 16 Aug 2016 19:52:04 +0000 (15:52 -0400)]
Refactor map_page_at_addr

The pte_is_mapped() case was a little sketchy.  The page_is_pagemap() check
was a little hard to follow; it's easier if the caller tells us what to do,
instead of us inferring what to do.

This also fixes a memory leak in __hpf, where if we failed to map a
non-page-map page, we neglected to free it.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoFix bounds checks and misc errors in mm.c
Barret Rhoden [Tue, 16 Aug 2016 20:23:41 +0000 (16:23 -0400)]
Fix bounds checks and misc errors in mm.c

Some of the UMAPTOP checks could be overflowed.  There are probably more
throughout the kernel (though not for UMAPTOP).  Using the umem helper
simplifies the logic a bit.
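
The overflow in question is the usual one in range checks: if addr + len wraps around, the sum looks small and passes the test. A minimal sketch of the broken and the safe form follows. `DEMO_UMAPTOP` is a made-up limit for illustration, and `in_range_naive()` and `in_range_safe()` are hypothetical names, not the umem helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit for the demo; not Akaros's actual UMAPTOP. */
#define DEMO_UMAPTOP 0x0000800000000000ULL

/* Broken: addr + len can wrap past 2^64 and compare as a small value. */
static bool in_range_naive(uint64_t addr, uint64_t len)
{
	return addr + len <= DEMO_UMAPTOP;
}

/* Safe: compare len against the room left below the limit, so no
 * addition can overflow. */
static bool in_range_safe(uint64_t addr, uint64_t len)
{
	return addr <= DEMO_UMAPTOP && len <= DEMO_UMAPTOP - addr;
}
```

A huge len such as UINT64_MAX - 0x1000 passes the naive check for addr = 0x400000000000 but is rejected by the safe one.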

For those curious, mprotect()'s ENOMEM errno is what the man page says to
do, even though the others do EINVAL.

The printk change for create_vmr's failure is in the hopes of catching a
bug.  I occasionally see this:

cs has not created #srv/cs yet, spinning until it does....

kernel warning at kern/src/mm.c:103, from core 0: Not making a VMR,
        wanted 0x0000400000000000, + 0x00003b5100001000 = 0x00007b5100001000

[kernel] do_mmap() aborted for 0x0000400000000000 + 4096!

The do_mmap()'s printk would have truncated the top part of len (0x3b51),
if it was passed.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoRemove SYS_cache_buster (XCC)
Barret Rhoden [Tue, 16 Aug 2016 17:59:56 +0000 (13:59 -0400)]
Remove SYS_cache_buster (XCC)

"You're killing me, Buster."

Reinstall your kernel headers.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoFix extra decref of shared_page
Barret Rhoden [Tue, 16 Aug 2016 17:55:35 +0000 (13:55 -0400)]
Fix extra decref of shared_page

We should never be freeing shared_page once it is allocated.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoMake page_insert() consume the caller's refcnt
Barret Rhoden [Tue, 16 Aug 2016 17:48:49 +0000 (13:48 -0400)]
Make page_insert() consume the caller's refcnt

The page refcounting needs to go.  The refcnt was from a time when a page
could have multiple objects tracking it independently.  Nowadays that is
handled higher up, such as in the page cache.

For the most part, the freeing/allocating of the memory is handled higher
up in the stack.  We were already doing this with e.g. procinfo, where we
would free it twice, doing double the work necessary.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoFix the remaining /dev/ -> /dev_vfs/
Barret Rhoden [Wed, 26 Oct 2016 23:50:33 +0000 (19:50 -0400)]
Fix the remaining /dev/ -> /dev_vfs/

Probably the last ones.  =)

This only affected you if you attempted to build the ancient EXT2
support.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoVMM: Fix virtio-net bytestostrip initialization
Barret Rhoden [Mon, 28 Nov 2016 15:24:28 +0000 (10:24 -0500)]
VMM: Fix virtio-net bytestostrip initialization

Needs to be initialized in per-virtio loop.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoFix slow uthread context switches
Barret Rhoden [Tue, 8 Nov 2016 16:16:44 +0000 (11:16 -0500)]
Fix slow uthread context switches

The lock addq is accessing 8 bytes, but we only need to access one byte.
Accessing 8 bytes could span a cacheline boundary, which it does currently.
Doing so causes two cache misses!
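
Whether an access straddles two cachelines depends only on its address and width. A small sketch of that check, assuming 64-byte lines (typical on x86, but an assumption here); `spans_cacheline()` is a hypothetical name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHELINE 64	/* assumed line size */

/* True if the 'size' bytes starting at 'addr' touch two cachelines.
 * 'size' must be at least 1. */
static bool spans_cacheline(uintptr_t addr, size_t size)
{
	return addr / DEMO_CACHELINE != (addr + size - 1) / DEMO_CACHELINE;
}
```

An 8-byte access at offset 60 within a line covers bytes 60 through 67 and crosses into the next line, while a 1-byte access at the same offset does not.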

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agovmm: allow a vmm to override the vmcall function
Ronald G. Minnich [Tue, 29 Nov 2016 01:23:45 +0000 (17:23 -0800)]
vmm: allow a vmm to override the vmcall function

Add a vmcall struct to the guest thread struct. This
allows us, on a guest thread by guest thread basis, to
support vmcalls.

I've tested this with dune and it works fine.
Longer term, we may want to define an ops structure
but I think that's rushing it a bit.

Change-Id: Ic381f0e70946ba2396303e5d6428bc999ec4b6dd
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agovmx: Add and use constants for PML and TSC Scaling
Fergus Simpson [Tue, 29 Nov 2016 01:35:39 +0000 (17:35 -0800)]
vmx: Add and use constants for PML and TSC Scaling

This adds definitions for secondary processor-based VM-Execution
controls "Enable PML" and "TSC Scaling".

The need for attempting to unset Enable PML was discovered on a
Broadwell-DE system and TSC Scaling was previously an undocumented
constant.

Change-Id: If4eec1f43da084d6f1c3764c31f7075a9f5605d3
Signed-off-by: Fergus Simpson <afergs@google.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoRewrite _rock_get_listen_fd to make it simpler. (XCC)
Dan Cross [Mon, 21 Nov 2016 20:57:11 +0000 (15:57 -0500)]
Rewrite _rock_get_listen_fd to make it simpler. (XCC)

Use `strrchr` to find the last '/' in the source string when
finding the 'ctl' component.  More verbose error reporting and
assertions.
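
For readers unfamiliar with the idiom, here is a sketch of using strrchr to swap the last path component for "ctl". It is not the actual _rock_get_listen_fd code; `ctl_path_for()` and the example path are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy 'path' into 'out' with its final component
 * replaced by "ctl".  Returns 0 on success, -1 if there is no '/' or
 * the result does not fit in 'outlen' bytes. */
static int ctl_path_for(const char *path, char *out, size_t outlen)
{
	const char *slash = strrchr(path, '/');
	size_t dirlen;
	int n;

	if (!slash)
		return -1;
	dirlen = (size_t)(slash - path) + 1;	/* keep the trailing '/' */
	n = snprintf(out, outlen, "%.*s%s", (int)dirlen, path, "ctl");
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return 0;
}
```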

Rebuild glibc.

Change-Id: I176170d96130403b1e2fa42506caa50a02712e32
Signed-off-by: Dan Cross <crossd@gmail.com>
[xcc warning]
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoFix random checkpatch warnings and errors in plan9_sockets.c.
Dan Cross [Mon, 21 Nov 2016 20:56:07 +0000 (15:56 -0500)]
Fix random checkpatch warnings and errors in plan9_sockets.c.

Change-Id: I80b4310e76f84a57cab01045b741a199148e1b51
Signed-off-by: Dan Cross <crossd@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agouser/vmm: fflush stdout on every write
Ronald G. Minnich [Tue, 22 Nov 2016 00:01:03 +0000 (16:01 -0800)]
user/vmm: fflush stdout on every write

Things are not reliable enough yet to assume a final fflush
on stdout will happen. Just fflush on every character.

Change-Id: Ib24b6844205849b7d50882ff1724bd46a19ba4b3
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agodune: clean up and remove lots of cruft
Ronald G. Minnich [Mon, 21 Nov 2016 17:22:20 +0000 (09:22 -0800)]
dune: clean up and remove lots of cruft

Dune was derived from test programs first written in early 2015.
We might as well take the opportunity to decruft it.

Change-Id: I955f3f64ab3e387d9f093f5fea158fa3c1d4c8e9
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agodune: add a dune command
Ronald G. Minnich [Mon, 21 Nov 2016 17:03:50 +0000 (09:03 -0800)]
dune: add a dune command

This is much like the Stanford dune system in that it is
designed to run simple non-kernels that support user
mode programs. It lets us show the ease of implementation
of such a command in the Akaros VM model.

To start, this is just a clone of vmrunkernel.

Change-Id: I2ac0fdddd3e834e6d9ea06d75c166a60d1fb4775
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agouser/vmm: print the RSP as well as RIP
Ronald G. Minnich [Mon, 21 Nov 2016 16:55:13 +0000 (08:55 -0800)]
user/vmm: print the RSP as well as RIP

Change-Id: I2f3df21c7a68dd3bde7142b6ba4f255ad62ad9f7
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoSimplify block_alloc function
Fergus Simpson [Thu, 17 Nov 2016 17:35:33 +0000 (09:35 -0800)]
Simplify block_alloc function

Removed optimization from Plan 9 where the driver would attempt to make
use of extra memory reserved by malloc. Akaros does not currently have
the capability to get the real size of the reserved memory, so leaving
the optimization in just resulted in some complicated pointer arithmetic
that always yielded the defined constant Hdrspc.

The optimization has been left in comments in case Akaros ever gets the
ability to get the actual size of reserved memory.

Also added an assert that Hdrspc is aligned to BLOCKALIGN - if it were
not then Hdrspc would randomly be truncated by up to Hdrspc%BLOCKALIGN
bytes.
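
One way to see the truncation: rounding a value down to a multiple of the alignment discards value % alignment bytes, which is zero only when the value is already aligned. The sketch below uses made-up values (`DEMO_HDRSPC` and `DEMO_BLOCKALIGN` are assumptions, not necessarily the kernel's) and a compile-time check instead of a runtime assert; the exact arithmetic in block_alloc is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values for illustration; check the real definitions. */
#define DEMO_HDRSPC 128
#define DEMO_BLOCKALIGN 32

/* Compile-time version of the alignment check. */
_Static_assert(DEMO_HDRSPC % DEMO_BLOCKALIGN == 0,
               "header space must be a multiple of the block alignment");

/* Round x down to a multiple of align (which must be a power of two). */
static size_t round_down(size_t x, size_t align)
{
	return x & ~(align - 1);
}
```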

Change-Id: I5249df6fdd8f47f0f07b35fcf3f7fed45f61d383
Signed-off-by: Fergus Simpson <afergs@google.com>
[removed mlx4 references]
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoFix virtio net handling of the header.
Gan Shun [Thu, 17 Nov 2016 18:49:18 +0000 (10:49 -0800)]
Fix virtio net handling of the header.

We weren't stripping the header off correctly, and we didn't handle the
case where the guest would use a separate iov for the virtio net header.
This commit properly finds the offset where the ethernet frame begins
and writes that to the NIC.
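
A sketch of the offset search, covering both layouts mentioned above (header and frame in one iov, or header in its own iov). This is an illustration, not the code from the commit: `find_frame_start()` is a hypothetical name, and the header length is passed in as a parameter because it depends on the negotiated virtio features.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper: skip 'hdr_len' bytes across the iov array and
 * report where the payload (the ethernet frame) starts, as an iov index
 * plus an offset into that iov.  Returns 0 on success, -1 if the iovs
 * hold nothing beyond the header. */
static int find_frame_start(const struct iovec *iov, int iovcnt,
                            size_t hdr_len, int *idx, size_t *off)
{
	for (int i = 0; i < iovcnt; i++) {
		if (hdr_len < iov[i].iov_len) {
			*idx = i;
			*off = hdr_len;
			return 0;
		}
		hdr_len -= iov[i].iov_len;	/* header covers this iov */
	}
	return -1;
}
```

When the guest uses a dedicated iov for the header, the search lands at offset 0 of the following iov.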

Signed-off-by: Gan Shun <ganshun@gmail.com>
Change-Id: I6a2ad870d00752a60386bfde8b7b01287f95899d
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agovmrunkernel: add option to set the stack
Ronald G. Minnich [Tue, 15 Nov 2016 00:01:52 +0000 (16:01 -0800)]
vmrunkernel: add option to set the stack

In most cases we don't want to set the stack,
but add an option so we can set it if needed.

Change-Id: I686211b723acfe6efc86a4fc01c1c89c52659d70
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agovmrunkernel: get rid of coreboot tables
Ronald G. Minnich [Mon, 14 Nov 2016 23:53:28 +0000 (15:53 -0800)]
vmrunkernel: get rid of coreboot tables

Maybe we will need them someday but not now.

Change-Id: Ib731eef45a43f6059c1c9fbf8918b771814ca723
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agovmrunkernel: allow -M for setting memory start
Ronald G. Minnich [Mon, 14 Nov 2016 23:18:12 +0000 (15:18 -0800)]
vmrunkernel: allow -M for setting memory start

And, as part of finding compiler warnings to make
this work, do some cleanup.

Oh, and as part of seeing that the help message
was woefully wrong, fix that too by having it
print the contents of the options struct,
not a string that will keep going stale :-)

Change-Id: I98b25095ff2f1255afbf1257d56197b1f6bc8d08
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
[formatting nits]
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAdding documentation for using adt and gerrit to Contributing.md
Gan Shun [Wed, 9 Nov 2016 22:36:17 +0000 (14:36 -0800)]
Adding documentation for using adt and gerrit to Contributing.md

Signed-off-by: Gan Shun <ganshun@gmail.com>
Change-Id: I4b58d08e88d570c1d237f7cb14ef79fd21654940
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agovmrunkernel: remove statically allocated _kernel[]
Ronald G. Minnich [Thu, 3 Nov 2016 20:05:49 +0000 (13:05 -0700)]
vmrunkernel: remove statically allocated _kernel[]

kernel memory is now dynamically allocated.
It always starts at 16 MiB, a good choice for linux.
It defaults to 1GiB but you can change the size
via -m.

The startup code makes sure that __procinfo.program_end
is < 16 MiB, and that 16 MiB + memsize does not intrude into
BRK_START.

We also don't use MAP_FIXED. Rather, we test after
the mmap that we got the address we want. This
ensures that we got our mapping and that we did
not get it at the expense of unmapping something else.
It's a more conservative test than using MAP_FIXED
and testing for MAP_FAILED.
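
The "hint, then verify" pattern can be sketched as follows. This is a generic POSIX-style illustration, not the vmrunkernel code; `map_exactly_at()` is a hypothetical name, and MAP_ANONYMOUS is assumed to be available.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on some libcs */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: ask for an anonymous mapping at 'want' without
 * MAP_FIXED.  Returns 0 only if the kernel placed it exactly there.  If
 * it landed elsewhere, the mapping is undone and -1 is returned, so no
 * existing mapping is ever replaced. */
static int map_exactly_at(void *want, size_t len)
{
	void *got = mmap(want, len, PROT_READ | PROT_WRITE,
	                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (got == MAP_FAILED)
		return -1;
	if (got != want) {
		munmap(got, len);
		return -1;
	}
	return 0;
}
```

With MAP_FIXED, the same request would silently unmap whatever already lived at that address, which is the outcome the check is meant to rule out.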

Tested to booting a linux kernel.

Change-Id: I6dc2c8e729f27c143e38f53a229e84ab145fb051
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAdded ADT script
Gan Shun [Wed, 2 Nov 2016 17:35:23 +0000 (10:35 -0700)]
Added ADT script

This allows us to easily push to gerrit and set up custom reviewers and
topics. The topic defaults to the local branch name unless otherwise
specified.

Signed-off-by: Gan Shun <ganshun@gmail.com>
Change-Id: I841ed157ef6d663d718368652654b0b6039bdc7a
[removed blank at EOF]
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agovmrunkernel: load the file using the ELF library
Ronald G. Minnich [Tue, 1 Nov 2016 16:41:38 +0000 (09:41 -0700)]
vmrunkernel: load the file using the ELF library

This has been used to boot a full Linux kernel environment
to multiuser.

Change-Id: I9ba0ef062f05994225358e92a24de2d7934c8cd9
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAdd a script to help review gerrit patch sets
Barret Rhoden [Tue, 1 Nov 2016 17:36:56 +0000 (13:36 -0400)]
Add a script to help review gerrit patch sets

Like git track-review, this grabs a branch for a gerrit change (with git
gerrit-track), extracts it into patches, and runs checkpatch.

It could just as easily call git checkpatch, but breaking it into .patches
helps a little.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoReorder the top level Makefile so that full builds work again
Ronald G. Minnich [Mon, 31 Oct 2016 22:56:28 +0000 (15:56 -0700)]
Reorder the top level Makefile so that full builds work again

Otherwise, they fail, as gelf.h is not installed when
make tests runs.

Change-Id: If19d8515706a7a43ccd37bf2e60fbf88ce4cd581
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoDocfix: changed obj/kernel to obj/kern
Fergus Simpson [Tue, 1 Nov 2016 00:43:30 +0000 (17:43 -0700)]
Docfix: changed obj/kernel to obj/kern

There is no kernel folder in obj, just kern.

Change-Id: Id3f901fc0c347cb5e0c5fa220ce83f7338199770
Signed-off-by: Fergus Simpson <afergs@google.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAdd script to track a particular gerrit change
Gan Shun [Fri, 28 Oct 2016 19:30:27 +0000 (12:30 -0700)]
Add script to track a particular gerrit change

This script pulls the latest patch-set from gerrit for a particular
change and creates a branch.

Signed-off-by: Gan Shun <ganshun@gmail.com>
Change-Id: I56120268935ecca38b978a8e519d5cab6430e70f

3 years agoMove the BRK_START to a fixed, safe address (XCC)
Barret Rhoden [Wed, 26 Oct 2016 19:19:07 +0000 (15:19 -0400)]
Move the BRK_START to a fixed, safe address (XCC)

The VM code often wants to mmap blobs at various fixed addresses, such
as the guest kernel.  Our old glibc heap would start right at the top of
the program's loading point, which meant that we couldn't safely use any
of that memory.  The current vmrunkernel just has a huge array that
covers the memory regions it expects to use.  This is less than ideal.

This commit just specifies a region of the process's virtual address
space that glibc will use for its sbrk() allocations (e.g. malloc()).
Any program can safely mmap with MAP_FIXED below this address (up to the
binary's end point, which the kernel reports in procinfo->program_end).

Here's a before and after.  Note the old 0x21000 bytes has moved from
0x647000 to its new location at 0x100000000000.

bash-4.3$ cat /proc/self/maps
00100000-00120000 rwxp 00000000 01:00 146 /lib/ld-2.19.so
00320000-00321000 r--p 00020000 01:00 146 /lib/ld-2.19.so
00321000-00322000 rw-p 00021000 01:00 146 /lib/ld-2.19.so
00322000-00323000 rw-p 00000000 00:00 0 [heap]
00400000-00443000 r-x- 00000000 01:00 102 /bin/busybox
00443000-00444000 r-xp 00043000 01:00 102 /bin/busybox
00643000-00644000 rw-p 00043000 01:00 102 /bin/busybox
00644000-00647000 rw-p 00000000 00:00 0 [heap]
00647000-00668000 rwx- 00000000 00:00 0 [heap]
400000000000-400000001000 rw-p 00000000 01:00 146 /lib/ld-2.19.so
400000001000-400000002000 rw-p 00000000 00:00 0 [heap]
400000002000-400000141000 r-xp 00000000 01:00 182 /lib/libc-2.19.so
400000141000-400000341000 ---p 0013f000 01:00 182 /lib/libc-2.19.so
400000341000-400000345000 r--p 0013f000 01:00 182 /lib/libc-2.19.so
400000345000-400000347000 rw-p 00143000 01:00 182 /lib/libc-2.19.so
400000347000-40000034a000 rw-p 00000000 00:00 0 [heap]
40000034a000-40000034b000 rw-p 00000000 00:00 0 [heap]
40000034b000-40000034f000 rw-- 00000000 00:00 0 [heap]
40000034f000-400000351000 rwx- 00000000 00:00 0 [heap]
400000351000-400000353000 rw-- 00000000 00:00 0 [heap]
7f7fff8ff000-7f7fff9ff000 rw-- 00000000 00:00 0 [heap]

bash-4.3$ cat /proc/self/maps
00100000-00120000 rwxp 00000000 01:00 146 /lib/ld-2.19.so
00320000-00321000 r--p 00020000 01:00 146 /lib/ld-2.19.so
00321000-00322000 rw-p 00021000 01:00 146 /lib/ld-2.19.so
00322000-00323000 rw-p 00000000 00:00 0 [heap]
00400000-00443000 r-x- 00000000 01:00 102 /bin/busybox
00443000-00444000 r-xp 00043000 01:00 102 /bin/busybox
00643000-00644000 rw-p 00043000 01:00 102 /bin/busybox
00644000-00647000 rw-p 00000000 00:00 0 [heap]
100000000000-100000021000 rwx- 00000000 00:00 0 [heap]
400000000000-400000001000 rw-p 00000000 01:00 146 /lib/ld-2.19.so
400000001000-400000002000 rw-p 00000000 00:00 0 [heap]
400000002000-400000141000 r-xp 00000000 01:00 182 /lib/libc-2.19.so
400000141000-400000341000 ---p 0013f000 01:00 182 /lib/libc-2.19.so
400000341000-400000345000 r--p 0013f000 01:00 182 /lib/libc-2.19.so
400000345000-400000347000 rw-p 00143000 01:00 182 /lib/libc-2.19.so
400000347000-40000034a000 rw-p 00000000 00:00 0 [heap]
40000034a000-40000034b000 rw-p 00000000 00:00 0 [heap]
40000034b000-40000034f000 rw-- 00000000 00:00 0 [heap]
40000034f000-400000351000 rwx- 00000000 00:00 0 [heap]
400000351000-400000353000 rw-- 00000000 00:00 0 [heap]
7f7fff8ff000-7f7fff9ff000 rw-- 00000000 00:00 0 [heap]

Rebuild glibc.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoRemove proc->heap_top
Barret Rhoden [Wed, 26 Oct 2016 19:40:40 +0000 (15:40 -0400)]
Remove proc->heap_top

Unused.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoRemove init=/bin/sh from vmimage_cmdine
Gan Shun [Wed, 26 Oct 2016 18:08:58 +0000 (11:08 -0700)]
Remove init=/bin/sh from vmimage_cmdine

We no longer need that to boot all the time, so I'm removing it from the
defaults.

Signed-off-by: Gan Shun <ganshun@gmail.com>
Change-Id: I18e7023475a8de4abf0588dbc7c298ccd6632e89
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoDelete unsupported entries for userspace MSR handling.
Gan Shun [Wed, 26 Oct 2016 18:08:57 +0000 (11:08 -0700)]
Delete unsupported entries for userspace MSR handling.

We can't handle most of these emulated MSRs because we don't actually read
and write the MSR in userspace. Removing them from the emmsr array.

Signed-off-by: Gan Shun <ganshun@gmail.com>
Change-Id: I127adf7ef346df7a5aeb3959b4b41afc25921c49
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoFix IA32_MISCENABLE disabling of PEBS
Gan Shun [Wed, 26 Oct 2016 18:08:56 +0000 (11:08 -0700)]
Fix IA32_MISCENABLE disabling of PEBS

We weren't correctly checking the written value. We tell the guest that
PEBS is disabled, thus when they write the same value back to the MSR, we
should check for the disable bit in miscenable.

Signed-off-by: Gan Shun <ganshun@gmail.com>
Change-Id: I0e00119d7fec678e2c4e3b2185565444022ac140
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAHCI: Prevent sign extension of partial address
Fergus Simpson [Thu, 20 Oct 2016 19:00:55 +0000 (12:00 -0700)]
AHCI: Prevent sign extension of partial address

Drive reads were not working past the 1 TiB mark because the resulting
address was negative. This was determined to be an issue with an
unsigned char getting sign extended when bit shifted into an int64_t.
It is now cast to a uint32_t after the shift to prevent sign extension.
The container was also changed from int64_t to uint64_t.
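
The effect can be reproduced in a few lines. The sketch below is an illustration, not the driver code: it assembles four little-endian bytes and shows what happens when the 32-bit result is treated as signed before being widened. The function names are hypothetical. Assuming 512-byte sectors, sector 2^31 sits at exactly 1 TiB, which is consistent with the symptom described.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates the bug: the top byte lands in bit 31 of a signed 32-bit
 * value, which is then sign-extended when widened to 64 bits.  The
 * uint32_t to int32_t conversion is implementation-defined for values
 * above INT32_MAX; common compilers wrap it, which the demo relies on. */
static int64_t low32_sign_extended(const uint8_t b[4])
{
	uint32_t raw = (uint32_t)b[0] | (uint32_t)b[1] << 8 |
	               (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;

	return (int64_t)(int32_t)raw;
}

/* Fixed: keep everything unsigned, so widening zero-extends. */
static uint64_t low32_zero_extended(const uint8_t b[4])
{
	uint32_t raw = (uint32_t)b[0] | (uint32_t)b[1] << 8 |
	               (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;

	return (uint64_t)raw;
}
```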

Change-Id: I590b0da4fd0c02b0e2542a0b65bde510bba89525
Signed-off-by: Fergus Simpson <afergs@google.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAHCI: Skip device permission check
Fergus Simpson [Tue, 18 Oct 2016 00:11:26 +0000 (17:11 -0700)]
AHCI: Skip device permission check

Permissions are not currently implemented in Akaros. This change simply
makes it so that devpermcheck(...) always returns before throwing an
error that would result in permission being denied. This allows the
device to actually be used even though permissions have not been
implemented.

Change-Id: Ic2f19071803bba497d916031a22bcbc0b70e8ffd
Signed-off-by: Fergus Simpson <afergs@google.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAHCI: Add C600 HBA and fix PCI iteration bugs
Fergus Simpson [Tue, 18 Oct 2016 00:09:30 +0000 (17:09 -0700)]
AHCI: Add C600 HBA and fix PCI iteration bugs

This commit adds the C600 HBA to the list of recognized Intel HBAs so my
machine with one can use it and makes two fixes to the PCI driver.

The first PCI fix is in device detection. When detecting PCI devices, the
driver iterates over all functions on all devices. A device can have up to 8
functions (0-7) and the driver assumes they are sequential, giving up
when one is not found. This should not be done. A device is detected by
whether function 0 is implemented - if it is not, no device is connected.
While a device must implement function 0, it does not need to implement
its other functions sequentially. The C600 for example implements 0, 2,
3, so the driver did not detect functions 2 and 3 and the HBA did not
work. The driver has been changed so that it will only give up if
function 0 is not found.
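
The difference between the two scanning strategies can be sketched as follows. This is a simplified model, not the PCI driver's code: the function names are hypothetical, and a bitmask stands in for the config-space probe a real driver would perform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_MAX_FUNCS 8

/* Stand-in for a config-space probe (an assumption for this sketch):
 * bit n of 'implemented' is set when function n responds. */
static bool func_present(uint8_t implemented, int func)
{
	assert(func >= 0 && func < PCI_MAX_FUNCS);
	return (implemented >> func) & 1;
}

/* Old behavior: stop at the first missing function. */
static int count_funcs_sequential(uint8_t implemented)
{
	int count = 0;

	for (int func = 0; func < PCI_MAX_FUNCS; func++) {
		if (!func_present(implemented, func))
			break;
		count++;
	}
	return count;
}

/* New behavior: only a missing function 0 ends the scan early. */
static int count_funcs_sparse(uint8_t implemented)
{
	int count = 0;

	if (!func_present(implemented, 0))
		return 0;
	for (int func = 0; func < PCI_MAX_FUNCS; func++) {
		if (func_present(implemented, func))
			count++;
	}
	return count;
}
```

For a device that implements functions 0, 2, and 3 (mask 0x0d), the sequential scan finds one function and the sparse scan finds all three.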

Another issue was fixed with the PCI driver where it would not detect
devices on bus 0xff - the last bus. There was a comment about issues
with bus 0xff but that doesn't seem to be an issue any more so the
driver will now check the last bus.

Change-Id: I8dcac3f27b4983a9141e5700d73a758389cef75a
Signed-off-by: Fergus Simpson <afergs@google.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAHCI: Replace MMIO accesses with helper functions
Fergus Simpson [Mon, 17 Oct 2016 22:08:51 +0000 (15:08 -0700)]
AHCI: Replace MMIO accesses with helper functions

This commit removes all pointer accesses to MMIO by removing the
structs that represented MMIO. They have been replaced by helper
functions that use volatile accesses to make sure that the reads
and writes always happen. Instead of structs, blocks of memory are
simply used that are indexed into using constants that represent each
register. All virtual addresses are represented by void pointers, and
all physical addresses would be represented by uintptr_t types; however,
all physical addresses are stored in MMIO and hence are only accessed
with the provided helper functions. They are only written to as
addresses needed by the HBA. The host keeps its own references to the
structs as void pointers to virtual memory.
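
A minimal sketch of what such helpers might look like is below. The names and the register constant are illustrative assumptions; the driver's actual helper names and constants are not given in this message.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative register offset, named here only for the example. */
#define EX_HBA_GHC 0x04

/* Read a 32-bit register at 'offset' bytes from 'base'.  The volatile
 * access keeps the compiler from caching or eliding the read. */
static inline uint32_t ex_mmio_read32(void *base, uintptr_t offset)
{
	assert((offset & 0x3) == 0);
	return *(volatile uint32_t *)((uintptr_t)base + offset);
}

/* Write a 32-bit register at 'offset' bytes from 'base'. */
static inline void ex_mmio_write32(void *base, uintptr_t offset, uint32_t val)
{
	assert((offset & 0x3) == 0);
	*(volatile uint32_t *)((uintptr_t)base + offset) = val;
}
```

Callers index into a block of memory with a constant, e.g. ex_mmio_write32(base, EX_HBA_GHC, val), instead of dereferencing a struct field.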

Change-Id: Ia62cd57797ca8db9f21f47559c524149ad6fc11e
Signed-off-by: Fergus Simpson <afergs@google.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAHCI: Fix hardware address gets in driver
Fergus Simpson [Mon, 17 Oct 2016 21:55:51 +0000 (14:55 -0700)]
AHCI: Fix hardware address gets in driver

The AHCI driver was using PCIWADDR(ptr) to get the physical address of
memory mapped structs, but only assigning it to the lower 32 bits of
any address field and setting the upper 32 bits to 0. AHCI's memory
mapped structs use 32-bit registers so both halves are stored in
sequential registers.

This fix uses paddr_low32(ptr) and paddr_high32(ptr) to get both halves
of the address.
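
The split can be sketched like this. The helpers below are hypothetical stand-ins written for the example, not the kernel's paddr_low32()/paddr_high32(), and the low-half-first register order is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's address-splitting helpers. */
static inline uint32_t ex_low32(uint64_t paddr)
{
	return (uint32_t)(paddr & 0xffffffffu);
}

static inline uint32_t ex_high32(uint64_t paddr)
{
	return (uint32_t)(paddr >> 32);
}

/* Store a 64-bit address into two sequential 32-bit registers,
 * low half first (assumed ordering). */
static void ex_store_addr(volatile uint32_t *regs, uint64_t paddr)
{
	assert(regs);
	regs[0] = ex_low32(paddr);
	regs[1] = ex_high32(paddr);
}
```

The old behavior corresponds to always writing 0 to regs[1], which only works while the struct lives below 4 GiB.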

This should fix issues that occur when a memory mapped struct is outside
of the 32-bit address space.

Change-Id: I8e5ef62c580cc002510ccabadef9c2fcf0153bc8
Signed-off-by: Fergus Simpson <afergs@google.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAHCI: Remove struct typedefs from driver
Fergus Simpson [Mon, 17 Oct 2016 21:54:12 +0000 (14:54 -0700)]
AHCI: Remove struct typedefs from driver

This makes the driver more consistent with the rest of Akaros's code.

Change-Id: I427e439ee1b34a2bcf5ec86c94e4f590c0681ee5
Signed-off-by: Fergus Simpson <afergs@google.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAHCI: get it to build and almost work.
Ronald G. Minnich [Mon, 17 Oct 2016 21:36:44 +0000 (14:36 -0700)]
AHCI: get it to build and almost work.

In qemu, it still shows the device as having zero bytes.
But it does find it.

Change-Id: I81262b460a9cd43a848c1d782c109ec216afb795
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Fergus Simpson <afergs@google.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agomlx4: /dev/ -> /dev_vfs/
Barret Rhoden [Tue, 18 Oct 2016 18:21:57 +0000 (14:21 -0400)]
mlx4: /dev/ -> /dev_vfs/

Fixes the mlx4 driver in accordance with commit 9724f9a56650 ("Move VFS
/dev/ -> /dev_vfs/").

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoConvert the capability device to use SHA1
Ronald G. Minnich [Fri, 14 Oct 2016 22:43:03 +0000 (15:43 -0700)]
Convert the capability device to use SHA1

This involves a minor code change but I take the opportunity
to clean things up, getting rid of files we don't need,
and fixing includes.

Change-Id: Ie9ead4b6a2473d2f25b7b0a777343aef598f8dd9
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agocapability device: get it to compile
Ronald G. Minnich [Fri, 14 Oct 2016 20:26:18 +0000 (13:26 -0700)]
capability device: get it to compile

We need to do something about the use of sha1.

Change-Id: I80795609ccea1ac629cb7b9d4a95040cc040d76a
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
[whitespace]
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agocapability: run scripts/PLAN9 on capability device
Ronald G. Minnich [Fri, 14 Oct 2016 20:26:17 +0000 (13:26 -0700)]
capability: run scripts/PLAN9 on capability device

Change-Id: I55dbed3e636730c4768c61168f22c61c9e2c82fb
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
[sizeof ()s]
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agocapability: clang-format the capability device
Ronald G. Minnich [Fri, 14 Oct 2016 20:26:16 +0000 (13:26 -0700)]
capability: clang-format the capability device

Change-Id: I3e99b8317fc57fbfb775fd4242e5fb2f36411a46
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoadd the capability device from Harvey (from Plan 9)
Ronald G. Minnich [Fri, 14 Oct 2016 20:26:15 +0000 (13:26 -0700)]
add the capability device from Harvey (from Plan 9)

Change-Id: If159e72517809eedd0d1e98271e3dde57e035090
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agocrypto: get sha256 support to build.
Ronald G. Minnich [Thu, 13 Oct 2016 20:39:04 +0000 (13:39 -0700)]
crypto: get sha256 support to build.

For now we'll just go with the sh256.c. That said,
we'll keep the other bits in here. Sooner or later we may
need the other crypto functions. Note these are not compiled
in conditionally.

We should consider removing the conditional compiling
of the unrolled code; we don't have space constraints of firmware.

Change-Id: Ic792cf2b89fa4f01a94c420eb3c620b62c7bf2a9
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agocrypto: move includes to kern/include
Ronald G. Minnich [Thu, 13 Oct 2016 20:39:03 +0000 (13:39 -0700)]
crypto: move includes to kern/include

Change-Id: Id9e62496bb6595a7f282dfa26bd1fa1cbdac8bb4
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agocrypto: initial import of the chromeos vboot libraries
Ronald G. Minnich [Thu, 13 Oct 2016 20:39:02 +0000 (13:39 -0700)]
crypto: initial import of the chromeos vboot libraries

This code is needed to support the capability device, imported
in a separate commit. This is recommended as a 'best' version
of these algorithms by a security expert at Google.

This is from  https://chromium.googlesource.com/chromiumos/platform/vboot_reference
ref 3b55afa94e84c91874fcdad352b4053036886aa7

Change-Id: Ie3d90f183df990fd5bde6dfd83efbbd1e9b6009b
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoifconfig: invoke ipconfig with `-P` when configuring loopback.
Dan Cross [Fri, 7 Oct 2016 20:18:30 +0000 (16:18 -0400)]
ifconfig: invoke ipconfig with `-P` when configuring loopback.

Don't overwrite cached data from the DHCP server.

Change-Id: Ie7d3ad4be5d9cf6aeb4def7d8c47ffefe522c80d
Signed-off-by: Dan Cross <crossd@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoFix a minor bug in `ipconfig` and clean up some logic.
Dan Cross [Fri, 7 Oct 2016 20:13:20 +0000 (16:13 -0400)]
Fix a minor bug in `ipconfig` and clean up some logic.

If the lease time is 1, then we wouldn't wait; that's a bug.
Clean up an obnoxious conditional.

Change-Id: I25ad3c5ac3510d56a0dc3d37b464ca002236875b
Signed-off-by: Dan Cross <crossd@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoRemove our glibc poll implementation (XCC)
Barret Rhoden [Fri, 1 Jul 2016 20:25:29 +0000 (16:25 -0400)]
Remove our glibc poll implementation (XCC)

The old one just immediately returned.  Now that we have a version of
poll() in iplib, that one would have overridden glibc's.  However, if we
messed up and didn't link with iplib, then we'd silently be using the old
broken glibc version again.  This way, we'll catch it with a stub warning.

Rebuild glibc.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoLink busybox with iplib
Barret Rhoden [Thu, 7 Jul 2016 16:37:57 +0000 (12:37 -0400)]
Link busybox with iplib

It needs it to use our poll, instead of glibc's - which will soon be a
stub.

Rebuild busybox.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAdd nonblocking reads and FD taps to #cons/stdin
Barret Rhoden [Wed, 5 Oct 2016 21:02:04 +0000 (17:02 -0400)]
Add nonblocking reads and FD taps to #cons/stdin

This allows select/poll/epoll of stdin, which a few apps want to do.  The
change to consstat is so that select() can detect if the console is
readable.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAdd a devstat helper
Barret Rhoden [Thu, 6 Oct 2016 18:44:37 +0000 (14:44 -0400)]
Add a devstat helper

Devices can use this if they want to build their own stat functions,
given a dir.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoMove VFS /dev/ -> /dev_vfs/
Barret Rhoden [Wed, 5 Oct 2016 20:22:41 +0000 (16:22 -0400)]
Move VFS /dev/ -> /dev_vfs/

Now that stdin/out/err are not in the VFS, we can move the VFS device
directory and get rid of that nasty sys_open() hack.

Code that uses devices, such as the mlx4 user-driver, needs to look in
dev_vfs now.  mlx4 and blockdev still use devfs.c.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoMove stdin/stdout/stderr to #cons
Barret Rhoden [Wed, 5 Oct 2016 19:47:24 +0000 (15:47 -0400)]
Move stdin/stdout/stderr to #cons

It's the same logic, just accessible via 9ns instead of VFS.  #cons/null
already existed.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoRemove the old console input code; use qio
Barret Rhoden [Wed, 5 Oct 2016 16:16:41 +0000 (12:16 -0400)]
Remove the old console input code; use qio

This removes all of console.{c,h}, replacing its functionality with a
basic qio queue in devcons.

Other than using qio instead of the homebrewed rings and sems, this
also uses qiwrite directly from interrupt context.  This avoids an
excessive kernel message.

There were also a couple monitor-related commands sitting around in
console.{c,h}, which I moved to monitor files.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoFix a few debugging tools
Barret Rhoden [Tue, 4 Oct 2016 19:35:18 +0000 (15:35 -0400)]
Fix a few debugging tools

These are minor changes that helped with debugging.  The asserts in qio are
an attempt to debug a panic I got only once.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoqio: Only fire writable taps on edge transitions
Barret Rhoden [Thu, 6 Oct 2016 15:56:58 +0000 (11:56 -0400)]
qio: Only fire writable taps on edge transitions

We were firing writable taps (via the qwake_cb()) any time someone read
from a queue.  The effect of this was that applications that tapped their
Qdata FD would see a lot of writable taps firing, even though there wasn't
an edge transition.

For instance, say a conversation's write queue (outbound, TX) is nowhere
near full.  The app puts a packet in the queue.  When the network stack
drains the block from the queue with __qbread(), that will trigger a
writable tap.  So the app gets an FD tap / epoll every time it writes a
packet.  Incidentally, that behavior helped me track down a bug, but it
isn't what we're looking for.

Like the read side, we only fire on edge transitions, as done in commit
dbaaf4a3029e ("qio: Fire read taps on actual edges").  Back then, I had us
firing writable taps all the time, which was a bit much.

Note that we still fire the readable/writable taps regardless of
Qstarve/Qflow.  Those queue state flags only get set when someone tries to
read/write a queue and fails.  The taps we fire occur independently, which
is why their logic (e.g. was_empty / was_unwritable) is separate from the
rendez control variables (e.g. dowakeup).  This is probably right, since
it's possible for an application to know a queue would block without trying
(perhaps through stat()).
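
The edge-transition rule can be sketched with a toy queue. This is a simplified model under stated assumptions, not qio: the struct, the function names, and the counter that stands in for firing a tap are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy bounded queue that only tracks byte counts. */
struct ex_queue {
	size_t len;			/* bytes currently queued */
	size_t limit;			/* queue is unwritable at len == limit */
	unsigned int writable_taps;	/* stand-in for firing a writable tap */
};

static bool ex_writable(const struct ex_queue *q)
{
	return q->len < q->limit;
}

static bool ex_write(struct ex_queue *q, size_t n)
{
	if (q->len + n > q->limit)
		return false;
	q->len += n;
	return true;
}

static void ex_read(struct ex_queue *q, size_t n)
{
	bool was_unwritable = !ex_writable(q);

	assert(n <= q->len);
	q->len -= n;
	/* Fire only on the unwritable -> writable edge, not on every read. */
	if (was_unwritable && ex_writable(q))
		q->writable_taps++;
}
```

Draining a packet from a queue that was never full leaves writable_taps untouched; only a read that takes the queue from full to not-full bumps it.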

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoqio: Add a check to pullupblock
Barret Rhoden [Tue, 4 Oct 2016 19:32:58 +0000 (15:32 -0400)]
qio: Add a check to pullupblock

This delays the impending doom associated with BLOCK_EXTRA_DATA.  It's
relatively easy to trigger the problem if the block len (and block list
len) is < n.  Just write gibberish into a UDP data FD!

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoClose alarm FDs on fork()
Barret Rhoden [Fri, 30 Sep 2016 20:11:51 +0000 (16:11 -0400)]
Close alarm FDs on fork()

If a parent has alarm FDs, forks, but doesn't exec, then its child will
inherit its alarm FDs.  Other than the child being able to mess with the
parent's alarms, which is bad, the parent is unable to fully be freed (as
in __proc_free()) until the child closes the FD - usually by exiting.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoFix potential overflow error in CEQs (XCC)
Barret Rhoden [Tue, 4 Oct 2016 18:06:06 +0000 (14:06 -0400)]
Fix potential overflow error in CEQs (XCC)

The issue was that a consumer that came in during overflow recovery could
see that there was no overflow and return FALSE, meaning the CEQ was empty,
even though there were older messages.

Consider, the kernel already posted two messages, set overflow, and the
ring is empty:

Thread 1                      Thread 2
--------                      --------
see empty ring                see empty ring
see overflow is on
grab lock
clear overflow
extract a message
                              sees overflow is off
                              returns FALSE
sets overflow
unlocks
returns TRUE

And there's still a message in the CEQ that thread 2 should have grabbed.

While doing this change, I also changed nr_events to an unsigned.  That was
my original intent (based on the usage in epoll), and making the change now
keeps this commit from changing the size of the CEQ, which keeps everyone
from having to rebuild every application.

Reinstall your kernel headers.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAvoid needless TLB flush when restarting kthreads
Barret Rhoden [Mon, 3 Oct 2016 19:27:10 +0000 (15:27 -0400)]
Avoid needless TLB flush when restarting kthreads

If we're about to run on a core where our address space was already loaded,
we don't need to reload it.  Doing so actually triggers a TLB flush.

Regarding changing this comment:

/* In the future, we could check owning_proc. If it isn't set, we
 * could clear current and transfer the refcnt to kthread->proc. */

Although that is true, it's a bit dangerous and we'd need to measure to
know if it's worth the hassle.  The intent was that if owning_proc !=
current, then this kthread is running 'detached' from its process.  This
could be a syscall that briefly woke up and went back to sleep.  We could
avoid the incref and another decref shortly (when current gets cleared in
smp_idle()->abandon_core()) by transferring the ref.

The issue is that when we clear current, we also need to load a different
page table, since it's possible that the process will be freed before this
core ends up running another page table.  So we could do this optimization,
but then we'd need to load a page table, which is a TLB flush.  Then maybe
we'd be switching back to the process again, since we don't know that the
*next* kthread to run isn't also for this process.  So it's not clear that
avoiding the atomic ops (incref/decref) and moving up the TLB flush was
worth the hassle.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoFix clobber of current in kthread.c
Barret Rhoden [Mon, 3 Oct 2016 19:03:50 +0000 (15:03 -0400)]
Fix clobber of current in kthread.c

Originally, there wasn't a KTH_SAVE_ADDR_SPACE flag.  When I added that, I
didn't update this code.  The resulting bug was that if we had to undo a
kthread swap, that kthread was for a ktask (which doesn't have a proc), and
we had a process's address space loaded, then we'd clobber current
(clearing it).  That would result in a reference counting problem, since we
effectively deleted a counted reference to whatever process was current.
I'd see this on occasion under heavy networking and process load.

This also clears kthread->proc whenever the kthread is not blocked.
Previously, we were leaving the value of the uncounted proc reference.  The
code was okay, but it was surprising when debugging and was a source for
potential bugs.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoDelay clearing owning proc in sys_exec
Barret Rhoden [Fri, 30 Sep 2016 20:30:07 +0000 (16:30 -0400)]
Delay clearing owning proc in sys_exec

If we do it before any of the return calls, we could end up returning to
userspace while owning_proc isn't set.  I think the rest of the kernel is
able to handle this, but there's no sense messing around.  The old comment
makes it sound like we can block in that state too, which is probably true,
but returning by anything other than the error paths seems like a bad
idea.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoChange syscall usec timeouts to unsigned longs
Barret Rhoden [Fri, 30 Sep 2016 20:20:46 +0000 (16:20 -0400)]
Change syscall usec timeouts to unsigned longs

I noticed this due to some sys_block calls having a 'negative' argument in
the printout (due to the %d in the saved string).

While I was here, I also changed halt_core, though note that that timeout
was more of a 'future plan', I think.  The code doesn't use it.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoUse proc_decref() in #proc
Barret Rhoden [Fri, 30 Sep 2016 18:04:27 +0000 (14:04 -0400)]
Use proc_decref() in #proc

Use the helper instead of accessing the kref directly.  That helps with
debugging.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAdd trace_printf()
Barret Rhoden [Fri, 30 Sep 2016 16:34:57 +0000 (12:34 -0400)]
Add trace_printf()

This is a helper for userspace to print into the kernel's trace_printk()
log.  It's extremely useful for fast print debugging.

The trace_printk() log currently just maintains the last N entries, with
older entries replaced by newer ones.  You can cat it or tail it
repeatedly.  The log file is usually at /prof/kptrace.
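
The "last N entries" behavior can be illustrated with a small ring of fixed-size slots. This is a generic sketch, not the kernel's trace_printk() implementation; the sizes and names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define EX_LOG_ENTRIES 4
#define EX_LOG_ENTRY_LEN 32

struct ex_ring_log {
	char entries[EX_LOG_ENTRIES][EX_LOG_ENTRY_LEN];
	unsigned long next;	/* total number of entries ever added */
};

/* Add an entry, overwriting the oldest one once the ring is full. */
static void ex_log_add(struct ex_ring_log *log, const char *msg)
{
	assert(msg);
	snprintf(log->entries[log->next % EX_LOG_ENTRIES], EX_LOG_ENTRY_LEN,
	         "%s", msg);
	log->next++;
}

/* Return the i-th oldest retained entry, or NULL if there is none. */
static const char *ex_log_get(const struct ex_ring_log *log, unsigned long i)
{
	unsigned long count = log->next < EX_LOG_ENTRIES ? log->next
	                                                   : EX_LOG_ENTRIES;

	if (i >= count)
		return NULL;
	return log->entries[(log->next - count + i) % EX_LOG_ENTRIES];
}
```

After five adds into a four-slot ring, the first entry is gone and the remaining four read back oldest to newest.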

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoepoll: Set up the alarm_evq at init time
Barret Rhoden [Wed, 28 Sep 2016 16:21:01 +0000 (12:21 -0400)]
epoll: Set up the alarm_evq at init time

This way we don't need to alloc and free it repeatedly for timeouts.  The
main benefit for this now is that we actually leak memory when we free the
evqs in epoll.c (grep TODO.*INDIR).  This prevents long-running processes
from eventually running out of memory.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoAdd a helper for async syscalls
Barret Rhoden [Wed, 28 Sep 2016 16:18:52 +0000 (12:18 -0400)]
Add a helper for async syscalls

This helper makes an async syscall that will trigger the event queue upon
completion.  The caller doesn't check for completion manually - wait for
the ev_q.  This helps a few async syscall use cases, and avoids the need to
register the evq (CAS and whatnot) after submitting the syscall.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoepoll: Clean up epoll_wait and stop excess polling
Barret Rhoden [Tue, 27 Sep 2016 18:23:05 +0000 (14:23 -0400)]
epoll: Clean up epoll_wait and stop excess polling

The old for loop would keep polling up to maxevents.  As soon as it fails
once, we should stop.  At that moment, the CEQ was empty and we should
either block or return.

Also, this fixes a subtle issue.  If we extracted a message but it didn't
have an epoll event, we were still advancing 'i', which means we'd have a
hole in our events (so, a gibberish event) and we'd skip the last event.
Alas, this was *a* bug, but not the bug I was looking for.
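
The two loop fixes can be sketched as below. This is a simplified model, not the code in epoll.c: the message source is a plain array standing in for the CEQ, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ex_msg {
	int fd;
	uint32_t events;	/* 0 means "no epoll event for this message" */
};

/* Array-backed stand-in for the CEQ. */
struct ex_source {
	const struct ex_msg *msgs;
	size_t nr;
	size_t pos;
};

static bool ex_extract(struct ex_source *src, struct ex_msg *out)
{
	assert(src && out);
	if (src->pos >= src->nr)
		return false;
	*out = src->msgs[src->pos++];
	return true;
}

static int ex_collect(struct ex_source *src, struct ex_msg *out, int maxevents)
{
	struct ex_msg m;
	int i = 0;

	while (i < maxevents) {
		if (!ex_extract(src, &m))
			break;		/* source was empty: stop polling */
		if (!m.events)
			continue;	/* no event: do not advance i */
		out[i++] = m;
	}
	return i;
}
```

Because i only advances when an event is actually stored, the output array has no holes and the return value counts only real events.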

This also cleans up a bit of the logic for after we block, thanks to the
__epoll_wait_poll helper.

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoepoll: Fix event clobber
Barret Rhoden [Tue, 27 Sep 2016 15:48:51 +0000 (11:48 -0400)]
epoll: Fix event clobber

It was possible, though I never saw it, for an event entry to be clobbered.
Say we extracted a CEQ message for a particular FD.  That set the events
field in the epoll_event.  Then we attempt to extract another CEQ message,
possibly intending for another FD in the epoll set.  Instead, we get
another message for that same FD that we already set an event for.  When we
set that event, we clobber the original one.

The fix is to accumulate events for a given FD for all CEQ messages.
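
In its simplest form, accumulating rather than overwriting comes down to OR-ing new event bits into what is already recorded. The sketch below assumes a per-FD array of pending event masks; the names and flag values are hypothetical and not taken from epoll.c.

```c
#include <assert.h>
#include <stdint.h>

#define EX_MAX_FDS 16
#define EX_EV_READABLE 0x1
#define EX_EV_WRITABLE 0x4

/* Record events for an FD.  Using '|=' instead of '=' keeps events set
 * by an earlier message for the same FD. */
static void ex_record_event(uint32_t pending[EX_MAX_FDS], int fd,
                            uint32_t events)
{
	assert(fd >= 0 && fd < EX_MAX_FDS);
	pending[fd] |= events;
}
```

Two messages for the same FD, one readable and one writable, then leave both bits set instead of only the later one.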

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agoifconfig: use daemonize for cs, remove busy-waiting loop
Dan Cross [Thu, 6 Oct 2016 19:12:21 +0000 (15:12 -0400)]
ifconfig: use daemonize for cs, remove busy-waiting loop

Now that `cs` understands the daemonize protocol, use it in
the `ifconfig` script.  Remove the busy-waiting loop waiting
for the /srv/cs file to appear.

Change-Id: I06db794b38ad50957c56668f7a8cef807d54101c
Signed-off-by: Dan Cross <crossd@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
3 years agocs.c: Add an option to participate in the daemonize protocol.
Dan Cross [Thu, 6 Oct 2016 19:12:20 +0000 (15:12 -0400)]
cs.c: Add an option to participate in the daemonize protocol.

Add an option to make cs.c participate in the `daemonize` protocol:
it will signal completion by sending an event to its parent, instead
of busy-waiting on the creation of a srv file.

Change-Id: Ibd44c7352ca3e71621255db0dac9069178b1f845
Signed-off-by: Dan Cross <crossd@gmail.com>
Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>