VMM: handle EPT page faults
[akaros.git] / kern / arch / x86 / vmm / intel / vmx.c
1 /**
2  *  vmx.c - The Intel VT-x driver for Dune
3  *
4  * This file is derived from Linux KVM VT-x support.
5  * Copyright (C) 2006 Qumranet, Inc.
6  * Copyright 2010 Red Hat, Inc. and/or its affiliates.
7  *
8  * Original Authors:
9  *   Avi Kivity   <avi@qumranet.com>
10  *   Yaniv Kamay  <yaniv@qumranet.com>
11  *
12  * This modified version is simpler because it avoids the following
13  * features that are not requirements for Dune:
14  *  * Real-mode emulation
15  *  * Nested VT-x support
16  *  * I/O hardware emulation
17  *  * Any of the more esoteric X86 features and registers
18  *  * KVM-specific functionality
19  *
20  * In essence we provide only the minimum functionality needed to run
21  * a process in vmx non-root mode rather than the full hardware emulation
22  * needed to support an entire OS.
23  *
24  * This driver is a research prototype and as such has the following
25  * limitations:
26  *
27  * FIXME: Backward compatibility is currently a non-goal, and only recent
28  * full-featured (EPT, PCID, VPID, etc.) Intel hardware is supported by this
29  * driver.
30  *
31  * FIXME: Eventually we should handle concurrent users of VT-x more
32  * gracefully instead of requiring exclusive access. This would allow
33  * Dune to interoperate with KVM and other HV solutions.
34  *
35  * FIXME: We need to support hotplugged physical CPUs.
36  *
37  * Authors:
38  *   Adam Belay   <abelay@stanford.edu>
39  */
40
41 /* Basic flow.
42  * Yep, it's confusing. This is in part because the vmcs is used twice, for two different things.
43  * You're left with the feeling that they got partway through and realized they needed one for each of:
44  *
45  * 1) your CPU is going to be capable of running VMs, and you need state for that.
46  *
47  * 2) you're about to start a guest, and you need state for that.
48  *
49  * So there is 'get the cpu set up to be able to run VMs' stuff, and
50  * 'now let's start a guest' stuff.  In Akaros, CPUs will always be set up
51  * to run a VM if that is possible. Processes can flip themselves into
52  * a VM and that will require another VMCS.
53  *
54  * So: at kernel startup time, the SMP boot stuff calls
55  * k/a/x86/vmm/vmm.c:vmm_init, which calls arch-dependent bits, which
56  * in the case of this file is intel_vmm_init. That runs code
57  * that sets up stuff for ALL sockets, based on the capabilities of
58  * the socket it runs on. If any cpu supports vmx, it assumes they all
59  * do. That's a realistic assumption. So the call_function_all is kind
60  * of stupid, really; it could just see what's on the current cpu and
61  * assume it's on all. HOWEVER: there are systems in the wild that
62  * can run VMs on some but not all CPUs, due to BIOS mistakes, so we
63  * might as well allow for the chance that we'll only allow VMMCPs on a
64  * subset (not implemented yet however).  So: probe all CPUs, get a
65  * count of how many support VMX and, for now, assume they all do
66  * anyway.
67  *
68  * Next, call setup_vmcs_config to configure the GLOBAL vmcs_config struct,
69  * which contains all the naughty bits settings for all the cpus that can run a VM.
70  * Realistically, all VMX-capable cpus in a system will have identical configurations.
71  * So: 0 or more cpus can run VMX; all cpus which can run VMX will have the same configuration.
72  *
73  * Configure the msr_bitmap. This is the bitmap of MSRs the guest can
74  * access without causing an exit.  Currently, we only allow GS and FS base.
75  *
76  * Reserve bit 0 in the vpid bitmap, as guests cannot use that vpid.
77  *
78  * Set up what we call the vmxarea. The vmxarea is per-cpu, not
79  * per-guest. Once set up, it is left alone.  The ONLY thing we set in
80  * there is the revision id. The vmxarea is page-sized per cpu and
81  * page-aligned. Note that it can be smaller, but why bother? We know
82  * the max size and alignment, and it's convenient.
83  *
84  * Now that it is set up, enable vmx on all cpus. This involves
85  * testing VMXE in cr4, to see if we've been here before (TODO: delete
86  * this test), then testing MSR_IA32_FEATURE_CONTROL to see if we can
87  * do a VM, then setting VMXE in cr4, calling vmxon (does a vmxon
88  * instruction), and syncing vpids and epts.  Now the CPU is ready
89  * to host guests.
90  *
91  * Setting up a guest.
92  * We divide this into two things: vmm_proc_init and vm_run.
93  * Currently, on Intel, vmm_proc_init does nothing.
94  *
95  * vm_run is really complicated. It is called with a coreid, rip, rsp,
96  * cr3, and flags.  On intel, it calls vmx_launch. vmx_launch is set
97  * up for a few test cases. If rip is 1, it sets the guest rip to
98  * a function which will deref 0 and should exit with a failure. If rip is 0,
99  * it calls an infinite loop in the guest.
100  *
101  * The sequence of operations:
102  * create a vcpu
103  * while (1) {
104  * get a vcpu
105  * disable irqs (required or you can't enter the VM)
106  * vmx_run_vcpu()
107  * enable irqs
108  * manage the vm exit
109  * }
110  *
111  * get a vcpu
112  * See if the current cpu has a vcpu. If so, and if it is the same as the vcpu we want,
113  * vmcs_load(vcpu->vmcs) -- i.e. issue a VMPTRLD.
114  *
115  * If it's not the same, see if the vcpu thinks it is on the core. If it is not, call
116  * __vmx_get_cpu_helper on the other cpu, to free it up. Else vmcs_clear the one
117  * attached to this cpu. Then vmcs_load the vmcs for the vcpu on this cpu,
118  * call __vmx_setup_cpu, mark this vcpu as being attached to this cpu, done.
119  *
120  * vmx_run_vcpu: this one gets messy, mainly because it's a giant wad
121  * of inline assembly with embedded CPP crap. I suspect we'll want to
122  * un-inline it someday, but maybe not.  It's called with a vcpu
123  * struct from which it loads guest state, and to which it stores
124  * non-virtualized host state. It issues a vmlaunch or vmresume
125  * instruction as appropriate, and on return, it evaluates whether the
126  * launch/resume had an error in that operation. Note this is NOT the
127  * same as an error while in the virtual machine; this is an error in
128  * startup due to misconfiguration. Depending on what is returned, it's
129  * either a failed vm startup or an exit for one of many reasons.
130  *
131  */
132
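/* A minimal sketch (illustration only, not compiled) of the loop described
 * above, written in terms of the helpers defined later in this file.  The
 * real code, with the full exit-reason dispatch and error handling, is in
 * vmx_launch() below; the hypothetical example_run_guest() here only shows
 * the shape of the create / get / run / put cycle. */
#if 0
static int example_run_guest(struct proc *p, uint64_t rip, uint64_t rsp,
                             uint64_t cr3)
{
        struct vmx_vcpu *vcpu = vmx_create_vcpu(p);
        int exit_reason;

        if (!vcpu)
                return -ENOMEM;
        /* Point the guest at its initial register state. */
        vmx_get_cpu(vcpu);
        vmcs_writel(GUEST_RIP, rip);
        vmcs_writel(GUEST_RSP, rsp);
        vmcs_writel(GUEST_CR3, cr3);
        vmx_put_cpu(vcpu);
        while (1) {
                vmx_get_cpu(vcpu);
                disable_irq();          /* required before entering the VM */
                exit_reason = vmx_run_vcpu(vcpu);
                enable_irq();
                vmx_put_cpu(vcpu);
                /* Dispatch on exit_reason (EPT violation, CPUID, NMI, ...);
                 * here we only handle CPUID and bail on anything else. */
                if (exit_reason != EXIT_REASON_CPUID)
                        break;
                vmx_handle_cpuid(vcpu);
        }
        vmx_destroy_vcpu(vcpu);
        return exit_reason;
}
#endif
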
133 /* basically: only rename those globals that might conflict
134  * with existing names. Leave all else the same.
135  * this code is more modern than the other code, yet still
136  * well encapsulated, it seems.
137  */
138 #include <kmalloc.h>
139 #include <string.h>
140 #include <stdio.h>
141 #include <assert.h>
142 #include <error.h>
143 #include <pmap.h>
144 #include <sys/queue.h>
145 #include <smp.h>
146 #include <kref.h>
147 #include <atomic.h>
148 #include <alarm.h>
149 #include <event.h>
150 #include <umem.h>
151 #include <bitops.h>
152 #include <arch/types.h>
153 #include <syscall.h>
154
155 #include "vmx.h"
156 #include "../vmm.h"
157
158 #include "cpufeature.h"
159
160 #define currentcpu (&per_cpu_info[core_id()])
161
162 /*
163  * Keep MSR_STAR at the end, as setup_msrs() will try to optimize it
164  * away by decrementing the array size.
165  */
166 static const uint32_t vmx_msr_index[] = {
167 #ifdef CONFIG_X86_64
168         MSR_SYSCALL_MASK, MSR_LSTAR, MSR_CSTAR,
169 #endif
170         MSR_EFER, MSR_TSC_AUX, MSR_STAR,
171 };
172 #define NR_VMX_MSR ARRAY_SIZE(vmx_msr_index)
173
174 static unsigned long *msr_bitmap;
175
176 int x86_ept_pte_fix_ups = 0;
177
178 struct vmx_capability vmx_capability;
179 struct vmcs_config vmcs_config;
180
181 void ept_flush(uint64_t eptp)
182 {
183         ept_sync_context(eptp);
184 }
185
186 static void vmcs_clear(struct vmcs *vmcs)
187 {
188         uint64_t phys_addr = PADDR(vmcs);
189         uint8_t error;
190
191         asm volatile (ASM_VMX_VMCLEAR_RAX "; setna %0"
192                       : "=qm"(error) : "a"(&phys_addr), "m"(phys_addr)
193                       : "cc", "memory");
194         if (error)
195                 printk("vmclear fail: %p/%llx\n",
196                        vmcs, phys_addr);
197 }
198
199 static void vmcs_load(struct vmcs *vmcs)
200 {
201         uint64_t phys_addr = PADDR(vmcs);
202         uint8_t error;
203
204         asm volatile (ASM_VMX_VMPTRLD_RAX "; setna %0"
205                         : "=qm"(error) : "a"(&phys_addr), "m"(phys_addr)
206                         : "cc", "memory");
207         if (error)
208                 printk("vmptrld %p/%llx failed\n",
209                        vmcs, phys_addr);
210 }
211
212 /* Returns the physical address of the current CPU's VMCS region, or -1 if none. */
213 static physaddr_t vmcs_get_current(void)
214 {
215         physaddr_t vmcs_paddr;
216         /* RAX contains the addr of the location to store the VMCS pointer.  The
217          * compiler doesn't know the ASM will deref that pointer, hence the =m */
218         asm volatile (ASM_VMX_VMPTRST_RAX : "=m"(vmcs_paddr) : "a"(&vmcs_paddr));
219         return vmcs_paddr;
220 }
221
222 __always_inline unsigned long vmcs_readl(unsigned long field)
223 {
224         unsigned long value;
225
226         asm volatile (ASM_VMX_VMREAD_RDX_RAX
227                       : "=a"(value) : "d"(field) : "cc");
228         return value;
229 }
230
231 __always_inline uint16_t vmcs_read16(unsigned long field)
232 {
233         return vmcs_readl(field);
234 }
235
236 static __always_inline uint32_t vmcs_read32(unsigned long field)
237 {
238         return vmcs_readl(field);
239 }
240
241 static __always_inline uint64_t vmcs_read64(unsigned long field)
242 {
243 #ifdef CONFIG_X86_64
244         return vmcs_readl(field);
245 #else
246         return vmcs_readl(field) | ((uint64_t)vmcs_readl(field+1) << 32);
247 #endif
248 }
249
250 void vmwrite_error(unsigned long field, unsigned long value)
251 {
252         printk("vmwrite error: reg %lx value %lx (err %d)\n",
253                field, value, vmcs_read32(VM_INSTRUCTION_ERROR));
254 }
255
256 void vmcs_writel(unsigned long field, unsigned long value)
257 {
258         uint8_t error;
259
260         asm volatile (ASM_VMX_VMWRITE_RAX_RDX "; setna %0"
261                        : "=q"(error) : "a"(value), "d"(field) : "cc");
262         if (error)
263                 vmwrite_error(field, value);
264 }
265
266 static void vmcs_write16(unsigned long field, uint16_t value)
267 {
268         vmcs_writel(field, value);
269 }
270
271 static void vmcs_write32(unsigned long field, uint32_t value)
272 {
273         vmcs_writel(field, value);
274 }
275
276 static void vmcs_write64(unsigned long field, uint64_t value)
277 {
278         vmcs_writel(field, value);
279 }
280
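/* Used by setup_vmcs_config() below: ctl_min holds control bits we require,
 * ctl_opt holds bits we would like, and the capability MSR tells us which
 * bits may be 0 and which may be 1.  Worked example with made-up MSR values:
 * if the MSR reads low = 0x00000016 (must-be-one bits) and high = 0x0000fff7
 * (may-be-one bits), then ctl = ((ctl_min | ctl_opt) & 0x0000fff7) |
 * 0x00000016.  If any ctl_min bit is still missing from ctl, the hardware
 * cannot give us what we require and we return -EIO. */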
281 static int adjust_vmx_controls(uint32_t ctl_min, uint32_t ctl_opt,
282                                       uint32_t msr, uint32_t *result)
283 {
284         uint32_t vmx_msr_low, vmx_msr_high;
285         uint32_t ctl = ctl_min | ctl_opt;
286         uint64_t vmx_msr = read_msr(msr);
287         vmx_msr_low = vmx_msr;
288         vmx_msr_high = vmx_msr>>32;
289
290         ctl &= vmx_msr_high; /* bit == 0 in high word ==> must be zero */
291         ctl |= vmx_msr_low;  /* bit == 1 in low word  ==> must be one  */
292
293         /* Ensure minimum (required) set of control bits are supported. */
294         if (ctl_min & ~ctl) {
295                 return -EIO;
296         }
297
298         *result = ctl;
299         return 0;
300 }
301
302 static  bool allow_1_setting(uint32_t msr, uint32_t ctl)
303 {
304         uint32_t vmx_msr_low, vmx_msr_high;
305
306         rdmsr(msr, vmx_msr_low, vmx_msr_high);
307         return vmx_msr_high & ctl;
308 }
309
310 static  void setup_vmcs_config(void *p)
311 {
312         int *ret = p;
313         struct vmcs_config *vmcs_conf = &vmcs_config;
314         uint32_t vmx_msr_low, vmx_msr_high;
315         uint32_t min, opt, min2, opt2;
316         uint32_t _pin_based_exec_control = 0;
317         uint32_t _cpu_based_exec_control = 0;
318         uint32_t _cpu_based_2nd_exec_control = 0;
319         uint32_t _vmexit_control = 0;
320         uint32_t _vmentry_control = 0;
321
322         *ret = -EIO;
323         min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
324         opt = PIN_BASED_VIRTUAL_NMIS;
325         if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
326                                 &_pin_based_exec_control) < 0) {
327                 return;
328         }
329
330         min =
331               CPU_BASED_CR8_LOAD_EXITING |
332               CPU_BASED_CR8_STORE_EXITING |
333               CPU_BASED_CR3_LOAD_EXITING |
334               CPU_BASED_CR3_STORE_EXITING |
335               CPU_BASED_MOV_DR_EXITING |
336               CPU_BASED_USE_TSC_OFFSETING |
337               CPU_BASED_MWAIT_EXITING |
338               CPU_BASED_MONITOR_EXITING |
339               CPU_BASED_INVLPG_EXITING;
340
341         min |= CPU_BASED_HLT_EXITING;
342
343         opt = CPU_BASED_TPR_SHADOW |
344               CPU_BASED_USE_MSR_BITMAPS |
345               CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
346         if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PROCBASED_CTLS,
347                                 &_cpu_based_exec_control) < 0) {
348                 return;
349         }
350
351         if ((_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
352                 _cpu_based_exec_control &= ~CPU_BASED_CR8_LOAD_EXITING &
353                                            ~CPU_BASED_CR8_STORE_EXITING;
354
355         if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) {
356                 min2 = 
357                         SECONDARY_EXEC_ENABLE_EPT |
358                         SECONDARY_EXEC_UNRESTRICTED_GUEST;
359                 opt2 =  SECONDARY_EXEC_WBINVD_EXITING |
360                         SECONDARY_EXEC_RDTSCP |
361                         SECONDARY_EXEC_ENABLE_INVPCID;
362                 if (adjust_vmx_controls(min2, opt2,
363                                         MSR_IA32_VMX_PROCBASED_CTLS2,
364                                         &_cpu_based_2nd_exec_control) < 0) {
365                         return;
366                 }
367         }
368
369         if (!(_cpu_based_2nd_exec_control &
370                                 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
371                 _cpu_based_exec_control &= ~CPU_BASED_TPR_SHADOW;
372
373         if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
374                 /* CR3 accesses and invlpg don't need to cause VM Exits when EPT
375                    is enabled */
376                 _cpu_based_exec_control &= ~(CPU_BASED_CR3_LOAD_EXITING |
377                                              CPU_BASED_CR3_STORE_EXITING |
378                                              CPU_BASED_INVLPG_EXITING);
379                 rdmsr(MSR_IA32_VMX_EPT_VPID_CAP,
380                       vmx_capability.ept, vmx_capability.vpid);
381         }
382
383         min = 0;
384
385         min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
386
387 //      opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
388         opt = 0;
389         if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
390                                 &_vmexit_control) < 0) {
391                 return;
392         }
393
394         min = 0;
395 //      opt = VM_ENTRY_LOAD_IA32_PAT;
396         opt = 0;
397         if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS,
398                                 &_vmentry_control) < 0) {
399                 return;
400         }
401
402         rdmsr(MSR_IA32_VMX_BASIC, vmx_msr_low, vmx_msr_high);
403
404         /* IA-32 SDM Vol 3B: VMCS size is never greater than 4kB. */
405         if ((vmx_msr_high & 0x1fff) > PAGE_SIZE) {
406                 return;
407         }
408
409         /* IA-32 SDM Vol 3B: 64-bit CPUs always have VMX_BASIC_MSR[48]==0. */
410         if (vmx_msr_high & (1u<<16)) {
411                 printk("64-bit CPUs always have VMX_BASIC_MSR[48]==0. FAILS!\n");
412                 return;
413         }
414
415         /* Require Write-Back (WB) memory type for VMCS accesses. */
416         if (((vmx_msr_high >> 18) & 15) != 6) {
417                 printk("NO WB!\n");
418                 return;
419         }
420
421         vmcs_conf->size = vmx_msr_high & 0x1fff;
422         vmcs_conf->order = LOG2_UP(nr_pages(vmcs_config.size));
423         vmcs_conf->revision_id = vmx_msr_low;
424         printk("vmcs_conf size %d order %d rev %d\n",
425                vmcs_conf->size, vmcs_conf->order,
426                vmcs_conf->revision_id);
427
428         vmcs_conf->pin_based_exec_ctrl = _pin_based_exec_control;
429         vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
430         vmcs_conf->cpu_based_2nd_exec_ctrl = _cpu_based_2nd_exec_control;
431         vmcs_conf->vmexit_ctrl         = _vmexit_control;
432         vmcs_conf->vmentry_ctrl        = _vmentry_control;
433
434         vmx_capability.has_load_efer =
435                 allow_1_setting(MSR_IA32_VMX_ENTRY_CTLS,
436                                 VM_ENTRY_LOAD_IA32_EFER)
437                 && allow_1_setting(MSR_IA32_VMX_EXIT_CTLS,
438                                    VM_EXIT_LOAD_IA32_EFER);
439
440         /* Now that we've done all the setup we can do, verify
441          * that we have all the capabilities we need. These tests
442          * are done last presumably because all the work done above
443          * affects some of them.
444          */
445
446         if (!vmx_capability.has_load_efer) {
447                 printk("CPU lacks ability to load EFER register\n");
448                 return;
449         }
450
451         *ret = 0;
452 }
453
454 static struct vmcs *__vmx_alloc_vmcs(int node)
455 {
456         struct vmcs *vmcs;
457
458         vmcs = get_cont_pages_node(node, vmcs_config.order, KMALLOC_WAIT);
459         if (!vmcs)
460                 return 0;
461         memset(vmcs, 0, vmcs_config.size);
462         vmcs->revision_id = vmcs_config.revision_id;    /* vmcs revision id */
463         printd("%d: set rev id %d\n", core_id(), vmcs->revision_id);
464         return vmcs;
465 }
466
467 /**
468  * vmx_alloc_vmcs - allocates a VMCS region
469  *
470  * NOTE: Assumes the new region will be used by the current CPU.
471  *
472  * Returns a valid VMCS region.
473  */
474 static struct vmcs *vmx_alloc_vmcs(void)
475 {
476         return __vmx_alloc_vmcs(node_id());
477 }
478
479 /**
480  * vmx_free_vmcs - frees a VMCS region
481  */
482 static void vmx_free_vmcs(struct vmcs *vmcs)
483 {
484   //free_pages((unsigned long)vmcs, vmcs_config.order);
485 }
486
487 /*
488  * Set up the vmcs's constant host-state fields, i.e., host-state fields that
489  * will not change in the lifetime of the guest.
490  * Note that host-state that does change is set elsewhere. E.g., host-state
491  * that is set differently for each CPU is set in vmx_vcpu_load(), not here.
492  */
493 static void vmx_setup_constant_host_state(void)
494 {
495         uint32_t low32, high32;
496         unsigned long tmpl;
497         pseudodesc_t dt;
498
499         vmcs_writel(HOST_CR0, rcr0() & ~X86_CR0_TS);  /* 22.2.3 */
500         vmcs_writel(HOST_CR4, rcr4());  /* 22.2.3, 22.2.5 */
501         vmcs_writel(HOST_CR3, rcr3());  /* 22.2.3 */
502
503         vmcs_write16(HOST_CS_SELECTOR, GD_KT);  /* 22.2.4 */
504         vmcs_write16(HOST_DS_SELECTOR, GD_KD);  /* 22.2.4 */
505         vmcs_write16(HOST_ES_SELECTOR, GD_KD);  /* 22.2.4 */
506         vmcs_write16(HOST_SS_SELECTOR, GD_KD);  /* 22.2.4 */
507         vmcs_write16(HOST_TR_SELECTOR, GD_TSS);  /* 22.2.4 */
508
509         native_store_idt(&dt);
510         vmcs_writel(HOST_IDTR_BASE, dt.pd_base);   /* 22.2.4 */
511
512         asm("mov $.Lkvm_vmx_return, %0" : "=r"(tmpl));
513         vmcs_writel(HOST_RIP, tmpl); /* 22.2.5 */
514
515         rdmsr(MSR_IA32_SYSENTER_CS, low32, high32);
516         vmcs_write32(HOST_IA32_SYSENTER_CS, low32);
517         rdmsrl(MSR_IA32_SYSENTER_EIP, tmpl);
518         vmcs_writel(HOST_IA32_SYSENTER_EIP, tmpl);   /* 22.2.3 */
519
520         rdmsr(MSR_EFER, low32, high32);
521         vmcs_write32(HOST_IA32_EFER, low32);
522
523         if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT) {
524                 rdmsr(MSR_IA32_CR_PAT, low32, high32);
525                 vmcs_write64(HOST_IA32_PAT, low32 | ((uint64_t) high32 << 32));
526         }
527
528         vmcs_write16(HOST_FS_SELECTOR, 0);            /* 22.2.4 */
529         vmcs_write16(HOST_GS_SELECTOR, 0);            /* 22.2.4 */
530
531         /* TODO: This (at least gs) is per cpu */
532         rdmsrl(MSR_FS_BASE, tmpl);
533         vmcs_writel(HOST_FS_BASE, tmpl); /* 22.2.4 */
534         rdmsrl(MSR_GS_BASE, tmpl);
535         vmcs_writel(HOST_GS_BASE, tmpl); /* 22.2.4 */
536 }
537
538 static inline uint16_t vmx_read_ldt(void)
539 {
540         uint16_t ldt;
541         asm("sldt %0" : "=g"(ldt));
542         return ldt;
543 }
544
545 static unsigned long segment_base(uint16_t selector)
546 {
547         pseudodesc_t *gdt = &currentcpu->host_gdt;
548         struct desc_struct *d;
549         unsigned long table_base;
550         unsigned long v;
551
552         if (!(selector & ~3)) {
553                 return 0;
554         }
555
556         table_base = gdt->pd_base;
557
558         if (selector & 4) {           /* from ldt */
559                 uint16_t ldt_selector = vmx_read_ldt();
560
561                 if (!(ldt_selector & ~3)) {
562                         return 0;
563                 }
564
565                 table_base = segment_base(ldt_selector);
566         }
567         d = (struct desc_struct *)(table_base + (selector & ~7));
568         v = get_desc_base(d);
569 #ifdef CONFIG_X86_64
570        if (d->s == 0 && (d->type == 2 || d->type == 9 || d->type == 11))
571                v |= ((unsigned long)((struct ldttss_desc64 *)d)->base3) << 32;
572 #endif
573         return v;
574 }
575
576 static inline unsigned long vmx_read_tr_base(void)
577 {
578         uint16_t tr;
579         asm("str %0" : "=g"(tr));
580         return segment_base(tr);
581 }
582
583 static void __vmx_setup_cpu(void)
584 {
585         pseudodesc_t *gdt = &currentcpu->host_gdt;
586         unsigned long sysenter_esp;
587         unsigned long tmpl;
588
589         /*
590          * Linux uses per-cpu TSS and GDT, so set these when switching
591          * processors.
592          */
593         vmcs_writel(HOST_TR_BASE, vmx_read_tr_base()); /* 22.2.4 */
594         vmcs_writel(HOST_GDTR_BASE, gdt->pd_base);   /* 22.2.4 */
595
596         rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
597         vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
598
599         rdmsrl(MSR_FS_BASE, tmpl);
600         vmcs_writel(HOST_FS_BASE, tmpl); /* 22.2.4 */
601         rdmsrl(MSR_GS_BASE, tmpl);
602         vmcs_writel(HOST_GS_BASE, tmpl); /* 22.2.4 */
603 }
604
605 /**
606  * vmx_get_cpu - called before using a cpu
607  * @vcpu: VCPU that will be loaded.
608  *
609  * Disables preemption. Call vmx_put_cpu() when finished.
610  */
611 static void vmx_get_cpu(struct vmx_vcpu *vcpu)
612 {
613         int cur_cpu = core_id();
614         handler_wrapper_t *w;
615
616         if (currentcpu->local_vcpu)
617                 panic("get_cpu: currentcpu->local_vcpu was non-NULL");
618         if (currentcpu->local_vcpu != vcpu) {
619                 currentcpu->local_vcpu = vcpu;
620
621                 if (vcpu->cpu != cur_cpu) {
622                         if (vcpu->cpu >= 0) {
623                                 panic("vcpu->cpu is not -1, it's %d\n", vcpu->cpu);
624                         } else
625                                 vmcs_clear(vcpu->vmcs);
626
627                         ept_sync_context(vcpu_get_eptp(vcpu));
628
629                         vcpu->launched = 0;
630                         vmcs_load(vcpu->vmcs);
631                         __vmx_setup_cpu();
632                         vcpu->cpu = cur_cpu;
633                 } else {
634                         vmcs_load(vcpu->vmcs);
635                 }
636         }
637 }
638
639 /**
640  * vmx_put_cpu - called after using a cpu
641  * @vcpu: VCPU that was loaded.
642  */
643 static void vmx_put_cpu(struct vmx_vcpu *vcpu)
644 {
645         if (core_id() != vcpu->cpu)
646                 panic("%s: core_id() %d != vcpu->cpu %d\n",
647                       __func__, core_id(), vcpu->cpu);
648
649         if (currentcpu->local_vcpu != vcpu)
650                 panic("vmx_put_cpu: asked to clear something not ours");
651
652         ept_sync_context(vcpu_get_eptp(vcpu));
653         vmcs_clear(vcpu->vmcs);
654         vcpu->cpu = -1;
655         currentcpu->local_vcpu = NULL;
656         //put_cpu();
657 }
658
659 /**
660  * vmx_dump_cpu - prints the CPU state
661  * @vcpu: VCPU to print
662  */
663 static void vmx_dump_cpu(struct vmx_vcpu *vcpu)
664 {
665
666         unsigned long flags;
667
668         vmx_get_cpu(vcpu);
669         vcpu->regs.tf_rip = vmcs_readl(GUEST_RIP);
670         vcpu->regs.tf_rsp = vmcs_readl(GUEST_RSP);
671         flags = vmcs_readl(GUEST_RFLAGS);
672         vmx_put_cpu(vcpu);
673
674         printk("--- Begin VCPU Dump ---\n");
675         printk("CPU %d VPID %d\n", vcpu->cpu, 0);
676         printk("RIP 0x%016lx RFLAGS 0x%08lx\n",
677                vcpu->regs.tf_rip, flags);
678         printk("RAX 0x%016lx RCX 0x%016lx\n",
679                 vcpu->regs.tf_rax, vcpu->regs.tf_rcx);
680         printk("RDX 0x%016lx RBX 0x%016lx\n",
681                 vcpu->regs.tf_rdx, vcpu->regs.tf_rbx);
682         printk("RSP 0x%016lx RBP 0x%016lx\n",
683                 vcpu->regs.tf_rsp, vcpu->regs.tf_rbp);
684         printk("RSI 0x%016lx RDI 0x%016lx\n",
685                 vcpu->regs.tf_rsi, vcpu->regs.tf_rdi);
686         printk("R8  0x%016lx R9  0x%016lx\n",
687                 vcpu->regs.tf_r8, vcpu->regs.tf_r9);
688         printk("R10 0x%016lx R11 0x%016lx\n",
689                 vcpu->regs.tf_r10, vcpu->regs.tf_r11);
690         printk("R12 0x%016lx R13 0x%016lx\n",
691                 vcpu->regs.tf_r12, vcpu->regs.tf_r13);
692         printk("R14 0x%016lx R15 0x%016lx\n",
693                 vcpu->regs.tf_r14, vcpu->regs.tf_r15);
694         printk("--- End VCPU Dump ---\n");
695
696 }
697
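/* EPTP layout, per the Intel SDM: bits 2:0 are the memory type (6 == WB),
 * bits 5:3 are the page-walk length minus one (3 for a 4-level walk), bit 6
 * enables the accessed/dirty flags, and bits 51:12 hold the physical address
 * of the EPT PML4.  Worked example with an assumed root_hpa of 0x123000 on a
 * machine with EPT A/D support:
 *      eptp = 6 | (3 << 3) | (1 << 6) | 0x123000 = 0x12305e. */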
698 uint64_t construct_eptp(physaddr_t root_hpa)
699 {
700         uint64_t eptp;
701
702         /* set WB memory and 4 levels of walk.  we checked these in ept_init */
703         eptp = VMX_EPT_MEM_TYPE_WB |
704                (VMX_EPT_GAW_4_LVL << VMX_EPT_GAW_EPTP_SHIFT);
705         if (cpu_has_vmx_ept_ad_bits())
706                 eptp |= VMX_EPT_AD_ENABLE_BIT;
707         eptp |= (root_hpa & PAGE_MASK);
708
709         return eptp;
710 }
711
712 /**
713  * vmx_setup_initial_guest_state - configures the initial state of guest registers
714  */
715 static void vmx_setup_initial_guest_state(void)
716 {
717         unsigned long tmpl;
718         unsigned long cr4 = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSXMMEXCPT |
719                             X86_CR4_PGE | X86_CR4_OSFXSR;
720         uint32_t protected_mode = X86_CR0_PG | X86_CR0_PE;
721 #if 0
722         do we need it
723         if (boot_cpu_has(X86_FEATURE_PCID))
724                 cr4 |= X86_CR4_PCIDE;
725         if (boot_cpu_has(X86_FEATURE_OSXSAVE))
726                 cr4 |= X86_CR4_OSXSAVE;
727 #endif
728         /* we almost certainly have this */
729         /* we'll go sour if we don't. */
730         if (1) //boot_cpu_has(X86_FEATURE_FSGSBASE))
731                 cr4 |= X86_CR4_RDWRGSFS;
732
733         /* configure control and data registers */
734         vmcs_writel(GUEST_CR0, protected_mode | X86_CR0_WP |
735                                X86_CR0_MP | X86_CR0_ET | X86_CR0_NE);
736         vmcs_writel(CR0_READ_SHADOW, protected_mode | X86_CR0_WP |
737                                      X86_CR0_MP | X86_CR0_ET | X86_CR0_NE);
738         vmcs_writel(GUEST_CR3, rcr3());
739         vmcs_writel(GUEST_CR4, cr4);
740         vmcs_writel(CR4_READ_SHADOW, cr4);
741         vmcs_writel(GUEST_IA32_EFER, EFER_LME | EFER_LMA |
742                                      EFER_SCE | EFER_FFXSR);
743         vmcs_writel(GUEST_GDTR_BASE, 0);
744         vmcs_writel(GUEST_GDTR_LIMIT, 0);
745         vmcs_writel(GUEST_IDTR_BASE, 0);
746         vmcs_writel(GUEST_IDTR_LIMIT, 0);
747         vmcs_writel(GUEST_RIP, 0xdeadbeef);
748         vmcs_writel(GUEST_RSP, 0xdeadbeef);
749         vmcs_writel(GUEST_RFLAGS, 0x02);
750         vmcs_writel(GUEST_DR7, 0);
751
752         /* guest segment bases */
753         vmcs_writel(GUEST_CS_BASE, 0);
754         vmcs_writel(GUEST_DS_BASE, 0);
755         vmcs_writel(GUEST_ES_BASE, 0);
756         vmcs_writel(GUEST_GS_BASE, 0);
757         vmcs_writel(GUEST_SS_BASE, 0);
758         rdmsrl(MSR_FS_BASE, tmpl);
759         vmcs_writel(GUEST_FS_BASE, tmpl);
760
761         /* guest segment access rights */
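        /* Access-rights encoding (Intel SDM): 0xA09B = present, DPL 0, code
         * segment (type 0xb), 64-bit (L=1), 4K granularity; 0xA093 is the
         * same but a read/write data segment (type 3). */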
762         vmcs_writel(GUEST_CS_AR_BYTES, 0xA09B);
763         vmcs_writel(GUEST_DS_AR_BYTES, 0xA093);
764         vmcs_writel(GUEST_ES_AR_BYTES, 0xA093);
765         vmcs_writel(GUEST_FS_AR_BYTES, 0xA093);
766         vmcs_writel(GUEST_GS_AR_BYTES, 0xA093);
767         vmcs_writel(GUEST_SS_AR_BYTES, 0xA093);
768
769         /* guest segment limits */
770         vmcs_write32(GUEST_CS_LIMIT, 0xFFFFFFFF);
771         vmcs_write32(GUEST_DS_LIMIT, 0xFFFFFFFF);
772         vmcs_write32(GUEST_ES_LIMIT, 0xFFFFFFFF);
773         vmcs_write32(GUEST_FS_LIMIT, 0xFFFFFFFF);
774         vmcs_write32(GUEST_GS_LIMIT, 0xFFFFFFFF);
775         vmcs_write32(GUEST_SS_LIMIT, 0xFFFFFFFF);
776
777         /* configure segment selectors */
778         vmcs_write16(GUEST_CS_SELECTOR, 0);
779         vmcs_write16(GUEST_DS_SELECTOR, 0);
780         vmcs_write16(GUEST_ES_SELECTOR, 0);
781         vmcs_write16(GUEST_FS_SELECTOR, 0);
782         vmcs_write16(GUEST_GS_SELECTOR, 0);
783         vmcs_write16(GUEST_SS_SELECTOR, 0);
784         vmcs_write16(GUEST_TR_SELECTOR, 0);
785
786         /* guest LDTR */
787         vmcs_write16(GUEST_LDTR_SELECTOR, 0);
788         vmcs_writel(GUEST_LDTR_AR_BYTES, 0x0082);
789         vmcs_writel(GUEST_LDTR_BASE, 0);
790         vmcs_writel(GUEST_LDTR_LIMIT, 0);
791
792         /* guest TSS */
793         vmcs_writel(GUEST_TR_BASE, 0);
794         vmcs_writel(GUEST_TR_AR_BYTES, 0x0080 | AR_TYPE_BUSY_64_TSS);
795         vmcs_writel(GUEST_TR_LIMIT, 0xff);
796
797         /* initialize sysenter */
798         vmcs_write32(GUEST_SYSENTER_CS, 0);
799         vmcs_writel(GUEST_SYSENTER_ESP, 0);
800         vmcs_writel(GUEST_SYSENTER_EIP, 0);
801
802         /* other random initialization */
803         vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
804         vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
805         vmcs_write32(GUEST_PENDING_DBG_EXCEPTIONS, 0);
806         vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
807         vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);  /* 22.2.1 */
808 }
809
810 static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, uint32_t msr)
811 {
812         int f = sizeof(unsigned long);
813         /*
814          * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
815          * have the write-low and read-high bitmap offsets the wrong way round.
816          * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
817          */
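        /* Worked example: MSR_GS_BASE is 0xc0000101, which is in the high
         * range; after masking with 0x1fff we clear bit 0x101 in the
         * read-high bitmap (offset 0x400) and in the write-high bitmap
         * (offset 0xc00), so guest rdmsr/wrmsr of the GS base no longer
         * cause an exit. */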
818         if (msr <= 0x1fff) {
819                 __clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
820                 __clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
821         } else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
822                 msr &= 0x1fff;
823                 __clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
824                 __clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
825         }
826 }
827
828 static void setup_msr(struct vmx_vcpu *vcpu)
829 {
830         int set[] = { MSR_LSTAR };
831         struct vmx_msr_entry *e;
832         int sz = sizeof(set) / sizeof(*set);
833         int i;
834
835         //BUILD_BUG_ON(sz > NR_AUTOLOAD_MSRS);
836
837         vcpu->msr_autoload.nr = sz;
838
839         /* XXX enable only MSRs in set */
840         vmcs_write64(MSR_BITMAP, PADDR(msr_bitmap));
841
842         vmcs_write32(VM_EXIT_MSR_STORE_COUNT, vcpu->msr_autoload.nr);
843         vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vcpu->msr_autoload.nr);
844         vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vcpu->msr_autoload.nr);
845
846         vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, PADDR(vcpu->msr_autoload.host));
847         vmcs_write64(VM_EXIT_MSR_STORE_ADDR, PADDR(vcpu->msr_autoload.guest));
848         vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, PADDR(vcpu->msr_autoload.guest));
849
850         for (i = 0; i < sz; i++) {
851                 uint64_t val;
852
853                 e = &vcpu->msr_autoload.host[i];
854                 e->index = set[i];
855                 __vmx_disable_intercept_for_msr(msr_bitmap, e->index);
856                 rdmsrl(e->index, val);
857                 e->value = val;
858
859                 e = &vcpu->msr_autoload.guest[i];
860                 e->index = set[i];
861                 e->value = 0xDEADBEEF;
862         }
863 }
864
865 /**
866  *  vmx_setup_vmcs - configures the vmcs with starting parameters
867  */
868 static void vmx_setup_vmcs(struct vmx_vcpu *vcpu)
869 {
870         vmcs_write16(VIRTUAL_PROCESSOR_ID, 0);
871         vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
872
873         /* Control */
874         vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
875                 vmcs_config.pin_based_exec_ctrl);
876
877         vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
878                 vmcs_config.cpu_based_exec_ctrl);
879
880         if (cpu_has_secondary_exec_ctrls()) {
881                 vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
882                              vmcs_config.cpu_based_2nd_exec_ctrl);
883         }
884
885         vmcs_write64(EPT_POINTER, vcpu_get_eptp(vcpu));
886
887         vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
888         vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);
889         vmcs_write32(CR3_TARGET_COUNT, 0);           /* 22.2.1 */
890
891         setup_msr(vcpu);
892 #if 0
893         if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
894                 uint32_t msr_low, msr_high;
895                 uint64_t host_pat;
896                 rdmsr(MSR_IA32_CR_PAT, msr_low, msr_high);
897                 host_pat = msr_low | ((uint64_t) msr_high << 32);
898                 /* Write the default value follow host pat */
899                 vmcs_write64(GUEST_IA32_PAT, host_pat);
900                 /* Keep arch.pat sync with GUEST_IA32_PAT */
901                 vmx->vcpu.arch.pat = host_pat;
902         }
903 #endif
904 #if 0
905         for (int i = 0; i < NR_VMX_MSR; ++i) {
906                 uint32_t index = vmx_msr_index[i];
907                 uint32_t data_low, data_high;
908                 int j = vmx->nmsrs;
909                 // TODO we should have read/writemsr_safe
910 #if 0
911                 if (rdmsr_safe(index, &data_low, &data_high) < 0)
912                         continue;
913                 if (wrmsr_safe(index, data_low, data_high) < 0)
914                         continue;
915 #endif
916                 vmx->guest_msrs[j].index = i;
917                 vmx->guest_msrs[j].data = 0;
918                 vmx->guest_msrs[j].mask = -1ull;
919                 ++vmx->nmsrs;
920         }
921 #endif
922
923         vmcs_config.vmentry_ctrl |= VM_ENTRY_IA32E_MODE;
924
925         vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
926         vmcs_write32(VM_ENTRY_CONTROLS, vmcs_config.vmentry_ctrl);
927
928         vmcs_writel(CR0_GUEST_HOST_MASK, ~0ul);
929         vmcs_writel(CR4_GUEST_HOST_MASK, ~0ul);
930
931         //kvm_write_tsc(&vmx->vcpu, 0);
932         vmcs_writel(TSC_OFFSET, 0);
933
934         vmx_setup_constant_host_state();
935 }
936
937 /**
938  * vmx_create_vcpu - allocates and initializes a new virtual cpu
939  *
940  * Returns: A new VCPU structure
941  */
942 struct vmx_vcpu *vmx_create_vcpu(struct proc *p)
943 {
944         struct vmx_vcpu *vcpu = kmalloc(sizeof(struct vmx_vcpu), KMALLOC_WAIT);
945         if (!vcpu) {
946                 return NULL;
947         }
948
949         memset(vcpu, 0, sizeof(*vcpu));
950
951         vcpu->proc = p; /* uncounted (weak) reference */
952         vcpu->vmcs = vmx_alloc_vmcs();
953         printd("%d: vcpu->vmcs is %p\n", core_id(), vcpu->vmcs);
954         if (!vcpu->vmcs)
955                 goto fail_vmcs;
956
957         vcpu->cpu = -1;
958
959         vmx_get_cpu(vcpu);
960         vmx_setup_vmcs(vcpu);
961         vmx_setup_initial_guest_state();
962         vmx_put_cpu(vcpu);
963
964         return vcpu;
965
966 fail_vmcs:
967         kfree(vcpu);
968         return NULL;
969 }
970
971 /**
972  * vmx_destroy_vcpu - destroys and frees an existing virtual cpu
973  * @vcpu: the VCPU to destroy
974  */
975 void vmx_destroy_vcpu(struct vmx_vcpu *vcpu)
976 {
977         vmx_free_vmcs(vcpu->vmcs);
978         kfree(vcpu);
979 }
980
981 /**
982  * vmx_current_vcpu - returns a pointer to the vcpu for the current task.
983  *
984  * In the contexts where this is used the vcpu pointer should never be NULL.
985  */
986 static inline struct vmx_vcpu *vmx_current_vcpu(void)
987 {
988         struct vmx_vcpu *vcpu = currentcpu->local_vcpu;
989         if (!vcpu)
990                 panic("Core has no vcpu!");
991         return vcpu;
992 }
993
994 /**
995  * vmx_run_vcpu - launches the CPU into non-root mode
996  * We ONLY support 64-bit guests.
997  * @vcpu: the vmx instance to launch
998  */
999 static int vmx_run_vcpu(struct vmx_vcpu *vcpu)
1000 {
1001         asm(
1002                 /* Store host registers */
1003                 "push %%rdx; push %%rbp;"
1004                 "push %%rcx \n\t" /* placeholder for guest rcx */
1005                 "push %%rcx \n\t"
1006                 "cmp %%rsp, %c[host_rsp](%0) \n\t"
1007                 "je 1f \n\t"
1008                 "mov %%rsp, %c[host_rsp](%0) \n\t"
1009                 ASM_VMX_VMWRITE_RSP_RDX "\n\t"
1010                 "1: \n\t"
1011                 /* Reload cr2 if changed */
1012                 "mov %c[cr2](%0), %%rax \n\t"
1013                 "mov %%cr2, %%rdx \n\t"
1014                 "cmp %%rax, %%rdx \n\t"
1015                 "je 2f \n\t"
1016                 "mov %%rax, %%cr2 \n\t"
1017                 "2: \n\t"
1018                 /* Check if vmlaunch or vmresume is needed */
1019                 "cmpl $0, %c[launched](%0) \n\t"
1020                 /* Load guest registers.  Don't clobber flags. */
1021                 "mov %c[rax](%0), %%rax \n\t"
1022                 "mov %c[rbx](%0), %%rbx \n\t"
1023                 "mov %c[rdx](%0), %%rdx \n\t"
1024                 "mov %c[rsi](%0), %%rsi \n\t"
1025                 "mov %c[rdi](%0), %%rdi \n\t"
1026                 "mov %c[rbp](%0), %%rbp \n\t"
1027                 "mov %c[r8](%0),  %%r8  \n\t"
1028                 "mov %c[r9](%0),  %%r9  \n\t"
1029                 "mov %c[r10](%0), %%r10 \n\t"
1030                 "mov %c[r11](%0), %%r11 \n\t"
1031                 "mov %c[r12](%0), %%r12 \n\t"
1032                 "mov %c[r13](%0), %%r13 \n\t"
1033                 "mov %c[r14](%0), %%r14 \n\t"
1034                 "mov %c[r15](%0), %%r15 \n\t"
1035                 "mov %c[rcx](%0), %%rcx \n\t" /* kills %0 (ecx) */
1036
1037                 /* Enter guest mode */
1038                 "jne .Llaunched \n\t"
1039                 ASM_VMX_VMLAUNCH "\n\t"
1040                 "jmp .Lkvm_vmx_return \n\t"
1041                 ".Llaunched: " ASM_VMX_VMRESUME "\n\t"
1042                 ".Lkvm_vmx_return: "
1043                 /* Save guest registers, load host registers, keep flags */
1044                 "mov %0, %c[wordsize](%%rsp) \n\t"
1045                 "pop %0 \n\t"
1046                 "mov %%rax, %c[rax](%0) \n\t"
1047                 "mov %%rbx, %c[rbx](%0) \n\t"
1048                 "popq %c[rcx](%0) \n\t"
1049                 "mov %%rdx, %c[rdx](%0) \n\t"
1050                 "mov %%rsi, %c[rsi](%0) \n\t"
1051                 "mov %%rdi, %c[rdi](%0) \n\t"
1052                 "mov %%rbp, %c[rbp](%0) \n\t"
1053                 "mov %%r8,  %c[r8](%0) \n\t"
1054                 "mov %%r9,  %c[r9](%0) \n\t"
1055                 "mov %%r10, %c[r10](%0) \n\t"
1056                 "mov %%r11, %c[r11](%0) \n\t"
1057                 "mov %%r12, %c[r12](%0) \n\t"
1058                 "mov %%r13, %c[r13](%0) \n\t"
1059                 "mov %%r14, %c[r14](%0) \n\t"
1060                 "mov %%r15, %c[r15](%0) \n\t"
1061                 "mov %%rax, %%r10 \n\t"
1062                 "mov %%rdx, %%r11 \n\t"
1063
1064                 "mov %%cr2, %%rax   \n\t"
1065                 "mov %%rax, %c[cr2](%0) \n\t"
1066
1067                 "pop  %%rbp; pop  %%rdx \n\t"
1068                 "setbe %c[fail](%0) \n\t"
1069                 "mov $" STRINGIFY(GD_UD) ", %%rax \n\t"
1070                 "mov %%rax, %%ds \n\t"
1071                 "mov %%rax, %%es \n\t"
1072               : : "c"(vcpu), "d"((unsigned long)HOST_RSP),
1073                 [launched]"i"(offsetof(struct vmx_vcpu, launched)),
1074                 [fail]"i"(offsetof(struct vmx_vcpu, fail)),
1075                 [host_rsp]"i"(offsetof(struct vmx_vcpu, host_rsp)),
1076                 [rax]"i"(offsetof(struct vmx_vcpu, regs.tf_rax)),
1077                 [rbx]"i"(offsetof(struct vmx_vcpu, regs.tf_rbx)),
1078                 [rcx]"i"(offsetof(struct vmx_vcpu, regs.tf_rcx)),
1079                 [rdx]"i"(offsetof(struct vmx_vcpu, regs.tf_rdx)),
1080                 [rsi]"i"(offsetof(struct vmx_vcpu, regs.tf_rsi)),
1081                 [rdi]"i"(offsetof(struct vmx_vcpu, regs.tf_rdi)),
1082                 [rbp]"i"(offsetof(struct vmx_vcpu, regs.tf_rbp)),
1083                 [r8]"i"(offsetof(struct vmx_vcpu, regs.tf_r8)),
1084                 [r9]"i"(offsetof(struct vmx_vcpu, regs.tf_r9)),
1085                 [r10]"i"(offsetof(struct vmx_vcpu, regs.tf_r10)),
1086                 [r11]"i"(offsetof(struct vmx_vcpu, regs.tf_r11)),
1087                 [r12]"i"(offsetof(struct vmx_vcpu, regs.tf_r12)),
1088                 [r13]"i"(offsetof(struct vmx_vcpu, regs.tf_r13)),
1089                 [r14]"i"(offsetof(struct vmx_vcpu, regs.tf_r14)),
1090                 [r15]"i"(offsetof(struct vmx_vcpu, regs.tf_r15)),
1091                 [cr2]"i"(offsetof(struct vmx_vcpu, cr2)),
1092                 [wordsize]"i"(sizeof(unsigned long))
1093               : "cc", "memory"
1094                 , "rax", "rbx", "rdi", "rsi"
1095                 , "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
1096         );
1097
1098         vcpu->regs.tf_rip = vmcs_readl(GUEST_RIP);
1099         vcpu->regs.tf_rsp = vmcs_readl(GUEST_RSP);
1100         printk("RETURN. ip %016lx sp %016lx cr2 %016lx\n",
1101                vcpu->regs.tf_rip, vcpu->regs.tf_rsp, vcpu->cr2);
1102         /* FIXME: do we need to set up other flags? */
1103         vcpu->regs.tf_rflags = (vmcs_readl(GUEST_RFLAGS) & 0xFF) |
1104                       X86_EFLAGS_IF | 0x2;
1105
1106         vcpu->regs.tf_cs = GD_UT;
1107         vcpu->regs.tf_ss = GD_UD;
1108
1109         vcpu->launched = 1;
1110
1111         if (vcpu->fail) {
1112                 printk("failure detected (err %x)\n",
1113                        vmcs_read32(VM_INSTRUCTION_ERROR));
1114                 return VMX_EXIT_REASONS_FAILED_VMENTRY;
1115         }
1116
1117         return vmcs_read32(VM_EXIT_REASON);
1118
1119 #if 0
1120         vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
1121         vmx_complete_atomic_exit(vmx);
1122         vmx_recover_nmi_blocking(vmx);
1123         vmx_complete_interrupts(vmx);
1124 #endif
1125 }
1126
1127 static void vmx_step_instruction(void)
1128 {
1129         vmcs_writel(GUEST_RIP, vmcs_readl(GUEST_RIP) +
1130                                vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
1131 }
1132
1133 static int vmx_handle_ept_violation(struct vmx_vcpu *vcpu)
1134 {
1135         unsigned long gva, gpa;
1136         int exit_qual, ret = -1;
1137         page_t *page;
1138
1139         vmx_get_cpu(vcpu);
1140         exit_qual = vmcs_read32(EXIT_QUALIFICATION);
1141         gva = vmcs_readl(GUEST_LINEAR_ADDRESS);
1142         gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
1143
1144         vmx_put_cpu(vcpu);
1145
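        /* Exit qualification bits for an EPT violation (Intel SDM): bit 0 is
         * set for a data read, bit 1 for a data write, and bit 2 for an
         * instruction fetch.  Translate those into PROT_* flags and let the
         * normal page fault path fill in the missing EPT entry. */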
1146         int prot = 0;
1147         prot |= exit_qual & VMX_EPT_FAULT_READ ? PROT_READ : 0;
1148         prot |= exit_qual & VMX_EPT_FAULT_WRITE ? PROT_WRITE : 0;
1149         prot |= exit_qual & VMX_EPT_FAULT_INS ? PROT_EXEC : 0;
1150         ret = handle_page_fault(current, gpa, prot);
1151
1152         if (ret) {
1153                 printk("EPT page fault failure GPA: %p, GVA: %p\n", gpa, gva);
1154                 vmx_dump_cpu(vcpu);
1155         }
1156
1157         return ret;
1158 }
1159
1160 static void vmx_handle_cpuid(struct vmx_vcpu *vcpu)
1161 {
1162         unsigned int eax, ebx, ecx, edx;
1163
1164         eax = vcpu->regs.tf_rax;
1165         ecx = vcpu->regs.tf_rcx;
1166         cpuid(eax, ecx, &eax, &ebx, &ecx, &edx);
1167         vcpu->regs.tf_rax = eax;
1168         vcpu->regs.tf_rbx = ebx;
1169         vcpu->regs.tf_rcx = ecx;
1170         vcpu->regs.tf_rdx = edx;
1171 }
1172
1173 static int vmx_handle_nmi_exception(struct vmx_vcpu *vcpu)
1174 {
1175         uint32_t intr_info;
1176
1177         vmx_get_cpu(vcpu);
1178         intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
1179         vmx_put_cpu(vcpu);
1180
1181         printk("vmx (vcpu %p): got an exception\n", vcpu);
1182         printk("vmx (vcpu %p): pid %d\n", vcpu, vcpu->proc->pid);
1183         if ((intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR) {
1184                 return 0;
1185         }
1186
1187         printk("unhandled nmi, intr_info %x\n", intr_info);
1188         return -EIO;
1189 }
1190
1191
1192 static void noop(void) {
1193         __asm__ __volatile__ ("1: jmp 1b");
1194 }
1195
1196 static void fail(void) {
1197         __asm__ __volatile__ ("movq $0xdeadbeef, %rbx; movq 0, %rax");
1198 }
1199
1200 static unsigned long stack[512];
1201 /**
1202  * vmx_launch - the main loop for a VMX Dune process
1203  * @rip, @rsp, @cr3: the initial guest rip, rsp, and cr3
1204  */
1205 int vmx_launch(uint64_t rip, uint64_t rsp, uint64_t cr3)
1206 {
1207         int ret;
1208         struct vmx_vcpu *vcpu;
1209         int i = 0;
1210         int errors = 0;
1211
1212         if (rip < 4096 ) {
1213                 // testing.
1214                 switch(rip) {
1215                 default:
1216                         rip = (uint64_t)noop + 4;
1217                         break;
1218                 case 1:
1219                         rip = (uint64_t)fail + 4;
1220                         break;
1221                 }
1222         }
1223
1224         if (cr3 == 0) {
1225                 cr3 = rcr3();
1226         }
1227
1228         /* sanity checking.  -- later
1229         ret = ept_check_page(ept, rip);
1230         if (ret) {
1231                 printk("0x%x is not mapped in the ept!\n", rip);
1232                 errors++;
1233         }
1234         ret = ept_check_page(ept, rsp);
1235         if (ret) {
1236                 printk("0x%x is not mapped in the ept!\n", rsp);
1237                 errors++;
1238         }
1239         */
1240         if (errors) {
1241                 return -EINVAL;
1242         }
1243
1244
1245         printk("RUNNING: %s: rip %p rsp %p cr3 %p \n",
1246                __func__, rip, rsp, cr3);
1247         /* TODO: dirty hack til we have VMM contexts */
1248         vcpu = current->vmm.guest_pcores[0];
1249         if (!vcpu) {
1250                 printk("Failed to get a CPU!\n");
1251                 return -ENOMEM;
1252         }
1253
1254         vmx_get_cpu(vcpu);
1255         vmcs_writel(GUEST_RIP, rip);
1256         vmcs_writel(GUEST_RSP, rsp);
1257         vmcs_writel(GUEST_CR3, cr3);
1258         vmx_put_cpu(vcpu);
1259
1260         vcpu->ret_code = -1;
1261
1262         while (1) {
1263                 vmx_get_cpu(vcpu);
1264
1265                 // TODO: manage the fpu when we restart.
1266
1267                 // TODO: see if we need to exit before we go much further.
1268                 disable_irq();
1269                 ret = vmx_run_vcpu(vcpu);
1270                 enable_irq();
1271                 vmx_put_cpu(vcpu);
1272
1273                 if (ret == EXIT_REASON_VMCALL) {
1274                         vcpu->shutdown = SHUTDOWN_UNHANDLED_EXIT_REASON;
1275                         printk("system call! WTF\n");
1276                 } else if (ret == EXIT_REASON_CPUID)
1277                         vmx_handle_cpuid(vcpu);
1278                 else if (ret == EXIT_REASON_EPT_VIOLATION) {
1279                         if (vmx_handle_ept_violation(vcpu))
1280                                 vcpu->shutdown = SHUTDOWN_EPT_VIOLATION;
1281                 } else if (ret == EXIT_REASON_EXCEPTION_NMI) {
1282                         if (vmx_handle_nmi_exception(vcpu))
1283                                 vcpu->shutdown = SHUTDOWN_NMI_EXCEPTION;
1284                 } else if (ret == EXIT_REASON_EXTERNAL_INTERRUPT) {
1285                         printk("External interrupt\n");
1286                 } else {
1287                         printk("unhandled exit: reason 0x%x, exit qualification 0x%x\n",
1288                                ret, vmcs_read32(EXIT_QUALIFICATION));
1289                         vmx_dump_cpu(vcpu);
1290                         vcpu->shutdown = SHUTDOWN_UNHANDLED_EXIT_REASON;
1291                 }
1292
1293                 /* TODO: we can't just return and relaunch the VMCS, in case we blocked.
1294                  * similar to how proc_restartcore/smp_idle only restart the pcpui
1295                  * cur_ctx, we need to do the same, via the VMCS resume business. */
1296
1297                 if (vcpu->shutdown)
1298                         break;
1299         }
1300
1301         printk("RETURN. ip %016lx sp %016lx\n",
1302                 vcpu->regs.tf_rip, vcpu->regs.tf_rsp);
1303
1304         /*
1305          * Return both the reason for the shutdown and a status value.
1306          * The exit() and exit_group() system calls only need 8 bits for
1307          * the status but we allow 16 bits in case we might want to
1308          * return more information for one of the other shutdown reasons.
1309          */
1310         ret = (vcpu->shutdown << 16) | (vcpu->ret_code & 0xffff);
1311
1312         return ret;
1313 }
1314
1315 /**
1316  * __vmx_enable - low-level enable of VMX mode on the current CPU
1317  * @vmxon_buf: an opaque buffer for use as the VMXON region
1318  */
1319 static  int __vmx_enable(struct vmcs *vmxon_buf)
1320 {
1321         uint64_t phys_addr = PADDR(vmxon_buf);
1322         uint64_t old, test_bits;
1323
1324         if (rcr4() & X86_CR4_VMXE) {
1325                 panic("Should never have this happen");
1326                 return -EBUSY;
1327         }
1328
1329         rdmsrl(MSR_IA32_FEATURE_CONTROL, old);
1330
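        /* MSR_IA32_FEATURE_CONTROL: bit 0 is the lock bit and bit 2 allows
         * VMXON outside of SMX.  A typical BIOS leaves the MSR locked with
         * VMX enabled (e.g. a value of 0x5); if it is unlocked we set and
         * lock the bits we need ourselves below. */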
1331         test_bits = FEATURE_CONTROL_LOCKED;
1332         test_bits |= FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
1333
1334         if (0) // tboot_enabled())
1335                 test_bits |= FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX;
1336
1337         if ((old & test_bits) != test_bits) {
1338                 /* If it's locked, then trying to set it will cause a GPF.
1339                  * No Dune for you!
1340                  */
1341                 if (old & FEATURE_CONTROL_LOCKED) {
1342                         printk("Dune: MSR_IA32_FEATURE_CONTROL is locked!\n");
1343                         return -1;
1344                 }
1345
1346                 /* enable and lock */
1347                 write_msr(MSR_IA32_FEATURE_CONTROL, old | test_bits);
1348         }
1349         lcr4(rcr4() | X86_CR4_VMXE);
1350
1351         __vmxon(phys_addr);
1352         vpid_sync_vcpu_global();        /* good idea, even if we aren't using vpids */
1353         ept_sync_global();
1354
1355         return 0;
1356 }
1357
1358 /**
1359  * vmx_enable - enables VMX mode on the current CPU
1360  * (Takes no arguments; operates on the calling core.)
1361  *
1362  * Sets up necessary state for enable (e.g. a scratchpad for VMXON.)
1363  */
1364 static void vmx_enable(void)
1365 {
1366         struct vmcs *vmxon_buf = currentcpu->vmxarea;
1367         int ret;
1368
1369         ret = __vmx_enable(vmxon_buf);
1370         if (ret)
1371                 goto failed;
1372
1373         currentcpu->vmx_enabled = 1;
1374         // TODO: do we need this?
1375         store_gdt(&currentcpu->host_gdt);
1376
1377         printk("VMX enabled on CPU %d\n", core_id());
1378         return;
1379
1380 failed:
1381         printk("Failed to enable VMX on core %d, err = %d\n", core_id(), ret);
1382 }
1383
1384 /**
1385  * vmx_disable - disables VMX mode on the current CPU
1386  */
1387 static void vmx_disable(void *unused)
1388 {
1389         if (currentcpu->vmx_enabled) {
1390                 __vmxoff();
1391                 lcr4(rcr4() & ~X86_CR4_VMXE);
1392                 currentcpu->vmx_enabled = 0;
1393         }
1394 }
1395
1396 /* Probe the cpus to see which ones can do vmx.
1397  * Returns TRUE if this cpu supports VT-x, FALSE otherwise.
1398  */
1399 static bool probe_cpu_vmx(void)
1400 {
1401         /* The best way to test this code is:
1402          * wrmsr -p <cpu> 0x3a 1
1403          * This will lock vmx off; then modprobe dune.
1404          * Frequently, however, systems have all 0x3a registers set to 5,
1405          * meaning testing is impossible, as vmx cannot be disabled.
1406          * We have to simulate it being unavailable in most cases.
1407          * A 'test' variable would provide an easy way to simulate
1408          * unavailability of vmx on some, none, or all cpus.
1409          */
1410         if (!cpu_has_vmx()) {
1411                 printk("Machine does not support VT-x\n");
1412                 return FALSE;
1413         } else {
1414                 printk("Machine supports VT-x\n");
1415                 return TRUE;
1416         }
1417 }
1418
1419 static void setup_vmxarea(void)
1420 {
1421         struct vmcs *vmxon_buf;
1422         printd("Set up vmxarea for cpu %d\n", core_id());
1423         vmxon_buf = __vmx_alloc_vmcs(node_id());
1424         if (!vmxon_buf) {
1425                 printk("setup_vmxarea failed on core %d\n", core_id());
1426                 return;
1427         }
1428         currentcpu->vmxarea = vmxon_buf;
1429 }
1430
1431 static int ept_init(void)
1432 {
1433         if (!cpu_has_vmx_ept()) {
1434                 printk("VMX doesn't support EPT!\n");
1435                 return -1;
1436         }
1437         if (!cpu_has_vmx_eptp_writeback()) {
1438                 printk("VMX EPT doesn't support WB memory!\n");
1439                 return -1;
1440         }
1441         if (!cpu_has_vmx_ept_4levels()) {
1442                 printk("VMX EPT doesn't support 4 level walks!\n");
1443                 return -1;
1444         }
1445         switch (arch_max_jumbo_page_shift()) {
1446                 case PML3_SHIFT:
1447                         if (!cpu_has_vmx_ept_1g_page()) {
1448                                 printk("VMX EPT doesn't support 1 GB pages!\n");
1449                                 return -1;
1450                         }
1451                         break;
1452                 case PML2_SHIFT:
1453                         if (!cpu_has_vmx_ept_2m_page()) {
1454                                 printk("VMX EPT doesn't support 2 MB pages!\n");
1455                                 return -1;
1456                         }
1457                         break;
1458                 default:
1459                         printk("Unexpected jumbo page size %d\n",
1460                                arch_max_jumbo_page_shift());
1461                         return -1;
1462         }
1463         if (!cpu_has_vmx_ept_ad_bits()) {
1464                 printk("VMX EPT doesn't support accessed/dirty!\n");
1465                 x86_ept_pte_fix_ups |= EPTE_A | EPTE_D;
1466         }
1467         if (!cpu_has_vmx_invept() || !cpu_has_vmx_invept_global()) {
1468                 printk("VMX EPT can't invalidate PTEs/TLBs!\n");
1469                 return -1;
1470         }
1471
1472         return 0;
1473 }
1474
1475 /**
1476  * intel_vmm_init sets up the physical core data areas that are required to run a VM at all.
1477  * These data areas are not connected to a specific user process in any way. Instead,
1478  * they are in some sense externalizing what would otherwise be a very large ball of
1479  * state that would be inside the CPU.
1480  */
1481 int intel_vmm_init(void)
1482 {
1483         int r, cpu, ret;
1484
1485         if (! probe_cpu_vmx()) {
1486                 return -EOPNOTSUPP;
1487         }
1488
1489         setup_vmcs_config(&ret);
1490
1491         if (ret) {
1492                 printk("setup_vmcs_config failed: %d\n", ret);
1493                 return ret;
1494         }
1495
1496         msr_bitmap = (unsigned long *)kpage_zalloc_addr();
1497         if (!msr_bitmap) {
1498                 printk("Could not allocate msr_bitmap\n");
1499                 return -ENOMEM;
1500         }
1501         /* FIXME: do we need APIC virtualization (flexpriority?) */
1502
1503         memset(msr_bitmap, 0xff, PAGE_SIZE);
1504         __vmx_disable_intercept_for_msr(msr_bitmap, MSR_FS_BASE);
1505         __vmx_disable_intercept_for_msr(msr_bitmap, MSR_GS_BASE);
1506
1507         if ((ret = ept_init())) {
1508                 printk("EPT init failed, %d\n", ret);
1509                 return ret;
1510         }
1511         printk("VMX setup succeeded\n");
1512         return 0;
1513 }
1514
1515 int intel_vmm_pcpu_init(void)
1516 {
1517         setup_vmxarea();
1518         vmx_enable();
1519         return 0;
1520 }