
Abstractor: Linux, Syscalls and Hypervisors

Part 1: Why Do We Even Need VMs?

What Linux Already Does

┌──────────┐ ┌──────────┐ ┌──────────┐
│ Process A│ │ Process B│ │ Process C│
│ (user 1) │ │ (user 1) │ │ (user 2) │
└────┬─────┘ └────┬─────┘ └────┬─────┘
     │            │            │
┌────┴────────────┴────────────┴────┐
│            KERNEL                  │
│  - Schedules CPU time              │
│  - Manages memory (page tables)    │
│  - Controls file access            │
│  - Enforces user permissions       │
└────────────────┬──────────────────┘
                 │
            Hardware

Already isolated:

Resource    How Linux Isolates
──────────  ──────────────────────────────────────────────────
CPU time    Scheduler gives each process time slices
Memory      Page tables — process A can’t see process B’s RAM
Files       Permissions — user 2 can’t read user 1’s files
Network     Ports, firewall rules

The Gaps — Why Linux Multi-User Isn’t Enough

Problem              Why Linux Multi-User Isn’t Enough
───────────────────  ────────────────────────────────────────────────────────
Different OS         A process can’t be Windows. It’s all Linux.
Kernel bugs          One kernel. A bug affects everyone.
Root escape          If a process gets root, game over for ALL users.
Kernel version       Everyone shares the same kernel. Can’t mix versions.
Full isolation       Processes share kernel memory, syscalls, timing.
                     Information can leak.
Resource guarantees  Scheduler is “fair” but not guaranteed. No hard limits.

What VMs Add

Each VM gets its own kernel. Total separation.

┌─────────────────┐  ┌─────────────────┐
│      VM 1       │  │      VM 2       │
│  ┌───────────┐  │  │  ┌───────────┐  │
│  │ Process A │  │  │  │ Process X │  │
│  └─────┬─────┘  │  │  └─────┬─────┘  │
│        │        │  │        │        │
│  ┌─────┴─────┐  │  │  ┌─────┴─────┐  │
│  │  Kernel 1 │  │  │  │  Kernel 2 │  │
│  │  (Linux)  │  │  │  │ (Windows) │  │
│  └───────────┘  │  │  └───────────┘  │
└────────┬────────┘  └────────┬────────┘
         │                    │
┌────────┴────────────────────┴────────┐
│            HYPERVISOR                 │
└──────────────────┬───────────────────┘
                   │
               Hardware

Now:

  • VM 1 root ≠ VM 2 root
  • VM 1 kernel crash ≠ VM 2 crash
  • VM 1 can run Linux, VM 2 can run Windows
  • Isolation is at hardware boundary, not syscall boundary

The Short Answer:

  • Linux multi-user: Trust the kernel. Isolation via permissions and page tables.
  • VMs: Don’t trust anything. Each tenant gets their own kernel. Isolation via fake hardware.

Part 2: Linux Internals — The Three Boundaries

Everything in Linux is about enforcing three boundaries:

┌─────────────────────────────────────────────────┐
│                                                 │
│   BOUNDARY 1: Process ←→ Process               │
│   (Memory isolation via page tables)           │
│                                                 │
│   BOUNDARY 2: Process ←→ Kernel                │
│   (Ring 3 vs Ring 0, syscalls)                 │
│                                                 │
│   BOUNDARY 3: Kernel ←→ Hardware               │
│   (Drivers, interrupts)                        │
│                                                 │
└─────────────────────────────────────────────────┘

Part 3: CPU Has Two Modes (Hardware Enforced)

┌─────────────────────────────────────┐
│         RING 3 (User Mode)          │
│                                     │
│   - Your process runs here          │
│   - CANNOT execute privileged ops   │
│   - CANNOT access hardware directly │
│   - CANNOT see other process memory │
│                                     │
│   If you try → CPU raises exception │
└──────────────────┬──────────────────┘
                   │ syscall instruction
                   │ (controlled gate)
                   ▼
┌─────────────────────────────────────┐
│         RING 0 (Kernel Mode)        │
│                                     │
│   - Kernel runs here                │
│   - CAN touch hardware              │
│   - CAN see all memory              │
│   - CAN do anything                 │
└─────────────────────────────────────┘

The CPU itself enforces this. A register holds the current ring. Hardware checks every instruction.
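
You can even observe the mode from user space. A minimal sketch for x86-64 with GCC or Clang (the inline asm is toolchain-specific, an assumption beyond the original text): the low two bits of the CS segment register hold the Current Privilege Level, and a user process always reads 3.

#include <stdio.h>

int main(void) {
    unsigned long cs;
    // Reading a segment register is allowed in user mode;
    // changing the privilege level is not. That's the whole point.
    __asm__ volatile ("mov %%cs, %0" : "=r"(cs));
    printf("CPL = %lu\n", cs & 3);   // prints 3: we're in ring 3
    return 0;
}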

What you remember:

“CPU has a mode bit. Ring 0 vs Ring 3. Hardware checks every instruction. Some instructions only work in ring 0.”

If they ask details:

“I don’t remember which register, but there’s a bit that says ‘kernel mode’ or ‘user mode’. CPU checks it on privileged operations.”


Part 4: The Thin API (Syscalls)

Your process wants things. Hardware has things. Kernel is the middleman.

┌──────────────────────────────────────┐
│            YOUR PROCESS              │
│                                      │
│   You can:                           │
│     - Do math                        │
│     - Access your own memory         │
│     - Call the thin API              │
│                                      │
│   You cannot:                        │
│     - Touch hardware                 │
│     - See other processes            │
│     - Do anything privileged         │
│                                      │
└──────────────────┬───────────────────┘
                   │
                   │  THE THIN API
                   │
                   │  get_handle(thing) → int
                   │  bytes_in(int) → data
                   │  bytes_out(int, data)
                   │  release(int)
                   │  do_anything(int, ???)
                   │
                   │  (officially: open, read, write, close, ioctl)
                   │
┌──────────────────▼───────────────────┐
│         KERNEL + HARDWARE            │
│                                      │
│   Figures out what handle means.     │
│   Does the actual work.              │
│   Returns result.                    │
│                                      │
└──────────────────────────────────────┘

How read(fd, buf, 100) Actually Works

read(fd, buf, 100);  // This calls a function in libc. That's it.

The layers:

YOUR CODE:
    read(fd, buf, 100);
         │
         │ normal function call
         ▼
LIBC (glibc, musl, etc):
    ssize_t read(int fd, void *buf, size_t count) {
        // Put args in registers
        // Do the actual syscall instruction
        return syscall(SYS_read, fd, buf, count);
    }
         │
         │ syscall instruction
         ▼
KERNEL:
    sys_read(fd, buf, count) {
        // actual work
    }

Why SYSCALL Is A CPU Instruction

Like ADD or MOV, there is literally an instruction called SYSCALL. It’s opcode 0x0F 0x05. Hardware knows what to do.

What SYSCALL does (hardware, atomically):

  1. Save rip → rcx (so kernel knows where to return)
  2. Save flags → r11
  3. Load rip from special register (MSR_LSTAR) — kernel set this at boot
  4. Set mode = privileged
  5. Continue at new rip (now in kernel)

Software can’t change the mode bit. That would require… privileged mode. Chicken and egg. So CPU provides a special instruction that does all of this atomically, safely.
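
You can skip libc’s read()/write() wrappers and invoke the gate yourself. A small sketch using the generic syscall(2) function, which loads the argument registers and executes the SYSCALL instruction for you:

#define _GNU_SOURCE
#include <sys/syscall.h>   // SYS_write: the number the kernel dispatches on
#include <unistd.h>

int main(void) {
    const char msg[] = "hello via raw syscall\n";
    // Same effect as write(1, ...): args go into registers, then SYSCALL
    syscall(SYS_write, 1, msg, sizeof msg - 1);
    return 0;
}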


Part 4.5: The Trap Table

SYSCALL is one entry point, but the CPU handles many other events like divide by zero, page fault, debug breakpoint, and timer tick. Each of these must transfer control to kernel code.

This transfer has to be indirect. If unprivileged code could jump directly to any kernel address, it would skip validation and corrupt state. So the CPU uses a table of fixed entry points that only the kernel can configure.

When event N happens, the CPU looks up entry N, switches to privileged mode, and jumps to that address.

┌────────┬─────────────────────┐
│ Number │ Jump address        │
├────────┼─────────────────────┤
│   0    │ 0x80100100          │  divide by zero
│   1    │ 0x80100200          │  debug
│  13    │ 0x80100D00          │  general protection fault
│  14    │ 0x80100E00          │  page fault
│  ...   │ ...                 │
│  64    │ 0x80103000          │  syscall (legacy)
└────────┴─────────────────────┘

The kernel writes this table at boot, and the lidt instruction tells the CPU where to find it. Both operations require privileged mode.

Who         Does                               Privileged?
──────────  ─────────────────────────────────  ───────────
Kernel      Writes table, runs lidt            Yes
User code   Triggers entry (int N, or fault)   No
Hardware    Looks up, switches mode, jumps     N/A

User code triggers. Kernel code configures. Hardware enforces.

(officially: Interrupt Descriptor Table / IDT)
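
User code can’t touch the table, but it can watch the round trip. A small sketch, assuming a Linux box: dereference an unmapped address, the CPU takes IDT entry 14 (page fault), and the kernel’s handler answers by sending the process SIGSEGV.

#include <signal.h>
#include <unistd.h>

static void on_segv(int sig) {
    (void)sig;
    const char msg[] = "fault -> IDT entry 14 -> kernel -> SIGSEGV\n";
    write(1, msg, sizeof msg - 1);  // write() is async-signal-safe
    _exit(0);
}

int main(void) {
    signal(SIGSEGV, on_segv);
    *(volatile int *)0 = 42;  // touch the unmapped page at address 0
    return 1;                 // never reached
}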


Part 5: Page Tables (Memory Isolation)

The Lie

Every process thinks: “I have all memory to myself. I start at address 0. I’m alone.”

The Truth

Process A's view:          Process B's view:

0x1000 → my variable       0x1000 → my variable

        │                          │
        ▼                          ▼
    ┌───────┐                  ┌───────┐
    │ MMU   │                  │ MMU   │
    │(CPU)  │                  │(CPU)  │
    └───┬───┘                  └───┬───┘
        │                          │
        ▼                          ▼
Physical 0x50000           Physical 0x80000
(different!)               (different!)

Page table = a map from virtual → physical

Virtual Address      Physical Address     Permissions
─────────────────────────────────────────────────────
0x0000 - 0x0FFF  →   UNMAPPED            (null trap)
0x1000 - 0x1FFF  →   0x50000             read, write
0x2000 - 0x2FFF  →   0x51000             read, execute

Key insight:

  • CPU has a register (CR3 on x86) pointing to current page table
  • On EVERY memory access, hardware translates
  • Process can’t change CR3 — it’s privileged (ring 0 only)
  • Kernel swaps CR3 when switching processes

Process Virtual Address Space

0xFFFFFFFF  ┌─────────────────┐
            │  Kernel memory  │ ← Process CANNOT access
            │  (mapped but    │   (page table says no)
            │   protected)    │
0xC0000000  ├─────────────────┤
            │     Stack       │ ← Grows down
            │       ↓         │
            ├─────────────────┤
            │       ↑         │
            │     Heap        │ ← malloc() comes from here
            │  (brk / mmap)   │
            ├─────────────────┤
            │    .bss         │ ← Uninitialized globals
            ├─────────────────┤
            │    .data        │ ← Initialized globals
            ├─────────────────┤
            │    .text        │ ← Your code (read + execute)
0x00400000  ├─────────────────┤
            │   Unmapped      │ ← NULL pointer catches
0x00000000  └─────────────────┘
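
The kernel will show you this layout for any live process. A quick sketch that dumps the current process’s own mappings via /proc/self/maps (Linux-specific): each line is an address range, its permissions, and what backs it.

#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) return 1;
    char line[256];
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);   // e.g. ".text r-xp", "[heap]", "[stack]"
    fclose(f);
    return 0;
}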

Part 6: What A Process Actually Is

Kernel’s View

struct task_struct {
    // IDENTITY
    pid_t pid;
    uid_t uid;
    gid_t gid;

    // SPACE
    struct mm_struct *mm;       // → page tables, mappings

    // STATE
    struct pt_regs regs;        // saved registers

    // RESOURCES (more IDENTITY)
    struct files_struct *files; // open file descriptors

    // SCHEDULING (TIME)
    int prio;
    u64 vruntime;               // how much time used

    // ...hundreds more fields
};

A process is just a struct. Kernel allocates it, fills it in, links it to scheduler.
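
Pieces of that struct are exposed under /proc. A small sketch printing the identity fields of the current process, as they appear in /proc/self/status:

#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return 1;
    char line[128];
    while (fgets(line, sizeof line, f))
        if (!strncmp(line, "Pid:", 4) ||   // task_struct.pid
            !strncmp(line, "Uid:", 4) ||   // the uid fields
            !strncmp(line, "Gid:", 4))     // the gid fields
            fputs(line, stdout);
    fclose(f);
    return 0;
}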


Part 7: Context Switch — The Big Moment

There’s ONE Physical CPU. ONE Set of Registers.

When a process is NOT running, its registers live in memory:

MEMORY (kernel heap):

    Process A's task_struct:       Process B's task_struct:
    ┌─────────────────────┐        ┌─────────────────────┐
    │ saved_rax = 5       │        │ saved_rax = 99      │
    │ saved_rbx = 10      │        │ saved_rbx = 200     │
    │ saved_rip = 0x4000  │        │ saved_rip = 0x5000  │
    │ saved_rsp = 0x7fff  │        │ saved_rsp = 0x8fff  │
    └─────────────────────┘        └─────────────────────┘

    Just numbers in memory!

Running vs Not Running

PROCESS A RUNNING:

    CPU hardware numbers:        A's blob in memory:
    ┌──────────────────┐         ┌──────────────────┐
    │ calc = 5         │ ← LIVE  │ saved_calc = ?   │ (stale)
    │ position = 0x4000│         │ saved_pos = ?    │
    └──────────────────┘         └──────────────────┘


PROCESS A NOT RUNNING:

    CPU hardware numbers:        A's blob in memory:
    ┌──────────────────┐         ┌──────────────────┐
    │ (some other      │         │ saved_calc = 5   │ ← SAVED
    │  process's data) │         │ saved_pos = 0x4000│
    └──────────────────┘         └──────────────────┘

The Switch

SWITCHING A → B:

1. Hardware kicks kernel (timer interrupt)

2. Kernel saves CPU numbers → A's blob
   ┌──────────────────┐         ┌──────────────────┐
   │ calc = 5         │ ──────► │ saved_calc = 5   │
   │ position = 0x4000│         │ saved_pos = 0x4000│
   └──────────────────┘         └──────────────────┘

3. Kernel loads B's blob → CPU numbers
   ┌──────────────────┐         ┌──────────────────┐
   │ calc = 99        │ ◄────── │ saved_calc = 99  │
   │ position = 0x7000│         │ saved_pos = 0x7000│
   └──────────────────┘         └──────────────────┘

4. Kernel changes map pointer to B's map

5. Return to unprivileged mode

6. B continues. Never knew it was paused.

What gets saved/restored:

Saved                      Why
─────────────────────────  ──────────────────────
Registers (rax, rbx, …)    B’s computation state
Instruction pointer        Where B was executing
Stack pointer              B’s stack position
CR3 (page table)           B’s memory view
FPU/SSE state              B’s floating point

The insight: Registers are just numbers. When running: in hardware. When not running: in memory. “Copy registers” = copy numbers from one place to another.
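
The kernel also counts every time it does this to you. A small sketch using getrusage(): voluntary switches mean the process blocked and gave up the CPU; involuntary ones mean the timer kicked it off mid-run.

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    for (volatile long i = 0; i < 100000000L; i++)
        ;  // burn CPU so the scheduler is likely to preempt us
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("voluntary: %ld, involuntary: %ld\n",
           ru.ru_nvcsw, ru.ru_nivcsw);  // context-switch counters
    return 0;
}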


Part 8: Process Interruption — What Actually Happens

The Scary Question

Process is in the middle of int x = a + b * c; — timer fires. What happens?

Answer: CPU Saves EVERYTHING

PROCESS RUNNING:

    Registers:
    ┌─────────────────┐
    │ rax = 5         │ ← mid-calculation
    │ rbx = 10        │
    │ rip = 0x4001234 │ ← instruction pointer
    │ rsp = 0x7fff100 │ ← stack pointer
    │ flags = ...     │
    └─────────────────┘

         │
         │  TIMER INTERRUPT
         ▼

    CPU AUTOMATICALLY (hardware):
    1. Finishes current instruction
    2. Saves rip, rsp, flags to special place
    3. Switches to ring 0
    4. Jumps to interrupt handler

Key: CPU finishes current instruction. Never stops mid-instruction.

Each Process Has TWO Stacks

┌─────────────────────────────────────────┐
│              PROCESS A                  │
│                                         │
│   User Stack          Kernel Stack      │
│   (your code uses)    (kernel uses)     │
│                                         │
│   ┌──────────┐       ┌──────────┐      │
│   │ local    │       │ saved    │      │
│   │ vars     │       │ regs     │      │
│   │ ...      │       │ from     │      │
│   │          │       │ interrupt│      │
│   └──────────┘       └──────────┘      │
│                                         │
└─────────────────────────────────────────┘

When interrupted:

  • CPU switches to kernel stack (automatically)
  • Kernel pushes registers there
  • User stack is untouched

Part 9: Fork and Exec — How Processes Launch

// Shell runs (roughly) this when you type "./myprogram"
pid_t pid = fork();    // 1. Clone current process

if (pid == 0) {
    // Child process
    execl("./myprogram", "./myprogram", (char *)NULL);  // 2. Replace with new program
}

What fork() Does

FORK DOES (inside kernel):

    1. Allocate NEW task_struct in memory

    2. Copy parent's CURRENT register values into it:

       Child's task_struct (NEW):
       ┌──────────────────┐
       │ pid = 101        │  ← new PID
       │ saved_rax = 5    │  ← copied from physical regs
       │ saved_rbx = 10   │  ← copied from physical regs
       │ saved_rip = 0x4000│ ← same instruction!
       │ page_table = ... │  ← copy or share parent's
       │ files = ...      │  ← copy parent's fd table
       └──────────────────┘

    3. Put child on scheduler's run queue

    4. Return to parent (still running)

Why Fork Returns Different Values

pid = fork();
if (pid == 0) { /* child */ } else { /* parent */ }

How?

KERNEL DOES:

    Parent's task_struct:        Child's task_struct:
    ┌──────────────────┐         ┌──────────────────┐
    │ saved_rax = 101  │         │ saved_rax = 0    │
    │ (child's pid)    │         │ (zero!)          │
    └──────────────────┘         └──────────────────┘

    rax is where return value goes.
    Kernel puts different values in each task_struct.
    When each runs, they get their own return value.

What exec() Does

  1. Open the executable file
  2. Parse ELF header (where’s code, data, entry point?)
  3. Wipe current memory mappings
  4. Map new segments: .text, .data, stack, heap
  5. Set instruction pointer to entry point
  6. Return to user mode
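
Putting fork, exec, and the rax trick together, a complete runnable sketch (it assumes /bin/echo exists, which any Linux box has):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();           // child reads 0 from its saved rax
    if (pid == 0) {
        execl("/bin/echo", "echo", "hello from the new image", (char *)NULL);
        _exit(127);               // exec only returns if it failed
    }
    int status;
    waitpid(pid, &status, 0);     // parent got the child's pid in rax
    printf("child %d exited with %d\n", (int)pid, WEXITSTATUS(status));
    return 0;
}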

Part 10: File Descriptors — How I/O Works

int fd = open("/tmp/foo", O_RDONLY);

What happened:

USER:   fd = 3 (just a number)
                │
KERNEL:         ▼
        ┌───────────────────────────┐
        │ Process's FD table        │
        │                           │
        │ 0 → stdin (terminal)      │
        │ 1 → stdout (terminal)     │
        │ 2 → stderr (terminal)     │
        │ 3 → struct file ──────────────┐
        └───────────────────────────┘   │
                                        ▼
                              ┌──────────────────┐
                              │  struct file     │
                              │  inode: ...      │
                              │  position: 0     │
                              │  ops: read/write │
                              └──────────────────┘

fd is just an index. Kernel holds the actual file struct. You can only refer to it by number. Kernel validates on every operation.
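
A runnable sketch of the whole round trip (assuming /etc/hostname exists): the first free slot is usually 3, because 0-2 are taken at startup.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) return 1;
    printf("fd = %d\n", fd);          // just an index into the FD table
    char buf[64];
    ssize_t n = read(fd, buf, sizeof buf - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("read: %s", buf);
    }
    close(fd);                        // kernel releases the struct file
    return 0;
}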


Part 11: Multi-User — Just Numbers

struct task_struct {
    uid_t uid;    // user id (just a number)
    gid_t gid;    // group id (just a number)
};

struct inode {
    uid_t uid;    // file owner
    gid_t gid;    // file group
    mode_t mode;  // permissions (rwxrwxrwx)
};

When you open():

// Kernel does:
if (process->uid == inode->uid) {
    // Check owner permissions
} else if (process->gid == inode->gid) {
    // Check group permissions
} else {
    // Check other permissions
}

That’s it. Users are just numbers. Permissions are just bits. Kernel does if-statements. Hardware knows nothing about users.
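
The same numbers are visible from user space. A small sketch comparing the process’s uid with a file’s owner and mode bits (stat() and getuid() are standard POSIX):

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    struct stat st;
    if (stat("/etc/passwd", &st) != 0) return 1;
    printf("process uid = %d\n", (int)getuid());       // task_struct side
    printf("file uid = %d, gid = %d, mode = %o\n",     // inode side
           (int)st.st_uid, (int)st.st_gid, st.st_mode & 0777);
    return 0;
}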


Part 12: The One-Page Linux Summary

┌─────────────────────────────────────────────────────────┐
│ HARDWARE ENFORCES:                                      │
│   - Ring 0 vs Ring 3 (CPU mode bit)                     │
│   - Page table translation (MMU on every access)        │
│   - Interrupts (timer forces kernel to get control)     │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│ KERNEL ENFORCES:                                        │
│   - Which process has which page table                  │
│   - Which process runs when (scheduler)                 │
│   - Which files you can access (uid/gid checks)         │
│   - Syscall validation (is this fd valid?)              │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│ PROCESS SEES:                                           │
│   - Illusion of own memory                              │
│   - Illusion of own CPU                                 │
│   - File descriptors (handles, not real access)         │
│   - Syscall API (request things, can't take them)       │
└─────────────────────────────────────────────────────────┘

PART TWO: HYPERVISORS


Part 13: Hypervisors — The Basics

A hypervisor is a program that pretends to be hardware.

┌─────────┐ ┌─────────┐ ┌─────────┐
│  VM 1   │ │  VM 2   │ │  VM 3   │
│ (Linux) │ │(Windows)│ │ (Linux) │
└────┬────┘ └────┬────┘ └────┬────┘
     │           │           │
     │    "I am a computer"  │
     │           │           │
┌────┴───────────┴───────────┴────┐
│          HYPERVISOR             │
│   (lies to everyone above)      │
└────────────────┬────────────────┘
                 │
         Real Hardware
         (actual CPU, RAM, NIC)

Each VM thinks it has: its own CPU, RAM, network card, disk. It doesn’t. The hypervisor fakes all of it.


Part 14: Same Patterns, One Level Deeper

LINUX:                          HYPERVISOR:

Ring 3 (user process)           Ring 3 (user process)
    │                               │
    │ syscall                       │ syscall
    ▼                               ▼
Ring 0 (kernel)                 Ring 0 (guest kernel)
    │                               │
    │ ← this is the bottom          │ VM exit
    ▼                               ▼
Hardware                        Ring -1 (hypervisor)
                                    │
                                    │ ← NOW this is the bottom
                                    ▼
                                Hardware

Everything you learned about Linux applies. Just add one more layer.

The New Lie

Before hypervisor:
    Process thinks: "I have all memory"
    Kernel thinks:  "I control the hardware"

After hypervisor:
    Process thinks: "I have all memory"        (still a lie)
    Kernel thinks:  "I control the hardware"   (NOW ALSO A LIE)

Part 15: Same Concepts, Renamed

Linux                    Hypervisor            Same Idea?
───────────────────────  ────────────────────  ───────────────────────
Ring 3 → Ring 0          Ring 0 → Ring -1      Yes, mode switch
Syscall                  VM exit / hypercall   Yes, controlled entry
Page tables              Nested page tables    Yes, two levels now
Process context switch   VM context switch     Yes, save/restore state
Timer interrupt          VM preemption         Yes, forced scheduling
Process = task_struct    VM = VMCS             Yes, state in memory

What’s Different

Linux                        Hypervisor
───────────────────────────  ──────────────────────────────────────
Virtualizes one program      Virtualizes entire OS + hardware
Fakes “I have all memory”    Fakes “I have CPU, RAM, NIC, disk”
Kernel trusts hardware       Hypervisor doesn’t trust guest kernel
One level of page tables     Two levels (guest + host)

Part 16: VM State Blob (VMCS)

Same idea as process. Just more stuff.

PROCESS STATE BLOB:            VM STATE BLOB:
┌─────────────────────┐        ┌─────────────────────┐
│ saved CPU numbers   │        │ saved CPU numbers   │
│ map pointer         │        │ map pointer         │
│ handles             │        │ guest's map pointer │ ← extra!
│                     │        │ guest's mode bit    │ ← extra!
│                     │        │ virtual devices     │ ← extra!
└─────────────────────┘        └─────────────────────┘

VM blob is bigger because we're faking entire hardware,
not just "a process."
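
On Linux the blob lives behind /dev/kvm, and user space refers to it by fd — the same handle pattern as files. A minimal sketch (requires a machine with KVM and access to /dev/kvm; this just creates the state, it doesn't run a guest):

#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    int kvm = open("/dev/kvm", O_RDWR);
    if (kvm < 0) { perror("open /dev/kvm"); return 1; }
    int vm = ioctl(kvm, KVM_CREATE_VM, 0);      // kernel allocates VM state
    if (vm < 0) { perror("KVM_CREATE_VM"); return 1; }
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);   // per-vCPU state blob lives in-kernel
    if (vcpu < 0) { perror("KVM_CREATE_VCPU"); return 1; }
    printf("vm fd = %d, vcpu fd = %d\n", vm, vcpu);
    close(vcpu); close(vm); close(kvm);
    return 0;
}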

Part 17: Two Levels of Maps (Nested Page Tables / EPT)

PROCESS (one map):

    Virtual → Physical
    0x1000  → 0x50000


VM (two maps):

    Guest Virtual → Guest Physical → Host Physical
    0x1000        → 0x2000         → 0x8000

    Guest thinks 0x1000 → 0x2000.
    But 0x2000 is ALSO fake!
    Real location is 0x8000.

    (officially: nested page tables / EPT)

Hardware Does Both Translations

MEMORY ACCESS IN VM:

    Guest code: load [0x1000]

    1. Guest page table: 0x1000 → 0x2000 (guest physical)
    2. Nested page table: 0x2000 → 0x8000 (real physical)
    3. Actually read from 0x8000

    Hardware does both lookups!
    (officially: EPT - Extended Page Tables)
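
A toy model of the double lookup in plain C — page-granular and flat, whereas real EPT walks are four-level trees, so this is only the shape of the idea:

#include <stdio.h>

#define PAGE 0x1000UL

// guest virtual page -> guest physical page (guest kernel owns this)
static unsigned long guest_pt[16]  = { [1] = 0x2000 };
// guest physical page -> host physical page (hypervisor owns this)
static unsigned long nested_pt[16] = { [2] = 0x8000 };

static unsigned long translate(unsigned long gva) {
    unsigned long gpa = guest_pt[gva / PAGE] + gva % PAGE;
    return nested_pt[gpa / PAGE] + gpa % PAGE;  // the second, hidden hop
}

int main(void) {
    printf("guest 0x1000 -> host 0x%lx\n", translate(0x1000));  // 0x8000
    return 0;
}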

Why It Matters

WITHOUT HARDWARE SUPPORT:
    Every memory access → VM exit → hypervisor translates
    Impossibly slow.

WITH HARDWARE SUPPORT (EPT):
    CPU does both translations automatically.
    No VM exit for normal memory access.
    This is why modern VMs are fast.

Part 18: When Does Hypervisor Get Control?

Same as kernel. Two doors:

1. VM does something sensitive
   - Change its map pointer
   - Talk to "hardware" (which is fake)
   - Execute privileged instruction

   CPU says: "That's not real hardware. Let me ask hypervisor."
   (officially: VM exit)

2. Hardware kick (interrupt)
   - Timer
   - Real network packet
   - Hypervisor needs to do something

Part 19: What Nitro Does Differently

Traditional Hypervisor

┌───────────────────────────────────────┐
│           HYPERVISOR                  │
│                                       │
│  - CPU scheduling         (complex)  │
│  - Memory management      (complex)  │
│  - Fake network card      (slow)     │
│  - Fake disk              (slow)     │
│                                       │
└───────────────────────────────────────┘

Nitro Splits It

┌───────────────────────────────────────┐
│           HYPERVISOR (tiny)           │
│                                       │
│  - CPU scheduling                     │
│  - Memory management                  │
│  - That's it                          │
│                                       │
└───────────────────────────────────────┘

┌───────────────────────────────────────┐
│           HARDWARE CARDS              │
│                                       │
│  - Network (Nitro card)               │
│  - Disk (Nitro card)                  │
│  - Real hardware, not emulated        │
│                                       │
└───────────────────────────────────────┘

Hypervisor does less. Hardware does I/O. Faster + simpler.

Why It’s Better

EMULATED NETWORK PACKET:
    VM sends packet
        → VM exit (expensive)
        → Hypervisor handles it (software)
        → Hypervisor talks to real NIC
        → Return to VM
    Every packet = VM exit = slow

NITRO NETWORK PACKET:
    VM sends packet
        → Goes directly to Nitro card (SR-IOV)
        → Nitro card handles it (hardware)
        → No VM exit!
    Packets bypass hypervisor = fast

Part 20: The Whole Picture

┌─────────────────────────────────────────────────────────┐
│                    PROCESS                              │
│  Has: own memory view, CPU time, handles                │
│  Sees: "I'm alone, I have everything"                   │
└────────────────────────┬────────────────────────────────┘
                         │ thin API
┌────────────────────────▼────────────────────────────────┐
│                  GUEST KERNEL                           │
│  Has: all processes, scheduling, handles                │
│  Sees: "I control the hardware"                         │
│  Truth: it's all fake                                   │
└────────────────────────┬────────────────────────────────┘
                         │ VM exit (sensitive op)
┌────────────────────────▼────────────────────────────────┐
│                   HYPERVISOR                            │
│  Has: all VMs, real scheduling, real memory maps        │
│  Sees: actual hardware                                  │
│  Does: CPU + memory only (Nitro)                        │
└────────────────────────┬────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────┐
│                 NITRO CARDS                             │
│  Does: network, disk (real hardware)                    │
└────────────────────────┬────────────────────────────────┘
                         │
                     HARDWARE

Part 21: Resource Isolation — Which Resources Are Hard?

Resource   Dedicate?            Flush?       Encrypt?           Time-Slice?      Risk If Shared
─────────  ───────────────────  ───────────  ─────────────────  ───────────────  ────────────────
CPU time   ✓ pin cores          N/A          N/A                ✓ scheduler      ✗ noisy neighbor
Cache      ✓ CAT                ✗ slow       N/A                N/A              ✗ side channel
RAM        ✓ static partition   ✗ too slow   ✓ encrypt          N/A              ✗ data leak
Network    ✓ hardware queues    N/A          ✓ encrypt traffic  ✓ one at a time  ~ depends
Disk I/O   ✓ separate queues    N/A          ✓ encrypt blocks   ✓ one at a time  ~ depends

The Pattern: Stateful + shared = danger. Cache is the hardest because:

  • It’s shared (cores share L3)
  • It retains state (previous VM’s data)
  • Clearing is slow (flush penalties)
  • Can’t easily encrypt (it’s internal)

Part 22: The Nitro Architecture

┌─────────────────────────────────────────────────────┐
│                   EC2 Instance                       │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐   │
│  │  VM 1   │ │  VM 2   │ │  VM 3   │ │   ...   │   │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘   │
│       │           │           │           │         │
│  ┌────┴───────────┴───────────┴───────────┴────┐   │
│  │         Nitro Hypervisor (minimal)          │   │
│  │    - CPU/Memory allocation only             │   │
│  │    - "Firmware-like" — deliberately small   │   │
│  └─────────────────────┬───────────────────────┘   │
│                        │ PCIe                       │
│  ┌─────────────────────┴───────────────────────┐   │
│  │              Nitro Cards (hardware)          │   │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────────┐    │   │
│  │  │VPC Card │ │EBS Card │ │Storage Card │    │   │
│  │  │(network)│ │(block)  │ │(local NVMe) │    │   │
│  │  └─────────┘ └─────────┘ └─────────────┘    │   │
│  └─────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘

Key Insight: Nitro offloads I/O to hardware so hypervisor only does CPU+memory. This is the “secret sauce.”


Part 23: Official Names Reference

What I Said                     Official (Terrible) Name
──────────────────────────────  ──────────────────────────
mode bit                        ring / privilege level
privileged mode                 ring 0 / kernel mode
unprivileged mode               ring 3 / user mode
ask kernel                      syscall
hardware kick                   interrupt
process state blob              task_struct
VM state blob                   VMCS
map                             page table
map pointer                     CR3
two-level map                   nested page tables / EPT
VM sensitive operation          VM exit
return to VM                    VM entry
hypervisor mode                 VMX root / ring -1
one_level_deeper_privileged     ring -1 / VMX root
guest_thinks_its_privileged     ring 0 in guest
guest_did_something_sensitive   VM exit
return_to_guest                 VM entry
vm_state_blob                   VMCS
two_level_address_map           nested page tables / EPT

Part 24: How To Explain It

“How does memory isolation work between VMs?”

“Two levels of maps. Guest has its map — virtual to ‘physical’. But guest physical is fake. Hypervisor has another map — guest physical to real physical. Hardware walks both. VM-A’s guest physical 0x2000 maps to real 0x8000. VM-B’s guest physical 0x2000 maps to different real address. They can’t see each other. Hypervisor controls the second-level map.”

“Why does Nitro offload I/O to hardware cards?”

“CPU is general-purpose but slow for specific tasks. Every emulated I/O operation causes a VM exit — that’s expensive, like a syscall but worse. Dedicated hardware handles I/O directly. VM talks to card, not hypervisor. Hypervisor stays minimal — just CPU and memory. Smaller attack surface, faster I/O, better density.”

“How does VM isolation compare to process isolation?”

“Linux already isolates processes with page tables and ring protection. But processes share the kernel — one kernel bug, everyone’s exposed. VMs add another layer: each VM gets its own kernel, and the hypervisor sits where hardware sits. It’s the same pattern — ring separation, page tables — just one level deeper.”

“What about cache/side-channel attacks?”

“Shared resource with state = leak path. Cache is worst — shared, stateful, slow to clear. Options: dedicate cores so cache isn’t shared, flush on switch, or hardware partitioning if available. Nitro probably dedicates cores for sensitive workloads. It’s the cleanest — no sharing, no leak. Costs density.”


Part 25: Summary — Sketchy But Correct

Question                                             Answer
───────────────────────────────────────────────────  ─────────────────────────────────────────────────────────────────
What gets saved on interrupt?                        All registers. CPU saves some automatically, kernel saves the rest.
Where is it saved?                                   Each process has a kernel stack. State goes there.
Can a process prevent interruption?                  No. The timer is hardware. cli is privileged.
How is scheduling fair?                              Kernel tracks runtime per process. Lowest runtime goes next.
How often does switching happen?                     Timer fires every ~1-10 ms. Kernel checks and maybe switches.
What about mid-calculation?                          CPU finishes the current instruction first. Then it interrupts.
What’s the difference between Linux and hypervisor?  Same patterns (rings, page tables, state blobs), one level deeper.
What makes Nitro special?                            Minimal hypervisor + hardware offload for I/O.