Advanced Operating Systems

Table of Contents

Meta

Grade Breakdown

10%    participation
25%    homework
25%    project
22.5%  midterm
22.5%  final

class notes

2009-08-25 Tue

  • same course path as undergraduate OS, just much more detail
  • all reading in research papers
  • groups
    • all reading
    • all work
    • one day's lecture
  • research-lite project
    • proposal in ~3rd week of Sept.
    • 1-2 months working on the implementation
    • will produce a research paper
  • undergrad lectures serve as good background for the course

2009-08-27 Thu

email server was down, try again or send email to Dorian

concepts

overview
OS
software providing access to hardware (cpu, memory, disk, IO)
policies
what user can do
mechanisms
how policies enforced
permission levels
often controlled by indicator bit
  • user
  • kernel
system call
allows access to kernel functions from user mode
  1. syscall is made
  2. parameters stored in registers
  3. switch to kernel mode
  4. execute routine defined in the kernel
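
a minimal sketch of this sequence from user space (Linux-specific; getpid() is the usual libc wrapper, while syscall(2) makes the trap explicit):

  #include <stdio.h>
  #include <sys/syscall.h>   /* SYS_getpid */
  #include <unistd.h>        /* syscall() */

  int main(void) {
      /* the call number and arguments are placed in registers, the CPU
         switches to kernel mode, and the kernel routine runs */
      long pid = syscall(SYS_getpid);
      printf("pid = %ld\n", pid);
      return 0;
  }
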
virtual memory

virtual address space which can be mapped to actual memory. this allows the process using the memory to be loaded/unloaded/moved etc…

data/virtual-memory.png

if a page of virtual memory is not in physical memory a page fault occurs and the page is loaded into physical memory

working set / footprint

of a process is the part of its address space currently in use. these are the pages of memory that need to be loaded in physical memory to avoid constant page faulting (thrashing).

design goals

efficiency
often a tradeoff between time and space
robustness
fulfills expectation of users
security
hardware interface
expose features/capabilities of hardware
user interface
present features/capabilities to user
portability
target hardware
economics
development cost, user base
scalability
range of supported hardware/user sizes/numbers
extensibility
ability to support new components

papers

keep in mind the context

  • users back then were developers
  • the CPU used to be the bottleneck; now it's memory

there is an increasing gap between CPU speed and the ability of memory (latency and bandwidth) to keep up.

  • bandwidth is proving to be the limit on the amount of memory that can be used efficiently
observations
hints

2009-09-01 Tue

Stevens, Ritchie, and Thompson developed unix and TCP/IP

  • Stevens
    • part of the unix team
    • wrote the unix bible

2009-09-03 Thu

uni-programming
only one process at a time, typically it would run to completion
batch-programming
still uni-programming but you maintain a queue of processes that are ready to run
multi-programming
allows multiple processes to run "simultaneously" on the machine using preemption, time slicing and by utilizing different hardware components in parallel
time sharing
multi-programming with multiple users creating processes. these days it tends to make the most sense for large batch processes rather than interactive use

Mechanisms for multi-programming

context switch
switching processes on a CPU
process table
maintained by the OS, this contains an entry including a process control block for each process currently running on the system
process control block
contains the PID, the files the process is using, the program counter, register values, a pointer to the image
image
when you are about to run a process, loading the program creates an image of the process and brings it into memory. image + state = process. program -> image -> process

file systems

file
in general to the OS a file is just an uninterpreted raw ordered set of bytes, some specialized OSs do differentiate between file types for optimization.
directory
list of files, most OSs limit access to these files to system calls

mechanisms

  • a filename maps to an index #; the index table maps that index # to an i-node
  • when mounting a new disk, the first few bytes on the disk contain the information the OS uses to populate the index table
  • generally, corrupt disks are the result of damage to this metadata section at the front of the disk

links

hard link
the contents of the file is a pointer to the i-node of the file
soft link
the contents of the file is the name of the file to which it is pointing

deleting a file

  • the actual data isn't "erased", rather the link counter in the i-node is decremented, and if there are no more links, then the blocks on which the file is written are added to the free list
  • deleting a soft link doesn't change its target's i-node
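
a minimal sketch of these semantics with the POSIX calls, assuming a file data.txt already exists (paths made up):

  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void) {
      struct stat sb;

      link("data.txt", "hard.txt");     /* new directory entry, same i-node */
      symlink("data.txt", "soft.txt");  /* new file whose contents name the target */

      stat("data.txt", &sb);
      printf("link count: %ld\n", (long)sb.st_nlink);  /* now 2 */

      unlink("data.txt");  /* decrements the link count; hard.txt still works */
      unlink("soft.txt");  /* removing a soft link never touches its target */
      return 0;
  }
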

2009-09-08 Tue

processes
unit of work user gives to the OS
thread
finer unit of work inside the process

processes

schedulers

batch queue
outside the OS, waiting jobs
short term scheduler
many different scheduling policies
  • round robin
  • priority
  • shortest-first

data/process-context.png

scheduling

scheduling policy

dispatcher
implements the scheduler policy
goals
timing
responsiveness
time to first response
waiting
total time spent waiting
turnaround
time start to finish
resource utilization
users don't really care about this

policies

fifo
first in first out
round robin
move around giving everyone time slices
shortest job first
theoretical, not in real life
priority
not a complete policy, combined with fifo or round robin etc…
multi-level queues
semantics added to queues (i.e. system, interactive, batch, IO-bound, etc…)
multi-level feedback queues
jobs can change priority over time, based on things like increasing the priority of long-waiting jobs to avoid starvation

threads

multiple processes sharing an address space

each thread would have its own

  • UID
  • stack
  • registers
  • definitions for handling signals like C-c, segmentation violations, etc…

everything else is shared (text, static data, dynamic data)

threading schema

many-to-one (user-level threads)
don't really speed up execution, but could help to modularize a program
  • if one thread blocks for IO, all threads are blocked
  • thread operations (creation, deletion) are performed in user-space which makes them faster than having to do them in kernel space
one-to-one (kernel threads)
  • real parallelism
  • less portable
  • takes longer to create a thread
  • real parallel execution
many-to-many (hybrid)
static
static mapping between user-threads and kernel-threads
dynamic
thread mappings can be changed
pool
you can establish a pool of kernel threads, then do all further thread operations in user-space (faster)
limit
you can limit the number of kernel threads to something reasonable (number of processors), reducing overhead on the OS and the number of times you jump the user/kernel boundary
complicated
this is the most complex of the threading schemas
fixed
most often you have a fixed number of kernel threads

inter process communication

through sharing or message passing. each of these could be implemented in terms of the other

in both cases

  • synchronous == blocking; asynchronous == non-blocking

message passing

  • µ-kernels can be thought of as using message passing
  • messages typically have to pass through kernels
  • can cross machine boundaries

shared memory

  • monolithic kernels can be thought of as using shared memory
  • typically faster than message passing
  • requires shared physical media
  • when two processes try to access a variable at the same time
    1. i read
    2. you read
    3. i process and write
    4. you process and write
    5. we've both missed the other's changes
  • atomic operations, locks, semaphores (binary, counting), etc…
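
a minimal sketch of the lost-update race above and the lock that prevents it (pthreads; compile with -pthread):

  #include <pthread.h>
  #include <stdio.h>

  static long counter = 0;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void *worker(void *arg) {
      (void)arg;
      for (int i = 0; i < 1000000; i++) {
          /* without the lock the read-modify-write steps interleave
             and updates are lost */
          pthread_mutex_lock(&lock);
          counter++;
          pthread_mutex_unlock(&lock);
      }
      return NULL;
  }

  int main(void) {
      pthread_t a, b;
      pthread_create(&a, NULL, worker, NULL);
      pthread_create(&b, NULL, worker, NULL);
      pthread_join(a, NULL);
      pthread_join(b, NULL);
      printf("%ld\n", counter);  /* 2000000 with the lock; usually less without */
      return 0;
  }
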

2009-09-10 Thu

talking about µ-kernels.

address space == process

L4 kernel has hierarchical address space

  • every process inherits address space from a parent, and the initial address space (sigma-0) maps directly to physical memory
  • which is like monolithic unix where everything descends from the init process.
  • manages address spaces

paging vs. contiguous allocation

contiguous allocation
has base and limit registers which are used to map virtual addresses to physical memory
paging
pages of memory map non-contiguously into physical memory
  • more complicated translation from virtual to physical addresses
  • allows you to fill holes in memory (finer granularity, because physical memory is consumed in page-size chunks rather than address-space-size chunks)
  • allows portions of the address space to be loaded individually, as opposed to contiguous allocation where the entire address space must be loaded before any execution can take place

1st vs. 2nd generation µ-kernels

  • second tailored more to hardware
  • second built from scratch (rather than cut down from monolithic kernels)

interrupts (µ-kernel is slower)

the main reason µ-kernels are slower is that every interaction is translated into IPC, which has to go through the kernel's tables

monolithic kernel
hardware interrupts in a monolithic kernel are directly looked up in a register table
µ-kernel
one thread waiting for each potential interrupt source

top-half / bottom-half interrupts (in linux)

tradeoff between speed of handling interrupts, and need to do significant amount of processing in many cases

top-half
responds quickly and does what needs to be done immediately
  • for example it will just record that the interrupt occurred
  • has high priority and can interrupt other interrupt handlers
  • setup
bottom-half
does the actual bulk of the work
  • has lower priority
  • service

system call mechanisms

.so shared object
can be shared across multiple address spaces
.a static library
statically linked into the binary at compile time
trampolines
jumps the code execution to somewhere else, then jumps back

scheduling

in L4Linux the normal linux scheduler is used much like a many-to-one thread scheduler. Maps all of the linux "user-threads" to a single kernel thread.

the L4 kernel is scheduled using hard priority round robin

L4 scheduling priority levels:

  1. top-half interrupt handler
  2. bottom-half interrupt handler
  3. kernel (which is the linux server)
  4. user

translation look-aside buffer (TLB)

Fast associative memory that helps in address translation

It maps a virtual address to a physical address; on a TLB hit you skip the page-table lookup.

tagged TLB
like a TLB plus information as to which process the address belongs to

in a normal TLB you have to flush all entries to clear old mappings on a context switch; in a tagged TLB you don't need to flush. this saves time when quickly switching between processes and back.

dual space mistake

tried to facilitate speedy kernel <-> user IPC through shared memory

  • space costs (doubling memory usage)
  • synchronization costs (takes time)

co-location

allows multiple processes to all have access to kernel memory, like threads

2009-09-17 Thu

Project Ideas

  • System monitoring stuffs
    • DynInst
    • KernInst
    • PIN
    • PAPI (Performance API)
  • massively parallel stuffs
    • map-reduce
    • hadoop, w/PIG
    • MRNet large scale group operations
    • RPC
    • XML-RPC
    • task-farming (programming model, issue tasks to the farm and collect results, example SETI-at-home)
  • threads
    • end-to-end threading model
    • deadlock prevention
  • file systems
    • encrypted
  • process hijacking
  • project HAIL
  • FUSE (MAC-FUSE)
  • Amazon Dynamo

exokernels

secure bindings
pain to write to

the spirit of the exokernel is that you would normally use abstractions exported by the library OS rather than always having to implement your own

µ-kernel vs. exokernel vs. monolithic-kernel
  • Monolithic kernel
    +------------------------+
    |   S1   S2     S3       |
    |                        |   Monolithic kernel
    |      S4    S5          |
    +------------------------+
               ^              
              App   
    
  • µ-kernel
    +------+
    |App   |                 +------+
    |      |-\               | App  |
    +------+  -\             |      |
          +-------------+    +-/----+
          |             |    /-
          |             |  /-
          |  u-kernel   |--
          +---/-----\---+
             /       -\
           /-          --\
          /          +---------+
      +--/-----+     | S2      |
      |S1      |     |         |
      |        |     +---------+
      +--------+
    
  • exokernel
    +----+ +----+ +----+
    |App | |App | |LOs |
    +----+ +----+ +----+
    +-------------------+
    |    exokernel      |
    +-------------------+
    |    Hardware       |
    +-------------------+
    
downside (cooperation)

when each application has direct access to the hardware it becomes difficult for applications to cooperate (intelligently share resources), which is routinely done in standard kernels.

2009-09-22 Tue

monitors

(see ../cse451/notes/2007-10-17)

software construct which provides mutual exclusion around a resource. maintains the invariant that when a process enters the monitor, no one else is inside the monitor.

condition variables
allow communication between processes while avoiding spinning (while (!condition);)
  • signal() alerts other processes
  • wait() sleep (relinquish cpu) and wait to be signaled
semaphore
semaphores are effectively equivalent to condition variables
  • sem.p() waits on the semaphore
  • sem.v() signals those waiting on the semaphore
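
a minimal sketch of the p()/v() pairing using POSIX unnamed semaphores (compile with -pthread; sem_wait is p, sem_post is v):

  #include <semaphore.h>

  int main(void) {
      sem_t s;
      sem_init(&s, 0, 1);  /* binary semaphore: one unit available */

      sem_wait(&s);        /* p(): take the unit, blocking if none left */
      /* ... critical section ... */
      sem_post(&s);        /* v(): return the unit, waking a waiter if any */

      sem_destroy(&s);
      return 0;
  }
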

Synchronization Problems

Mesa Monitors

Mesa

  • programs comprised of modules
    • clear API boundary between modules
      • public interface
      • private procedures

Mesa Monitors

  • monitor module
    • entry procedures (public interface)
    • internal procedures (private procedures)
    • external procedures (procedures that require no locking)

Issues

  • when in a monitor (in function foo) and you call a function bar in another module, then during the execution of bar you are not in the monitor (bar has no access to the structures of the monitor, since it is in another module)
    • if you don't release the lock when moving into bar
      • you risk that something in bar tries to grab the resource protected by the monitor /deadlock/
      • you have to unwind and release locks if, say, there is a deep exception
    • if you do release the monitor while calling bar you need to
      • ensure that you get the monitor back after executing bar
      • potentially do cleanup before/after executing bar
  • this is a tradeoff between simplicity (monitor per class) and efficiency (monitor per object); the best option really depends on the use case. monitor per class is sort of a strawman
  • it is possible for lower-priority processes to run in front of higher-priority processes.
    1. p1 acquires l1
    2. p2 preempts p1
    3. p3 preempts p2
    4. p3 tries to acquire l1 (but can't because p1 has the lock)

    the problem is that p2 will run in front of p3, because p1 can't run and release the lock until p2 has run to completion

    priority inheritance
    associate a priority with a resource (lock): the lock's priority is set to the highest priority among the processes waiting for/on the lock, and the process holding the lock runs at the lock's priority (see the sketch below).
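
    POSIX exposes this directly on platforms that implement the priority-inheritance protocol; a minimal sketch (may need a feature-test macro such as -D_POSIX_C_SOURCE=200112L):

      #include <pthread.h>

      pthread_mutex_t lock;

      void init_pi_lock(void) {
          pthread_mutexattr_t attr;
          pthread_mutexattr_init(&attr);
          /* a thread holding this mutex runs at the priority of the
             highest-priority thread blocked on it */
          pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
          pthread_mutex_init(&lock, &attr);
          pthread_mutexattr_destroy(&attr);
      }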

Difference between Mesa and Hoare

  • in Hoare you are guaranteed that immediately upon signaling of a condition variable the waiting process will receive control, however in Mesa monitors the signal is more of a hint and you are not guaranteed to receive control when a signal is sent.
    • Hoare
      if(!cond){condition_v.wait()}
      
    • Mesa (must re-check after the condition becomes true; see the pthreads sketch after this list)
      while(!cond){timed_wait()}
      
  • Mesa
    timeout
    abort
    broadcast vs. signal
    naked notify
    allows hardware interrupt to signal a condition variable without first acquiring the monitor lock. this is more efficient than forcing a device driver to wait for a lock to be released before accessing a monitor.
    • this could lead to a problem where a device signals that a resource is free, but the notification is missed by a process which is just switching from !cond to wait()
    • note that this only allows the hardware interrupt to signal the condition variable, not to actually touch the resource
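
pthreads condition variables have Mesa semantics, so the waiter must re-check in a loop; a minimal sketch:

  #include <pthread.h>
  #include <stdbool.h>

  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
  static bool ready = false;

  void consumer(void) {
      pthread_mutex_lock(&m);
      while (!ready)                   /* Mesa: the signal is only a hint, */
          pthread_cond_wait(&cv, &m);  /* so re-check the condition on wakeup */
      /* ... use the resource ... */
      pthread_mutex_unlock(&m);
  }

  void producer(void) {
      pthread_mutex_lock(&m);
      ready = true;
      pthread_cond_signal(&cv);        /* the waiter runs later, not immediately */
      pthread_mutex_unlock(&m);
  }
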
deadlock

requires

  1. circular wait
  2. mutual exclusion
  3. no preemption
  4. hold-and-wait

2009-09-24 Thu

scheduler activations

wherever their system does well they present the numbers in a table. when their system doesn't fare so well they embed the numbers in the prose.

virtual processors to real processors
how do virtual processors map to real processors
SMMP
shared memory multi-processors, many CPUs which all have access to a single big block of shared memory
normal user-level threads
may not really be that much faster than kernel level threads (at least not to the point that this paper claims)

data/normal-user-thread-setup.png

these user-level threads
in this paper when they say user-level threads they mean the following model

data/sched-act-user-thread-setup.png

kernel scheduling
priority levels and equal access for each priority level
lifo
when there are not enough processors to run all threads, then they follow a lifo policy to take advantage of cache locality (if I was running recently then my cache is still around)
critical section
need to be careful not to preempt a thread in a critical section (or at least let it get back quickly)

data/critical-section-address-space.png

when a thread in a critical section is preempted and other threads are waiting for its lock (and they've pushed it down the lifo queue), you could deadlock (it can't move up the queue until they finish, and they can't finish until it runs)

solution
make a copy of each critical section which ends in a jump to an upcall. when the kernel preempts a process the kernel checks if the code is in a critical section, and if so, it jumps to the copy which is guaranteed to jump to an upcall when the section completes
spin lock
burns CPU, but keeps the process on the ready list; good for short waits, or when you have processors to burn
upcalls
used for the kernel to talk to the user process
  • preemption
  • adding procs
  • blocking
  • unblocking
downcalls
when the user-space communicates to the kernel
  • more procs
  • less procs

2009-09-29 Tue

lottery scheduling

a proportional-share-scheduling system where each entry (waiting consumer/process) gets some number of tickets, and whenever a resource is to be consumed a lottery is held and the winner's ticket is taken and the winner is placed in control of the resource.

specifics

  • actually hold many lotteries at once, forming a queue (rather than a lottery every time quantum)
  • processes can give their tickets to other processes (i.e. client-server model: the client could give tickets to the server)
  • compensation tickets are given to processes that release the CPU before their time quantum has expired
  • the statistical distribution is more uniform with more samples -> a smaller quantum leads to more samples
  • tickets can be used for any resource
    memory management
    reverse lottery, when a page needs to be evicted from memory a lottery is held to select page to remove
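
a minimal sketch of the basic draw described above, assuming a simple linked list of clients (names made up):

  #include <stdlib.h>

  struct client {
      int tickets;
      struct client *next;
  };

  /* draw a number in [0, total_tickets), then walk the list until the
     running ticket sum passes it; clients with more tickets win
     proportionally more often */
  struct client *hold_lottery(struct client *head, int total_tickets) {
      int winner = rand() % total_tickets;  /* total_tickets must be > 0 */
      int sum = 0;
      for (struct client *c = head; c; c = c->next) {
          sum += c->tickets;
          if (winner < sum)
              return c;
      }
      return NULL;  /* unreachable if total_tickets is correct */
  }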

resource containers

aimed at implementing a web server

relevant metrics for web server

  • client metrics
    • response time
    • throughput
  • server metrics
    • number simultaneous clients
    • quality of service, might want different levels for different clients

resource containers allow the application to group its activity into containers and tell the kernel how to assign resources to each container.

mechanism of resource containers

  1. connection comes in and is wrapped in a resource container
  2. thread handling that connection is bound to the resource container
  3. additional resources (i.e. file descriptor) are bound to the resource container

this can be useful for handling malicious requests (i.e. if they're tagged as malicious on the way in they can be given little/no resources)

memory management

handling the speed/capacity tradeoffs of memory while maintaining

  • performance
  • protection
  • correctness
                /\
               /  \                |
speed ^       /    \      capacity v
      |      / reg. \
            /--------\
           /          \
          /  cache     \
         /--------------\
        /                \
       /   main memory    \
      /--------------------\
     /                      \
    /     local disk         \
   /--------------------------\
  /                            \
 /  cloud, remote disk, tape    \
----------------------------------
relocation

addressing schemas (w/static relocation)

source code
symbolic representation of memory addresses
compiled code
relative refs (e.g. module x + offset)
loaded code
absolute addresses

so to change where the code is located in memory you will generally need to reload the code. dynamically relocatable code has its absolute addresses resolved at runtime rather than at load time, so the code can be moved without reloading.

allocation
contiguous allocation
simple (base, limit, attr). makes context switches very simple (the kernel only need to change the base and limit registers)
external fragmentation
may not have enough contiguous free space
sharing
can't share w/o sharing entire address space (no portions)
setting attributes
same as above, can't identify parts of the space
segmentation allocations
divide address space into segments of arbitrary size. segment number -> (base, limit, attr)
external fragmentation
because with variable length sizes there could be many free spaces which aren't big enough to be used
paging
(most popular) fixed size segmentation. this ensures that there is no external fragmentation (if there is any space available then it is page sized and can be used). this is still vulnerable to internal fragmentation

page table

data/memory-management.png

2009-10-01 Thu

2009-10-06 Tue

disco (implementation & performance)

OS modifications

  • drivers for DISCO specific "hardware"
  • changes to keep OS from trying to access a small chunk of unmapped memory
  • (small) allows the guest OS to request a zeroed page (so the guest OS doesn't have to re-zero it)
  • (disco) interprets the guest OS going into low power mode as the OS yielding the processor

virtual memory

Multics
  • since segments are organized/structured as files they actually didn't have a file system. referencing a segment through its symbolic name is like referencing a file
  • seg.tag | address | opcode | external | addressing-mode
    seg. tag
    points to the base register of the owning segment
    external
    whether to use the segment tag (if external) or your own base register
  • an address can point to another address (indirect addressing); this happens when you have multiple levels of lookup.
    • indirect address points to 2 36-bit words, the new segment number and the new word number
  • reference to external program
    • symbolic name -> module name
    • symbolic address -> function name or variable name
  • linkage segments are added to each process to hold the lookup information for external segments. after an initial reference, the number of the link in the linkage segment is used for future references.
VAX
VMS addressing
2-bit seg. | 21-bit page number | 9-bit offset
segments
system space, program region, control region
program region
user data for the program
control region
kernel data for the program
TLB
the TLB is split in two (system/process), so less has to be flushed on a context switch

2009-10-08 Thu

VM pros and cons

  • pros
    • larger address space
    • convenience in segmentation and paging
    • code portability
  • cons
    • (time overhead) increased effective memory latency
    • (space overhead) maintaining mappings, page tables
    • increased complexity

2009-10-20 Tue

disks and file systems (see related 481 slides on Dorian's homepage)

disks

disks
a stack of platters, each a set of concentric circles (tracks) divided into sectors, along with a movable arm; in all modern systems there is one arm/head per platter. each platter (aside from the top and bottom) has data on both sides.

data/disk-os-stack.png

file system

semantics on top of disks

abstractions

  • files
  • directories

handles

  • permissions
  • mapping abstractions to disk
  • enforcing resource quotas

directory

  • just a special file which consists of a list of entries
    • directory entry contains: filename, id, inode-#
  • certain operations (cd, ls) can only take place on directory files
  • organizations (in increasing complexity)
    • 1-level directory
    • 2-level usernames/files
    • trees (graph with no cycles)
    • acyclic graphs (sharing: multiple links to the same content)
      soft/symbolic link
      the file just maps to the name of another file (allows dangling pointers)
      hard link
      actually copies the inode-#, an inode (and the file) is removed when there are no more hard links pointing to the inode. this information is tracked in the inode
    • general graphs

filesystem on disk:

  • boot control block
  • volume control block
    • # of blocks
    • # free blocks (list)
  • directory structure
    • starts @ the root directory
    • filenames, inode-#s
  • file table
    • maps inode-#s to inodes

when a device is mounted the OS loads the filesystem structures into memory

filesystem in memory:

  • mount table
  • cache directory structure
  • open file table (another cache)
    • variations: system wide or per process (know the pros and cons of each of these options)
  • caching (pages/contents of the files)

2009-10-22 Thu

going through the midterm

Grade Distribution

        mean  med  max  max possible
p1      23    25   30   30
p2      11    12   14   15
p3      19    20   25   25
p4      8.7   10   15   15
total   61    61   84   85

Review of problems (in general on the exam less is more)

      1. exokernel: library OSs are linked into the address space of the application, so the getpid call is just a function call in user-space which would not have to cross the user/kernel boundary; this would be faster than in the monolithic kernel
      2. protection, multiplexing, IPC

3)

      1. it is much more complicated to move a process than to move a block of data. if you have many readers/writers of a block of data it may make more sense to move the users to the data rather than moving/replicating the data.
      2. in message passing structured IPC using copy-on-write can allow pointers to be passed from process->kernel->process rather than the actual block of data

2009-10-27 Tue

will discuss LFS and RAID on Thursday

LFS
wanted to improve performance and ended up improving filesystem reliability
RAID
vice versa

Network File System (NFS)

remote file access

  • pros
    • larger file servers (capacity)
    • sharing
    • robustness / redundancy
  • cons
    • speed (latency)
    • availability
    • consistency
    • complexity
  • NFS specific goals
    • 80% speed of local disk
    • simple crash recovery
      • can repeat operations until success (idempotent). many operations are not naturally idempotent; for example, read(f, out, nbytes) would normally advance a file offset stored on the server, so in nfs this offset must be tracked on the client side and passed as a parameter to the server (see the sketch after this list)
    • no state on the server
    • transparent access
    • preserve Unix semantics
  • Deployment Issues
    • sharing the root file system
    • scalability/performance sharing heavy use files (e.g. binaries required on startup)
      • made these files local to each individual node
    • /tmp files (use the process ID, which wouldn't be unique across different nodes)
    • /dev entries of this directory have local semantics which make no sense to access on a remote system
    • authentication across machines (need a global system of user IDs, "Yellow Pages")
    • concurrency: local locks but no global locks, so two users on different nodes could have their writes to a file interleaved.
    • performance: (solution is always caching)
      • calls which occur often but transfer small bits of data (e.g. getattr, which is called by ls and pretty much every file access; this was initially 90% of the transactions) – so they just cached attributes; this cache is invalidated every three seconds for files and thirty seconds for directories
      • used UDP (User Datagram Protocol, unreliable), so if a packet in an RPC is lost they'd just redo the RPC
      • really big packets
        • read-ahead to try to get blocks before they're needed – this doesn't help for executables with random access patterns
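
a sketch of the stateless-read idea with hypothetical signatures (these are illustrative, not the actual NFS protocol definitions):

  #include <sys/types.h>

  /* stateful (local Unix): the kernel remembers the file offset, so
     blindly repeating a call after a crash reads the wrong bytes */
  ssize_t read(int fd, void *buf, size_t nbytes);

  /* stateless (NFS-style): the client passes the offset explicitly,
     so repeating the exact same request is idempotent */
  typedef struct { char opaque[32]; } fhandle_t;
  ssize_t nfs_read(fhandle_t fh, off_t offset, void *buf, size_t nbytes);
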

data/nfs-structure.png

  • VFS (virtual file system) abstraction on top of the specific file system used. allows file systems to be plugged in sort of like device drivers
  • XDR is used as a canonical data representation ensuring that when the client and server share objects (ints, arrays, etc…) they marshal their objects into bits in the same way (endianness, float representations, etc…)

2009-10-29 Thu

(if we are ever really interested in a paper we could lead that lecture)

disk failures

updates, 3 parts – related to disk failure

  • (D) data blocks
  • (F) free blocks
  • (M) meta-data blocks

disk failure part way through a write could lead to incoherence among the three above. most FSs will perform the above in such a way that any inconsistency is a "functional" inconsistency – while space may be wasted, everything will still "work".

some crash cases

(D) -> crash
no real problem, just wasted time writing to a block that's still on the free list
(D) -> (F) -> crash
leaked a data block that will not be recovered
(F) -> (M) -> crash
functional problem, file points to whatever was previously on disk (garbage or someone else's old data)

fsck checks that

  • all blocks not on free list are in use – referenced by an inode
  • all blocks referenced by an inode are not in the free list

journal/log-structured differences

  • journal – transactions in progress which can be used to recover from crash/failure
  • log structured FS – actually uses the log as the only structure on disk

RAID / LFS

writes are buffered in main memory until there is a segment's worth of data to write to disk. this allows the entire segment to be written w/o any seeks, taking advantage of the disk's full bandwidth.

in RAID there is a slowdown factor of N when writing to N disks.

in LFS the checkpoints become the journal

RAID levels (5 and 1 are the only common levels)

  0. block-level striping across disks, no redundancy
  1. straight mirrored disks: faster reads, as you can read from both disks and whichever returns first wins (best-case seek); for a write you have to wait for the write to complete on both disks (worst-case seek)
  2. Hamming code for ECC
  3. single check disk per group
  4. independent reads/writes
  5. no single check disk – parity is distributed (large performance increase over RAID level 4)

note: know the basic read/write operations for each level and be able to discuss the performance implications

2009-11-03 Tue

LFS and RAID

LFS
main point is the caching setup. user <-> cache <-> disk
RAID
don't need to know the names of the specific levels, but should be able to derive the mechanisms for reading/writing, as well as the speed/reliability implications of these mechanisms. RAID can be implemented in hardware or software. be able to extend these concepts (e.g. RAID 7 is …)
0
block-level striping
1
simple mirrored disks
  • read: could use either disk (faster), for a multi-block read each disk could serve up different blocks
  • write: will necessarily use both disks
5
block-level striping and distributed parity – parity is spread across all disks
  • read: will either touch only the specific disk on which the block lives, or will read all disks (including parity) and reconstruct the data
  • write: must touch multiple disks – writes to the disk on which the data will live and to the parity disk, and reads from the other disks to calculate the new parity (see the sketch below)
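
the small-write parity update can also avoid reading every disk: read the old data and old parity, XOR the old data out and the new data in. a minimal sketch of that arithmetic (helper name made up):

  /* RAID 5 small write: the new parity follows from the old data,
     the old parity, and the new data, one XOR each */
  unsigned char new_parity(unsigned char old_data,
                           unsigned char old_parity,
                           unsigned char new_data) {
      return old_parity ^ old_data ^ new_data;
  }
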

blocks and sectors

block
software construct, typically will be equal in size to either a single sector or multiple sectors
sector
the actual size sections of the physical disk

CODA

  • callbacks are used in asynchronous operations; they alleviate the need for active probing by allowing the server to alert the client when a change occurs – used in CODA for cache coherence

2009-11-05 Thu

general consistency

by and large message passing has beaten out shared memory when it comes to distributed computing. MPI is the de-facto message-passing (distributed-memory) standard; OpenMP is a shared-memory alternative.

typically there is no global clock

strong consistency
(called sequential consistency in Munin paper) any write is immediately visible to subsequent reads
causal ordering
uses communication between processes to determine a global partial ordering
weak consistency
this is not really ever used. makes no guarantees that writes will be visible to future reads
eventual consistency
write will eventually be seen
release consistency
requires data to be visible only at certain synchronization points (i.e. at release or barrier)

Munin

Munin – shared program variables are annotated with their access pattern which is used by the OS

data/dist-memory-arch.png

barrier
designate a point where you will wait at that point until every other thread gets to that point
split-phase barrier
two checkpoints, everyone can pass the first checkpoint arbitrarily, but no-one passes the second checkpoint until everyone has passed the first

data/split-phase-barrier.png

Munin Annotations and Protocol Parameters

annotation          I  R  D  FO  M  S  FI  W
read-only           N  Y  -  -   -  -  -   N
migratory           Y  N  N  N   N  -  -   Y
write-shared        N  Y  Y  N   Y  N  N   Y
producer-consumer   N  Y  Y  N   N  Y  N   Y
reduction           N  Y  N  Y   N  N  -   Y
result              N  Y  Y  Y   Y  -  Y   Y
conventional        Y  Y  N  N   N  N  -   Y

(- marks a cell that was blank)

Meanings of Parameters

I    invalidate or update?
R    replicas allowed?
D    delay vs. immediate
FO   fixed owner?
M    multiple writers allowed?
S    stable sharing pattern?
FI   flush changes to owner?
W    writable?

Non-functional performance enhancing objects

  • ability to map an object to a lock
  • ability to explicitly flush changes to an object

Implementation

  • maintained a hash table mapping object addresses to their attributes
  • copyset was a list of where (which processors) an object currently exists
  • delayed update queue (DUQ) to hold updates which will need to be propagated; generally held until a barrier and then sent to everyone in the object's copyset

question: why only use twins when there are multiple writers?

2009-11-10 Tue

Munin implementation

  • DHT or Distributed Object Directory
  • delayed update queue
    • page twins: two copies of a page used to find out what the differences are between old/new versions of the page
  • distributed locks were effectively a queue, person at the front owns the lock and everyone else is further down the line.

page faults used to track updates

  1. write-protect pages that the process would normally be able to write to
  2. when a page faults, allow the write to go through but make a note and maybe update remote copies of the page

Quicksilver

transaction
collection of operations into a single atomic unit of consistency and recovery. techniques include…
  • locks
  • mutexes
  • semaphores
  • monitors
  • h/w instructions
  • interrupt disabling
commit protocols
some things to be considered as goals
  • atomicity
  • recovery semantics
  • minimize overhead
    • blocking/sync
    • logging overhead
    • communication
two phase commit
coordinator and subordinates
transaction_begin
1
2
3
...
transaction_end
  • the coordinator
    1. initiates the transaction
    2. sends a prepare message to all subordinates
    3. subordinates act and respond
    4. sends commit (or abort)
  • the subordinate
    1. upon receipt of the prepare message, replies with either yes or no
    2. no -> veto (the whole transaction aborts)
    3. yes -> go to the prepared state, update logs, and respond yes
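
a minimal runnable sketch of the coordinator's side, with stand-in messaging helpers (send_to_all/recv_vote are made up; a real system would use the network and write log records):

  #include <stdbool.h>
  #include <stdio.h>

  enum vote { VOTE_YES, VOTE_NO };

  /* stand-ins for real messaging – a real coordinator would use RPC
     and log its state before each phase */
  static void send_to_all(int n, const char *msg) { printf("-> all %d: %s\n", n, msg); }
  static enum vote recv_vote(int subordinate) { (void)subordinate; return VOTE_YES; }

  /* phase 1: send PREPARE and collect votes; phase 2: COMMIT iff all yes */
  static bool two_phase_commit(int n) {
      send_to_all(n, "PREPARE");
      bool all_yes = true;
      for (int i = 0; i < n; i++)
          if (recv_vote(i) == VOTE_NO)   /* a single no vetoes the commit */
              all_yes = false;
      send_to_all(n, all_yes ? "COMMIT" : "ABORT");
      return all_yes;
  }

  int main(void) { return !two_phase_commit(3); }
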

2009-11-12 Thu

Quicksilver

locks are used to make a single atomic unit out of a series of operations

short lock
would only be held for a single operation inside of a transaction
long lock
could be held for an entire transaction
locks are classified along two axes – short vs. long duration, and read vs. write mode – and the combination used determines the degree of consistency:
degree 0 consistency
short write lock and no read lock
  • cascading abort
  • dirty reads
  • non-repeatable reads
degree 1 consistency
long write lock and no read lock
  • dirty reads
  • non-repeatable reads
degree 2 consistency
long write lock, and short read lock
  • non-repeatable reads
degree 3 consistency
long write lock, and long read lock

locks in the context of their DFS (Distributed File System)

  • directories
    • locks for renaming, creating, deleting
    • write lock for dir.entries
    • no read locks
  • files
    • short read locks and long write locks
highlights (distinguishing features)

distributed OS using transactions for data consistency

wrapped applications in trivial transactions, so an abnormal exit would undo all previous changes

in order to share a transaction with another process you would need to fork that process

Cluster Based Scalable Network Services

advantages

  • small unit of fault -> robust
  • scalable
  • cost effective

BASE

  • Basically Available
  • Soft state
  • Eventual consistency

Condor is another system that finds idle machines and sends them work when work accrues

implementation

components of the system

  • front end
    • http server
    • thread pool
  • workers
    • to provide services
    • to hold the results of computation
    • report failed services to the manager
  • manager
    • calculates load and sends requests to the front-end
    • receives failure reports from workers

failure peers vs. failure pairs

failure peers
manager watches front-end and restarts if it crashes and vice versa
failure pairs
more generally called hot backups where each component has a backup which can take over if one fails

2009-12-01 Tue

cover CFS and do Map-Reduce on Thursday, presentations starting next week

final

  • let's try to do a final review outside of class
  • final will sprinkle questions over the first half, but will focus on the second half

project

  • paper is due at the end of next week 2009-12-11 Fri 22:49
  • 10-12 minutes per group – 8-10 slides

CFS

cfs-reading

lookup
(finger table and successor list) the successor list alone was slow because on average you would have to touch half the servers in the system, so the finger table was added to store the IDs of far-away nodes for quick jumps to distant portions of the circle (see the sketch below)
caching & timeout
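
a sketch of why the finger table gives quick jumps, assuming a Chord-style 2^M identifier circle (names made up, not CFS's actual code):

  #include <stdint.h>

  #define M 16  /* identifier space is 0 .. 2^M - 1 */

  /* finger i of node n points at the first live node whose ID is
     >= n + 2^i (mod 2^M); each hop can halve the remaining distance,
     so lookups take O(M) hops instead of O(number of servers) */
  uint32_t finger_start(uint32_t n, int i) {
      return (n + (1u << i)) & ((1u << M) - 1);
  }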

2009-12-03 Thu

map reduce

  • stream programming: a collection of filters which the data passes through
           +---+
           | F |
           +---+
          /-   -\
         /       -\
       /-          -\
    +-+              +---+
    |F|              | F |
    +-+              +---+
       -\         /--
         -\     /-
           +---+
           | F |
          /+---+\
       /--       ---\
    +-+             +---+
    |F|             | F |
    +-+\           /+---+
        \         /
         \       /
          \     /
           +---+
           | F |
           +---+
    

data/google-map-reduce.png

consistency
can handle failures in workers (the job just aborts if the master happens to fail) by repeating the computation for failed workers. this means that worker tasks can run multiple times – so they must be idempotent (i.e. side-effect free). the computation must also be deterministic for re-execution of failed tasks to have no visible effect.
backup tasks
only as fast as your slowest worker – so as workers finish the unfinished tasks are duplicated to idle workers in the hopes that someone new will finish the task earlier
combiner function
can be run on the local map worker to compact the data before it is sent off to be reduced
skipping bad records
when some records continually cause workers to fail then they will be skipped
local execution
ideally workers will be selected which are close to the data which they will be analyzing
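
the canonical word-count example as map/reduce callbacks (a sketch of the interface's shape, not the real framework, which is C++; emit() is a stand-in for the framework's collector):

  #include <stdio.h>
  #include <string.h>

  /* stand-in: the framework buffers these pairs, partitions them by
     key, and hands each key's values to reduce() */
  static void emit(const char *key, int value) { printf("(%s, %d)\n", key, value); }

  /* map: called once per input record; emits (word, 1) per word */
  static void map(char *line) {
      for (char *w = strtok(line, " \t\n"); w; w = strtok(NULL, " \t\n"))
          emit(w, 1);
  }

  /* reduce: called once per distinct word; sums that word's counts */
  static void reduce(const char *word, const int *counts, int n) {
      int total = 0;
      for (int i = 0; i < n; i++) total += counts[i];
      printf("%s\t%d\n", word, total);
  }

  int main(void) {
      char line[] = "the quick fox and the dog";
      map(line);              /* emits (the,1) (quick,1) ... */
      int ones[2] = {1, 1};
      reduce("the", ones, 2); /* the framework does this grouping for real */
      return 0;
  }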

reading notes

Map / Reduce

peer-to-peer

Amoeba vs. Sprite

Douglis91Comparison.pdf

both are truly distributed operating systems, in contrast to most of today's large distributed systems, which have node-local OSs with a global managing agent.

Network Services

Munin DSM

NFS network file system

log file system

fast file system

McKusick84FFS.pdf

Old FS: (order on disk)

  1. superblock
  2. inode blocks: direct (first 8 blocks) v.s. indirect blocks
  3. data blocks: size (initially 512 then up to 1024)

issues with this setup

  • inodes not located near the data, so many non-contiguous jumps
  • issues with fragmentation
  • didn't take advantage of the structure of the disk (too much random access of the file)

New FS:

  • collocated inode and file data (in the same cylinder group)
  • replicate the superblock information across all cylinder groups (reliability)
  • variable block sizes (4k block size has average 2k internal fragmentation)
    • split each block into anywhere from 1-8 fragments (powers of two) and manage free space on a fragment (rather than block) basis. this can incur bookkeeping and overhead problems (as a file grows it may need to be continually copied between fragments and blocks).
  • exploit h/w characteristics by adjusting the notion of "contiguous" based on the speed with which the disk can move between segments
  • collocate directories and files

VM in Multics

goals

  1. provide the user with a large virtual memory hiding moving of data between levels, and any machine-dependent stuffs
  2. allow procedures to be called by name w/o any need to plan for the storage of the called procedure
  3. permit sharing of procedures and data among users subject only to permission restraints (vital to efficient operation in a multiplexed system)

process, address space

processes and address space stand in a one-to-one correspondence

address space is composed of variable-length segments; each segment is either data or procedure, which affects its access permissions.

segments are addressed using a directory structure similar to files.

addressing

generalized address
consists of a segment number and a word number
address formation
based on values of processor registers, different for procedure/data segments
procedure
segment number in the procedure base register + the program counter
data
the segment tag of the instruction selects a base register if the external flag is on; otherwise the segment number is taken from the process's own base register
indirect addressing
in this case the generalized address is used to fetch two 36-bit words, these are combined to form another generalized address. can be nested
descriptor segment
generalized-address -> main-memory is done using a two-step hardware lookup
paging
of segments allows non-contiguous segments of main memory to be referenced as logically contiguous generalized addresses

intersegment linking and addressing

shared access and building upon others addresses are both important goals of multiplexed machines

requirements

  • pure procedure segments: execution can't change their content
  • symbolic procedure calls without making prior arrangement for the procedure's use
  • segments of procedure invariant to recompilation of other segments

implementation

making a segment known
when a segment is called by symbolic name it is added to the caller's descriptor segment and can later be referenced by number
linkage data
a process's code must be invariant to recompilation, so the process will always use a segment's name/path to address it. after the segment is known, its number can be used. a linkage segment holds the name/path -> number mappings so that numbers can be used for known segments w/o changing the contents of the process

VM in Vax

Levy82VAX-VMS.pdf

process & virtual address space

page number and offset within the page

address space divided into spaces (not segments)

system space
high-address half is system space and is shared across all processes. This contains OS stuff, executive code and protected data.
process space
low-address half (for the process)
program region (P0)
low-address half of process space. contains the user's executable program. first page is reserved to cause errors on 0-address references
control region (P1)
high-address half of process space. this region is used to hold process-specific data

each space/region has its own page table

system space page table
in hardware, not swapped on context switch
process tables
in the system-space, are swapped on context switch

memory management

paging issues

  1. effect of heavy pagers on other processes
  2. high cost of startup/restart (by faulting its way into main memory)
  3. increased disk workload of paging
  4. processor time searching page lists

pager and swapper

pager
OS procedure resulting from page fault
swapper
separate process which moves pages into/out-of memory

dealing with the above issues

  1. the pager deals with this issue by evicting pages from the process which is requesting the new page, so one process won't push out everyone else's pages. also a limit is placed on the number of pages a process can have in memory.
  2. the above helps with this as well
  3. the VAX clusters the reading and writing of pages to relieve I/O burden on the disk
  4. by not having a reference bit (used to mark recently used pages) the VAX system takes load (scanning page tables and setting these bits) off of the processor

when pages are removed they are placed on the free page list or the modified page list, depending on their modified bit – i.e. whether they need to be written back to disk. these lists serve as physical caches for recently removed pages (it is quick to move a page from one of these lists back into the working set).

by caching the modified pages in the modified page list, the following four speedups are gained.

  1. caches pages for quick return to the process
  2. clustered writes (~100 pages on the development system)
  3. arranged on paging file so clustering read is possible
  4. many page writes are avoided entirely

additional structures

demand zero
when processes require new pages they are created and filled with zeros on demand
copy on reference
when multiple processes share a page, each gets its own copy when it first references the page

program control of memory

for real-time programs that need explicit memory control

  • expand its P0 or P1
  • increase its resident set size
  • lock (or unlock) pages in its resident set
  • create/map sections into its address space
  • record its page-fault activity

lottery scheduling

Waldspurger94Lottery.pdf

(not required reading)

resource containers

Banga99ResourceContainers.pdf

(not required reading)

Scheduler activations

introduction

user threads vs. kernel threads

user threads
  • requires no kernel intervention
  • fast (on order of procedure call)
  • flexible
  • each thread runs on a "virtual processor" which still has to be multiplexed onto a real processor and interleaved with system calls and kernel activity, leading to a performance hit
  • sometimes exhibit poor performance or even incorrect behavior when I/O is involved
kernel threads
  • directly maps each application thread to a physical processor
  • heavy weight
  • not as restricted (re: side effects, I/O)

the goal of this paper is to combine user/kernel threads

  • common case (no kernel required) perform as user threads
  • acts as kernel threads when needs to talk to kernel
  • easily customizable
  • difficulty is that relevant information is scattered between kernel space and user address space

the approach described in this paper is to give each user-level thread system its own virtualized machine, which can have any number of processors.

problems w/user threads over kernel threads
  • kernel threads must implement anything that any reasonable user-level thread system may need (too much overhead)
  • when a user-level thread blocks (for I/O, fault, etc…) its kernel thread also blocks
  • if we create more kernel threads than there are processors, then the OS must make scheduling decisions without any information about the priority / current task / importance of the related user-level threads
design (scheduler activations)

each user-level thread system gets it own virtual multiprocessor

  • kernel gives processors to user thread systems
  • user thread system has complete control over use of its virtual multiprocessor
  • user thread system can tell kernel when it needs more threads
  • user thread system only talks to kernel when it needs to
  • looks to the application programmer like they are using kernel threads
  • communication from the kernel to the user-level thread system which may cause it to reconsider its scheduling decisions.
    • roles
      • serves as the vessel or context of the user-level thread
      • notifies user-level thread of kernel event
      • stores user-level thread when it's blocked (e.g. for I/O)
    • when a thread is stopped
      1. the kernel stuffs it into its activation
      2. creates a new activation to tell the thread system that the thread has been stopped
      3. the thread system removes the thread, and tells the kernel the activation can be re-used
      4. the kernel does another upcall giving the newly released scheduler activation (processor) to the thread system to run a new thread on

      file:data/scheduler-activations-upcalls.pdf

    • there are always as many activations assigned to an address space as there are actual processors
    • in the same manner processors are moved from one address space (thread system) to another
  • how user-level thread systems keep the kernel informed about their amount of parallelism
    • inform kernel when more threads than processors
    • inform kernel when more processors than threads
  • when a thread is interrupted while in a critical section
    1. the kernel makes an upcall informing the address space that the thread's processor is ready
    2. this upcall is intercepted and the processor is given to the thread until it is out of its critical section
    3. the thread is then put back on the ready queue and the address space is free to use the new processor however it sees fit
implementation

implemented by tweaking

Topaz
the native kernel threads for the Firefly machine
FastThreads
a user-level thread package
performance
  • same order of magnitude as plain user-threads
  • upcall performance is slow, much slower than normal kernel thread operations
    • written on top of existing kernel thread library (not from scratch)
    • written in higher level language (not carefully tuned assembly)
  • N-body problem
    • speedup with more processors
      • some increase over fast-threads
      • significant increase over kernel threads
    • more robust than fast-threads to lower amounts of memory
related ideas

psyche and symunix are both NUMA OSs which provide virtual processors similar to activation contexts.

differences

  • both psyche and symunix provide for a shared address space between the kernel and thread systems
  • neither provides the exact functionality of kernel threads (for I/O etc…)
  • neither provides efficient system for user-level thread system to notify kernel when it's hungry
summary

combine the performance of user-level threads with the functionality of kernel-level threads. this is done by supplying each user-level threading system with a virtual multiprocessor in which the application knows exactly how many processors it has at any one time (and each processor maps to an actual physical processor)

  • processor allocation (between applications) is done by the kernel
  • thread scheduling is done by address space
  • kernel notifies address space of events affecting it
    • new processor
    • less processor
    • preempted thread
  • address space notifies the kernel if it needs more/less processors

Monitors (2)

Monitors: An OS structuring concept

Hoare74Monitors.pdf

  • monitors are procedures or functions called by software wishing to acquire a resource, along with local administrative data
    monitorname: monitor
      begin.. declarations of data local to the monitor; 
        procedure procname (... formal parameters...) ; 
          begin... procedure body... end; 
        ... declarations of other procedures local to the monitor; 
        ... initialization of local data of the monitor... 
      end;     
    
  • a procedure will have to wait when the monitor is in use
  • when the program is waiting for the monitor, it needs to be sure that after the monitor is released, the very next procedure to execute will belong to itself
  • there are multiple reasons that a program will need to wait, so the program will have to set a condition variable to indicate that it is waiting for the monitor

example of a monitor (resource:monitor) with condition variable nonbusy

single resource:monitor 
begin busy: Boolean; 
    nonbusy : condition; 
  procedure acquire; 
    begin if busy then nonbusy.wait;
             busy : = true 
    end; 
  procedure release; 
    begin busy := false; 
          nonbusy.signal 
    end; 
  busy : = false; comment initial value;
end single resource 

the above example simulates a boolean semaphore with acquire and release procedures.

interpretation

a process inside a monitor may need to signal another process. the signaler must wait for the signaled process to complete; to indicate that it had control of the monitor and should get it back, it increments an urgent counter.

then whenever the monitor is released, the urgentcounter should be decremented and the longest waiting process on the counter restarted.

similarly we need to be able to allow process in monitors to wait as well as signal which could be implemented similarly (with a waitcounter)

given the above, the monitor can be explicitly passed from one process to another, and is only released when there are no more processes involved in the explicit passing of control

bounded buffer example

two processes running in parallel share a bounded buffer; one is the consumer (eating from the beginning) and one the producer (appending to the end).

the following implements this setup

bounded buffer:monitor
  begin buffer:array 0..N - 1 of portion;
        lastpointer:0..N - 1;
        count:0...N;
        nonempty,nonfull:condition;
    procedure append(x:portion);
      begin if count = N then nonfull.wait;
            note 0 <= count < N;
            buffer[lastpointer] := x;
            lastpointer := (lastpointer + 1) mod N;
            count := count + 1;
            nonempty.signal
      end append;
    procedure remove(result x :portion);
      begin if count = 0 then nonempty.wait;
            note 0 < count <= N;
            x := buffer[(lastpointer - count) mod N];
            count := count - 1;
            nonfull.signal
      end remove;
    count := 0; lastpointer := 0;
  end bounded buffer;
scheduled waits

sometimes rather than just selecting the longest-waiting process from a condition variable we would prefer to allow processes to have some priority

real world examples
  • buffer allocation
  • disk head scheduling (elevator algorithm)
  • readers and writers (only writers need exclusive access)
    • to ensure writers can access elements, no readers can start while a writer is waiting
    • to ensure readers get access, all readers queued during a write are allowed to read before the next write operation begins
    • variables
      • startread
      • endread
      • startwrite
      • endwrite
      • number of waiting readers
      • is someone writing
conclusion

monitors can be an appropriate structure for an OS with parallel users

Experience with Processes and Monitors in Mesa

Lampson80MesaMonitors.pdf

Lampson and his team seem to make everything harder than it should be

issues

programming structure
must fit monitors into Mesa's module based organization
creating processes
need to be able to dynamically create processes after compile time (adds complications)
creating monitors
need to be able to dynamically create monitors after compile time (adds complications)
wait in nested monitor call
is confusing
exceptions
make Mesa's unwind functionality work well with monitors
scheduling
moving from recommendations to implementation proved difficult
input/output
again moving from theory to practice can be hairy
description
implementation

equal division between

runtime
implements the heavier, rarely used stuff like process creation/deletion
compiler
implements the various syntactic constructs and translated into built-in support procedures
hardware
directly implements the more heavily used stuff like scheduling and entry/exit
performance
Construct                  Time (ticks)
simple instruction         1
call + return              30
monitor call + return      50
process switch             60
WAIT                       15
NOTIFY, no one waiting     4
NOTIFY, process waiting    9
FORK+JOIN                  1,100
conclusion

integration of monitors into Mesa was harder than anticipated given the amount of literature on monitors and the high level of Mesa, however, much work was done to implement monitors in such a way that they can be used as the sole concurrency construct for an entire OS/language.

questions
  • wouldn't it also be a problem if I'm in my protected block, and hardware barges in and takes over the resource (breaks the monitor invariant)

Virtualization

Commodity Operating Systems on Scalable Multiprocessors

comodity-os-on-multiprocessors.pdf

again cites the size and complexity of modern operating systems as a limiting factor, this time in effectively utilizing massively multiprocessor machines.

rather than customize the OS this paper inserts a small virtual machine monitor between the OS and the hardware.

Demonstrated on the Stanford FLASH shared-memory multiprocessor, an experimental cache-coherent non-uniform memory architecture (ccNUMA) setup.

data/virtual-machine-stack.png

problem

hardware development moves very quickly, yet people like to bring all of their existing software (which is OS dependent) to this new hardware.

there is a need for quickly porting existing OSs to new hardware as this is the limiting factor in adoption of new hardware setups

virtual machine monitors

the virtual machine monitor serves as a thin layer between the hardware and existing commodity OSs (like Windows NT or *NIX), exporting to each OS a set of virtualized resources which it is able to manage.

while the machine can communicate through standard external interfaces (NFS, TCP/IP), the monitor is able to efficiently assign resources across machines (i.e. one machine may get more memory if needed, etc…)

with small changes the OSs can explicitly take advantage of the shared memory between virtual systems (e.g. a database could put its buffer cache in shared memory supporting multiple query servers)

the VM takes many burdens off of the OS

  • only the VM need scale to the size of the hardware
  • the VM can isolate separate OSs protecting from faults
  • NUMA memory management
  • in general handling hardware quirks
  • VM issues
    overhead
    • additional
      • exception processing
      • instruction execution
      • memory requirements
    • large structures duplicated for each OS (file system buffers)
    resource management
    the VM does not have high-level information about the processing taking place, so it can't distinguish processing which is just the OS's idle loop from important calculations.
    communication
    looks like different OSs on the same hardware rather than each OS on its own hardware, so
    • the same file can't be open in two different VMs
    • the same user can't start multiple VMs
DISCO (a virtual machine monitor)

DISCO is designed for the FLASH multiprocessor which consists of a collection of nodes arrayed on a high speed interconnect. each node contains a CPU, memory, and IO devices

Disco Interface

processors
exports a processor of the same type as those used by FLASH. OSs tuned to use disco can directly access some common processor functionality using special load/store instructions.
physical memory
exports contiguous physical memory starting at 0, and handles all the NUMA stuff behind the scenes
I/O devices
provides each OS with the illusion of their own I/O devices. this means disco must intercept all I/O communication. again provides special instructions for disco-aware OSs to bypass this in special cases
  • DISCO provides a virtual subnetwork which the machines can use to communicate amongst themselves
DISCO implementation

general

  • as a multi-threaded shared memory program
  • the small code portion of DISCO is duplicated across processors so page-misses are all local
  • avoids linked-lists and other structures which perform poorly with caching

virtual CPU

  • for speed DISCO directly executes most instructions and only tries to intercept dangerous instructions (like TLB modifications)
  • runs in supervisor mode, which is between kernel and user mode
  • the monitor catches traps and emulates them on behalf of the VM

virtual memory

  • maintains the machine-to-physical mapping (see the sketch below)
  • catches VM attempts to update the TLB and uses them to update its own TLB
  • downsides which decrease performance
    • TLB used for OS code/memory
    • TLB flushed between CPU switches
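
a rough sketch of that interception path (illustrative names of my own, not Disco's code): the guest OS thinks it is inserting a virtual-to-physical TLB entry, and the monitor swaps in the machine page behind its back:

typedef unsigned long vpn_t, pfn_t, mfn_t;   /* virtual, guest-physical, machine */

extern mfn_t pmap[];                          /* guest-physical -> machine pages */
extern void real_tlb_insert(vpn_t vpn, mfn_t mfn, int flags);

struct guest_tlb_entry { vpn_t vpn; pfn_t pfn; int flags; };

/* trap handler for the privileged TLB-write instruction executed by a guest */
void vmm_tlb_write(struct guest_tlb_entry *e)
{
    real_tlb_insert(e->vpn, pmap[e->pfn], e->flags);
}
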

memory management

  • tries to be smart
    • copies pages to the nodes where they are most used
    • duplicates read-heavy pages between nodes that use them
  • uses FLASH hardware support for counting cache misses per page and identifying hot pages

I/O devices

  • intercepts all devices access
  • add special DISCO device drivers into the OS
  • DMA map (translates physical to virtual address spaces?)

copy-on-write disks

  • multiple VMs can share pages in virtual memory
  • copy-on-write means that this is transparent to the machines
  • copy-on-write only makes sense for writes which will not be permanent or shared between machines
  • for user files and persistent disks DISCO only allows one VM to mount the disk at a time (or uses a distributed file system protocol like NFS)
DISCO (commodity OS)

currently supports a version of UNIX (IRIX), most changes to the OS resided in the HAL (hardware abstraction layer)

the special load/store calls mentioned earlier to avoid traps are implemented in the HAL

experimentation

all takes place on SimOS, a machine simulator

conclusion

the paper tackles the problem of developing system software for shared-memory multiprocessors, and more generally for new hardware.

DISCO shows that many of the performance limitations of VM setups are no longer an issue (sort of).

although software and OSs are growing in complexity the hardware interface has remained relatively simple. supporting new hardware through a thin VM monitor such as disco is simpler and easier than rewriting the OS.

question
DMA (direct memory access)
what is it?

Xen and the Art of Virtualization

Exokernel

exokernel

don't hide power!

Allows untrusted user-level applications to have direct access to system hardware. They present ExOS, an operating system implemented entirely in user-space libraries.

does this by securely multiplexing hardware resources between untrusted software

many programs have specialized behavior and their performance is severely hampered by being forced into using general OS abstractions to access hardware

library OS

  • libraries implementing some part of the OS can be app specific
  • libraries can trust the application (the exokernel will prevent errors from hurting other applications)
  • fewer OS-app transitions since much of the OS (the library) is in the application's address space

exokernel requirements

  1. track ownership of resources
  2. perform access control (guarding usage or binding points)
  3. revoke access to resources

revocation

most OSs have invisible revocation of resources, so that the application doesn't know when, for example, physical memory is being allocated or deallocated.

exokernels have visible revocation, so that applications can have some say in their allocation, and know when resources are scarce. even when the processor is taken at the end of a time-slice the application is notified.

this is necessary when the applications are using physical names to refer to resources, they must be notified upon revocation because their names will have to change

sometimes it's nice to allow "good faith" operations to take place before revocation of a resource

other times the exokernel will abort a misbehaving application

implementations

Aegis
exokernel
ExOS
Library OS

Aegis

process environments
store the information needed to deliver events associated with a resource to its owner
  • exception
  • interrupt
  • protected entry
  • addressing

exceptions

transfers all exceptions to the application except system calls and interrupts

exception handling…

  1. saves three "scratch" registers into an agreed upon place
  2. loads the exception program counter, last non-valid virtual page address, and cause of exception
  3. uses exception cause to jump to pre-specified application program counter where processing resumes

features

  • very fast
  • very simple (because it does not have to differentiate between TLB exceptions and all others)

address translation (application level virtual memory)

TODO

summary

an exokernel eliminates high level abstractions and focuses purely on securely multiplexing the hardware. a library OS can be built very efficiently upon an exokernel, providing many of the standard OS features in a fast and extensible manner.

by allowing applications direct access to hardware it is possible for applications to greatly speed up their performance as compared to a traditional OS.

by implementing the majority of the OS as application libraries it is trivial to extend or tailor major components of the OS.

the only downside seems to be that the application has much more to worry about if it wants to take advantage of the potential speedup.

µ-kernels

performance-of-µ-kernel-based-systems

This paper aims to show that µ-kernel systems

  1. can run modern OS personalities
  2. can perform in the same range as normal monolithic kernels
  3. allow extensions to be implemented efficiently in user space
  4. need provide only four basic concepts: address spaces, threads, scheduling, and synchronous inter-process communication

intro

  • a µ-kernel only provides address space, threads, and IPC
  • many people think that µ-kernels are either
    too low-level
    and these people try to add safeguards, or abstractions for helping extensions
    too high-level
    and these people try to make µ-kernel interfaces look like hardware interfaces
  • first generation µ-kernels like Chorus and Mach
    • evolved from monolithic kernels
  • second generation µ-kernels like QNX and L4
    • designed from scratch
    • more rigorous in pursuit of minimalist design
  • experiments
    • linux adapted to run on L4
      • gives upper performance bound
      • compare L4Linux to a linux adapted to the Mach kernel
      • insight to µ-kernel functions that affect linux performance
    • implemented pipes on top of µ-kernel and compared to native unix pipes
    • implemented mapping-related OS extensions
    • implemented first part of real time user-level memory management system
    • ported L4 to a new processor
    • lower-level communication primitive

related work

L4 essentials

based on two basic concepts, threads and address spaces

thread
activity executing inside of an address space
IPC
cross address-space communication is a fundamental µ-kernel mechanism

the initial address space represents physical memory, additional address spaces are constructed by granting, mapping, and unmapping flex-pages of sizes 2^n. the owner of an address space can grant, map, and unmap its pages to/from other address spaces. these user-level pagers handle all address space construction and maintenance.

note
mapping and unmapping pages is like creating and deleting them: a page either is mapped to physical memory or it is not

when there is a page-fault it is IPC'd by the µ-kernel to the pager associated with the faulting thread. the pager and thread have complete control as to how to handle the fault, allowing many options for memory management (see the sketch below)
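
a hypothetical user-level pager loop; the l4_* names below are placeholders of mine for the real L4 IPC primitives, not the actual API:

/* hypothetical sketch only: l4_* calls stand in for real L4 IPC primitives */
for (;;) {
    fault = l4_ipc_wait(&faulter);          /* kernel turns the thread's page
                                               fault into an IPC to its pager */
    frame = policy_lookup_or_allocate(fault.addr);  /* all policy lives here */
    l4_ipc_reply_map(faulter, flexpage(frame),      /* reply maps a flexpage
                     fault.addr);                      into the faulting space */
}
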

I/O ports are handled as address spaces, with device interrupts handled as IPC

exceptions and traps are synchronous to the executing thread, they are mirrored up to user-level

linux on L4

as linux now runs on multiple architectures there is a fairly well-defined interface between architecture dependent and independent sections

  • architecture-dependent section
    • interrupt service routine
    • low-level device driver support
    • user process interaction
    • context switching
    • copyin/copyout data between kernel and user spaces
    • signaling
    • mapping/unmapping of address spaces
    • system-call mechanism
  • linux uses a 3-level architecture independent page-table scheme
L4-linux design/implementation
  • fully binary compatible

µ-kernel tasks are used for user processes; linux services are provided via a single linux server running in a separate µ-kernel task.

the linux server
linux kernel's address space maps 1-1 to the underlying pager

Unix Time Sharing System

Unix Time Sharing System

wish I had read this to learn Unix/Posix systems

  • perhaps the most important achievement is the demonstration that a useful OS can be built cheaply
    • $40,000 in hardware
    • 2 man-years in development
  • UNIX takes ~50K of ~144K of memory on the computer
  • originally implemented in assembly language, now almost entirely in C

File System

  • ordinary files
  • directories
  • special files
types of files
  • ordinary files: can hold any content, the file system places no limits
  • directories: fairly elegant specification of directories, each is a file holding the names of the files it contains, there is a root directory, there is normally a current directory, etc…
    • / is the "root" directory, which holds a path to all files
    • there are links (a file can live in multiple directories)
      • all links are equal (it doesn't actually live in any one) although in practice a file is made to disappear along with its last link.
    • . and .. are special
  • special files: each I/O device is associated with a special file through which reading/writing to the I/O device occurs
    • file/device I/O are as similar as possible
    • file/device names have the same syntax and meaning
    • same protection mechanism
  • mount: system call which takes the name of an existing ordinary file, and the name of a special file which points to a device which has the structure of an independent file system. mount then replaces the existing file with the root of the independent file system. mounted file systems are identical to regular file systems with the single caveat that no links can exist between separate file systems.
protection
  • uid: each user assigned a unique id
  • 7 permission bits: 6 of which contain read/write/execute info for owner and all other users. 7th when set means that whenever the file is executed it is done so as the owner regardless of the user who triggered the execution.
  • super-user one user ID is exempt from all protections
I/O
  • no locks (they don't really help)
  • sequential access (systems keep a progress-pointer for each file)
  • possible to seek through the file
  • read/write calls return the number of bytes read/written
implementation

Each directory entry contains both the name and i-number for the related file.

  • the i-number is an index into the system table i-list which identifies the file's i-node which contains the following
    1. owner
    2. protection bits
    3. physical address of file
    4. size
    5. time-of last modification
    6. number of links (number of referencing directories)
    7. directory bit
    8. special bit
    9. large/small bit
  • ordinary files: the space on all storage is divided into 512-byte blocks
    • a small file fits into 8 or less blocks and the block addresses are stored
    • a large file uses the 8 blocks to hold 256 block addresses each, allowing for files as large as 8 × 256 × 512 = 2^20 bytes
  • special files: first address word is used to indicate
    • device type: determines drivers used etc…
    • sub-device number: indicates which of the possible devices it is

all reading/writing appears unbuffered and synchronous to the user (it actually is buffered)

efficiency

The time was divided as follows: 63.5 percent assembler execution time, 16.5 percent system overhead, 20.0 percent disk wait time. We will not attempt any interpretation of these figures nor any comparison with other systems, but merely note that we are generally satisfied with the overall performance of the system.

Processes and Images

  • image: computer execution environment, core, registers, current directory, etc…
  • process: execution of an image

The user-core has three parts

  1. program text
  2. non-shared writable segment (heap)
  3. stack
processes

fork creates a process

processid = fork(label)

makes two identical copies of a process, differentiated only in that the parent returns control directly while in the child control is passed to label. the returned processid is the id of the other process.
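
the modern POSIX fork() differs slightly from the paper's fork(label): instead of jumping to a label, it returns 0 in the child and the child's pid in the parent. a minimal sketch:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();              /* both copies continue from here */
    if (pid == 0) {
        printf("child\n");           /* where the paper's label would point */
    } else {
        printf("parent of %d\n", (int)pid);
        wait(NULL);
    }
    return 0;
}
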

pipes

interprocess communication uses same read/write calls used for files, only the info passes through a pipe

filep = pipe

a read on a pipe blocks until someone else writes to the same pipe
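
the modern pipe(2) returns a pair of descriptors rather than the paper's single filep; a minimal sketch of the blocking read:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    char buf[6];

    pipe(fd);
    if (fork() == 0) {               /* child: the writer */
        write(fd[1], "hello", 6);
        _exit(0);
    }
    read(fd[0], buf, 6);             /* blocks until the child writes */
    printf("%s\n", buf);
    return 0;
}
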

execution of programs
execute(file, arg, arg, ..., arg)

all code and data is replaced with that read from file

execute only returns if the execution fails (couldn't find file, or file is not executable)

process synchronization
processid = wait( )

suspends execution until a child process terminates, at which point the id of the child is returned.

termination
exit(status)

terminates a process destroying its image. status is available to any ancestor which is waiting

processes also terminate from illegal actions or due to user signals

The Shell

takes command lines and uses them to execute files with arguments.

standard I/O

programs run by the shell have two files (STDIN and STDOUT) which would be the terminal, but can be redirected to files using < and >.

these are intercepted by the shell and aren't passed as arguments to the program.

filters

commands separated by the | character are run simultaneously with the output of the left program sent to the input of the right program.

filters are commands which copy (with alteration) their standard input to their standard output

command separators & multitasking
  • ; can be used to separate multiple commands on a line
  • & can be used to run commands in the background
shell as command: command file

the shell is itself a command, and series of shell commands can be written to files (shell scripts)

implementation
  1. command passed to shell
  2. parsed into command and arguments
  3. fork is called
  4. child calls execute
  5. parent waits for child, then re-prints prompt

async running is trivial (don't wait)

when the shell forks, the child inherits all open files from its parent (including STDIN and STDOUT)

redirects > simply mean the child changes its file descriptors before calling execute

filters use pipes instead of files

the shell only terminates when it sees an end-of-file in its input (a minimal sketch of this loop follows)
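
a minimal sketch of the loop in C, assuming the modern exec family stands in for execute (single-word commands only; no parsing, redirection, or pipes):

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char line[256];

    while (printf("%% "), fgets(line, sizeof line, stdin)) {  /* EOF ends the shell */
        line[strcspn(line, "\n")] = '\0';
        if (fork() == 0) {
            /* a redirect like "> file" would be handled here, in the child,
               by reopening descriptor 1 before the exec */
            execlp(line, line, (char *)NULL);
            _exit(127);              /* exec only returns if it failed */
        }
        wait(NULL);                  /* parent waits, then re-prints the prompt */
    }
    return 0;
}
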

initialization

the last step in Unix booting is executing the init command. init creates one process per available typewriter channel; each of these processes types out a login screen and waits for a user. the init parent waits for a termination, at which point it creates a new process for that typewriter channel and prints another login screen.

password file is checked after a user tries to log in. It contains a username, password, and the shell (or other program) to be run.

Traps

when an illegal action is caught the program terminates, and its image is written to the file core in the current directory

programs can be halted by sending the interrupt signal, which halts execution and does not write out the image to file

the quit signal is like interrupt but it does write out a core file

these hardware/user signals can be ignored or caught allowing programs (like shells or editors) to continue operation

Perspective

no predefined objectives, simply written on a spare computer for personal use with goal of a "comfortable relationship with the machine"

3 considerations (in retrospect)

  1. designed to write programs interactively.
    • interactive use is more fun than batch
    • initially only built for one user
  2. size constraints on system lead to economy and elegance
  3. from the beginning the system maintained itself, designers were using the system from the very beginning

since all programs need to be operable with any file/device it places all device-drivers into the OS

since the shell is just a user program it is easy to enhance, and actions like forking, redirection, background execution etc… are trivial

influences

not new ideas, but selection of particularly fertile ideas

  • fork from Berkeley
  • I/O routines from Multics
  • shell from Multics

Statistics

see paper for stats, presumably these are impressive

Observations on the Development of an Operating System

Observations on the Development of an Operating System

  • hypotheses
    1. Operating Systems can be divided into five kinds according to the style and direction of their development, independent of their structure.
    2. OSs take about 5-7 years to develop
  • focus on life-cycle of OS development, with the running example of the Pilot OS developed at Xerox

summary: No matter what you might think, or how disciplined your team is going in, when trying to build a new OS to be used by clients which represents a major step away from existing OSs, there will be delays and bloat. Expect 5-7 years before the system will be mature or useful or able to survive in the wild on its own.

Pilot

  1. kernel: 25,000 to 50,000 lines of Mesa code
  2. system development project: 250,000 lines of Mesa
    • kernel
    • debugger
    • compilers
    • librarian tools, etc…
  3. framework for thinking about designing/implementing systems for inter-subsystem and inter-computer communication

focus on 2nd meaning

Problems

size of the system: initially the kernel dominated the system size, but as outside functionality was absorbed and new tasks (development, running for multiple clients, etc…) added the system bloated both in and outside of the kernel

working set sizes: amount of real memory required to handle virtual memory without thrashing. Problems caused by the lure of virtual memory and lack of real feedback.

  • the working set of the kernel was almost constant across releases
  • at one point using more than double allowable working memory

programmer productivity: impossible to measure

holy wars:

  • processes and monitors vs. message passing
  • different file system access systems

virtual memory system: based on assumption that disk access was very slow (this in the end was not the case). would have been almost as efficient to treat the disk as synchronous rather than jump through the many complex hoops built for async disk access

pipes filters and streams: Mesa streams are supposed to be like unix pipes. These streams are rarely used because Mesa is more of a type-safe API based language.

Comparing Pilot and other OSs

5 system types

  1. favorite systems (e.g. unix)
    • hugely successful
    • develop a large user community outside of their developer base
    • begin life as simple unambitious projects
    • grow because new outside users find them easy to extend
  2. planned systems
    • cut from whole cloth
    • generally with organizational backing
    • goals/structures are the product of up front negotiations (not organic growth)
    • some succeed and some don't
  3. branches of existing systems
    • major changes from existing system, but still able to borrow much supporting software
  4. laboratory systems
    • make contributions to the "art and science" of OS design
    • never gain large user base
  5. worthless systems

Five to Seven year rule

For planned systems of the second kind expect 5-7 years before reaching a viable OS.

time-line

  1. planning design
  2. initial implementation: no OS clients so little to no testing/feedback
  3. initial functionality: some hardy users begin cutting through the forests of bugs and issues
  4. painful refinement, making users happy
  5. client buy in: if reached, this is when the community starts adapting to and adding to the OS

Systems of the second kind almost have to be too ambitious or general for anyone to finance them. Hence the propensity for overrun deadlines or outright failure.

Hints for Computer System Design

Hints for Computer System Design

Collection of hints gathered from the author's experience building a variety of systems.

Most important hints deal with interfaces which should

  1. be simple
  2. be complete
  3. admit a sufficiently small and fast implementation

Keep it simple

Perfection is reached not when there is no longer anything to add, but when there is no longer anything to take away. (A. Saint-Exupery)

  • don't try to put too much into an interface
  • do one thing and do it right
  • don't try to generalize too much
  • don't spend time making something fast unless it's really needed
  • get it right
    • don't expose functionality which if used will probably be used poorly
  • do it fast
    • a fast operation (if available/usable) is probably better than a powerful one
    • programs spend most of their time doing very simple things (loads, stores, incrementing, etc…)
  • don't hide power
    • if something works well and is useful at a low level, don't build abstractions on top of it
  • use procedure (functional) arguments
    • rather than defining a language of static arguments/options which the procedure then interprets (e.g. map, filter, etc…)
  • leave it to the client
    • relates to simplicity, only encode what is needed in every case in the interface, for the rest let the client build what she needs
    • unix, each command does one thing well, and the client connects them together

Continuity

  • keep basic interfaces stable
  • keep a place to stand
    • by implementing the old interface on top of the new one
    • world-swap debuggers, which re-create the memory on disk for stopping, inspecting, and restarting

Making implementations work

  • plan to throw one away
    • if you're doing something novel you will burn through at least one unusable prototype
  • keep secrets
    • assumptions of implementation that clients are not allowed to make
    • tension here with not hiding power

    An efficient program is an exercise in logical brinkmanship. (E. Dijkstra)

  • divide and conquer
    • recursive or bite-by-bite
  • use a good idea again
    • instead of generalizing it

Handling all the cases

  • handle normal and worst cases separately
    • different requirements
      • normal must be fast
      • worst must be possible

Speed

  • split resources in a fixed way if in doubt (easier than sharing)
  • use static analysis when possible
    • static analysis is analysis which doesn't require that the code be run
  • dynamic translation can be helpful.
    • translation in incremental steps between convenient readable representations to those that can be easily evaluated
  • cache answers to expensive computations
  • use hints: like cached answers, but they may be wrong, so it must be possible to check them
  • when in doubt use brute force: don't be too fancy, don't work around assumptions which may not hold
    • special purpose hardware (e.g. FPGA)
  • compute in background: take advantage of the lulls in activity
  • batch processing: when you can do it all at once (rather than incrementally) then it will probably be easier and more reliable
  • safety first: strive to avoid disaster before incrementally improving performance
  • shed load: if demand is outstripping resources, begin dropping clients

Fault-tolerance

The unavoidable price of reliability is simplicity. (C. Hoare)

  • end-to-end

    Error recovery at the application level is absolutely necessary for a reliable system, and any other error detection or recovery is not logically necessary but is strictly for performance. – Saltzer

    • intermediate checks only serve performance
  • log updates: it's cheap, reliable, and useful (like a transactional database)
  • make actions atomic or restartable

Conclusion

done

project

TODO paper [2/4]

DEADLINE: 2009-12-10 Thu

  • [X] go over 3-sched
  • [X] Con and LKML background
  • [X] data analysis
  • [X] look over results

BFS vs. CFS

Con vs. Ingo Molnar

according to Con Kolivas

  • BFS is simpler – ~9000 fewer lines of code than CFS
  • more appropriate for the loads of normal interactive desktop users
  • single runqueue -> much easier to guarantee global fairness
  • no heuristics which try to guess interactivity from analysis of sleep time
  • interactive tasks will naturally be scheduled with high priority because:
    • if they're just waking up then they haven't used up their CPU time
    • they will have earlier effective deadlines

according to Ingo Molnar

people are regularly testing 3D smoothness, and they find CFS good enough and that matches my experience as well (as limited as it may be). In general my impression is that CFS and SD are roughly on par when it comes to 3D smoothness.

there was simply no code in existence before CFS which has proven the code simplicity/design virtues of 'fair scheduling' - SD was more of an argument against it than for it. I think maybe even Con might have been surprised by that simplicity: in his first lkml reaction to CFS he also wrote that he finds the CFS code 'beautiful', and my reply to Con's mail still addresses a good number of points raised in this thread i think.

Linus on choosing CFS over SD

  • Con can't be trusted to maintain his code

    that was where the SD patches fell down. They didn't have a maintainer that I could trust to actually care about any other issues than his own.

    as a long-term maintainer, trust me, I know what matters. And a person who can actually be bothered to follow up on problem reports is a hell of a lot more important than one who just argues with reporters

SD (Staircase Deadline) Scheduler

Brain Fuck Scheduler

  • http://ck.kolivas.org/patches/bfs/bfs-faq.txt
    • Testing this scheduler vs CFS with the test app "forks" which forks 1000 tasks that do simple work, shows no difference in time to completion compared to CFS.  That's a load of 1000 on a quad core machine.

timeline

  • 1999 Con gets into linux, and at around 2.4.18 he began preparing his own patches merging desktop-performance patches into the kernel (e.g. O(1), preempt, low latency and compressed cache)
  • ck patchset seems to do great things for interactive kernel use

    One thing is for sure, the -ck patches before that one did an increadible job. Still, many years and hardware generations after, the best performing system I ever had (as in user experience, gapless audio playback while copying large and many files, …) was a 300 MHz Pentium II with probably 512 MB RAM running a 2.4 -ck kernel.

    My current systems still have gaps in Audio playback even though they are running at 1.8 GHz and more.

    I wish back my old system, just for playing audio.

  • 2002 Con is interviewed about ConTest (see here) a benchmarking tool which is heavily used by kernel developers
  • 2004 Con releases the Staircase scheduler (see here) (see this email)
  • Early 2007 Rotating Staircase Deadline scheduler (see here)
  • Linus seems amenable to RSDL mainline inclusion

    I agree, partly because it's obviously been getting rave reviews so far, but mainly because it looks like you can think about behaviour a lot better, something that was always very hard with the interactivity boosters with process state history.

  • the Staircase scheduler develops into the SD (Staircase Deadline) scheduler
  • early 2007 Ingo Molnar releases his own rewrite of Con's SD scheduler to much acclaim (see this node)
  • Con is not pleased (see this email)
  • mid 2007 Con stops updating the -ck patchset (see this email)

    It is clear that I cannot develop code for the linux kernel intended only to be used out of mainline and not have mainline get involved somewhere along the line. Whether it be the users or even other developers repeatedly asking "when will this be merged". This forever gets me into a cycle of actually trying to merge the stuff and … well you all know what happens at that point (again I had nastier words but decided not to use them.)

    So, I've had enough. I'm out of here forever. I want to leave before I get so disgruntled that I end up using windows. I may play occasionally with userspace code but for me the kernel is a black hole that I don't want to enter the event horizon of again.

  • Ingo responds to Con's release 2009-09-06 (see this email)

    I understand that BFS is still early code and that you are not targeting BFS for mainline inclusion - but BFS is an interesting and bold new approach, cutting a lot of code out of kernel/sched*.c, so it raised my curiosity and interest :-)

    Alas, as it can be seen in the graphs, i can not see any BFS performance improvements, on this box.

    So the testbox i picked fits into the upper portion of what i consider a sane range of systems to tune for - and should still fit into BFS's design bracket as well according to your description: it's a dual quad core system with hyperthreading.

  • Con responds 2009-09-07 (see this email)

    /me sees Ingo run off to find the right combination of hardware and benchmark to prove his point.

    [snip lots of bullshit meaningless benchmarks showing how great cfs is and/or how bad bfs is, along with telling people they should use these artificial benchmarks to determine how good it is, demonstrating yet again why benchmarks fail the desktop]

    I'm not interested in a long protracted discussion about this since I'm too busy to live linux the way full time developers do, so I'll keep it short, and perhaps you'll understand my intent better if the FAQ wasn't clear enough.

    Do you know what a normal desktop PC looks like? No, a more realistic question based on what you chose to benchmark to prove your point would be: Do you know what normal people actually do on them?

    Feel free to treat the question as rhetorical.

notes

real tests

base = "base"
results = Dir.entries(base).map do |e|
  if e.match(/.*out(\d+).*/)
    [Integer($1)] +
      File.read(File.join(base, e)).split("\n").map do |l|
        Integer($1) if l.match(/.*?(\d+) *usec.*/)
      end.compact
  end
end.compact
results.each{ |l| puts "|" + l.join(" | ") + "|" }
[latt results table: one row per client count 1-10 (work and wakeup latency stats); column separators lost in export]
data.each{ |l| puts "|"+l.join(" | ")+"|" }
[latt results table: one row per client count 1-20; column separators lost in export]
data.each{ |l| puts "|"+l.join(" | ")+"|" }
[latt results table: one row per client count 1-20; column separators lost in export]

test – new kernel

only taking stats from the first run as latt.c already does multiple runs for us and calculates error bars, etc…

base = "base"
results = Dir.entries(base).map do |e|
  if e.match(/.*out(\d+).*/)
    [Integer($1)] +
      File.read(File.join(base, e)).split("\n").map do |l|
        Integer($1) if l.match(/.*?(\d+) *usec.*/)
      end.compact
  end
end.compact
[latt results table: one row per client count 1-10; column separators lost in export]

work errorbars

data/netbook-cfs-clientyonly.png

frame drops
base = "./project/2.6.31.6_hausmaster-laptop/av/"
results = Dir.entries(File.join(base)).map do |e|
  if e.match(/out(\d+).txt/)
    [Integer($1)] +
      File.read(File.join(base, e)).map do |l|
      (l.match(/V\:(\d+)\:(\d+)/)) ? [Float($1), Integer($2)] : nil
    end.compact.map{|l,r| [100-((r / (l+1))*100)] }.last
  end
end.compact.each{ |l| puts "|"+l.join(" | ")+"|" }
| 1 | 100.0 |
| 2 | 99.8543335761107 |
| 3 | 99.4147768836869 |
| 4 | 98.3618763961281 |
| 5 | 97.6761619190405 |
| 6 | 96.6565349544073 |
| 7 | 94.296875 |
| 8 | 93.7795275590551 |
| 9 | 94.5797329143755 |
| 10 | 92.0255183413078 |

data/frame-drops.png

actually running some tests

latt.c

base = "./project/bfs"
results = Dir.entries(base).map do |e|
  if e.match(/i(\d+).out/)
    [Integer($1)] +
      File.read(File.join(base, e)).split("\n").map do |l|
      Integer($1) if l.match(/.*?(\d+) *usec.*/)
    end.compact
  end
end.compact.each{ |l| puts "|"+l.join(" | ")+"|" }
base = "./project/bfs"
Dir.entries(base).map do |e|
  if e.match(/i(\d+).out/)
    [Integer($1)] +
      File.read(File.join(base, e)).map{|l| Integer($1) if l.match(/.*?(\d+) *usec.*/)}.compact
  end
end.compact.each{ |l| puts "|"+l.join(" | ")+"|" }

work errorbars

wakeup errorbars

all on one

jeff's results
[two latt results tables: one row per client count 1-10; column separators lost in export]

work errorbars

wakeup errorbars

taylor results
[two latt results tables: one row per client count 1-10; column separators lost in export]

work errorbars

wakeup errorbars

results of the initial short run
| clients | max | avg | stdev | stdev mean | max | avg | stdev | stdev mean |
[data rows for clients 1-10 lost in export: column separators were stripped]
| clients | max | avg | stdev | stdev mean | max | avg | stdev | stdev mean |
[data rows for clients 1-10 lost in export: column separators were stripped]

building the kernel

initial build
  1. cd into the kernel directory
  2. copy your local configuration into the kernel config
    cp /boot/config-`uname -r` ./.config 
    
    
  3. run the menuconfig
    make menuconfig
    
    

    select the "load configuration" option, load your the .config file, and then exit

  4. now you can try to make the kernel with make
  5. install the build tools, and header files
    sudo apt-get install build-essential linux-headers-2-...
    
    
  6. still didn't work, then switched to the unstable debian repos (replaced "lenny" with "unstable" in /etc/apt/sources.list)
  7. with unstable I installed libc6-dev and tried again
  8. now missing zlib instead of eventfd.h
  9. installing zlib
    sudo apt-get install zlib1g-dev
    
    
  10. make the kernel with make. This spits out the following error message, but seems to succeed regardless
    make[1]: *** No rule to make target `just'. Stop.
    make: [Just] Error 2
    
  11. now make the Debian kernel package, presumably with the same make-kpkg invocation used for the BFS build below (minus the version suffix)
    fakeroot make-kpkg clean
    fakeroot make-kpkg --initrd kernel_image kernel_headers
    
    
  12. install the resulting .deb file
    dpkg -i linux-image......
    
    
  13. rebooted using the new kernel and it worked
bfs patch

Applied the BFS patch

  1. downloaded from …
  2. applied
    patch -p1 < bfs-patch...
    
    
secondary build
  1. make the BFS-patched kernel
    fakeroot make-kpkg clean
    fakeroot make-kpkg --initrd --append-to-version=-bfs kernel_image kernel_headers
    
    
  2. install the resulting kernel
    sudo dpkg -i linux-image....bfs...deb
    
    

links

file:data/10.1.1.59.6385.pdf

History of the linux kernel

Linux test suite

CFS

Linus on CFS vs SD
  • http://kerneltrap.org/node/14008

Completely Fair Scheduler
  • http://en.wikipedia.org/wiki/Completely_Fair_Scheduler
  • http://kerneltrap.org/node/8059
  • http://www.linuxinsight.com/files/sched-design-CFS.txt

  • CFS design document

SD Scheduler
  • http://kerneltrap.org/SD_scheduler
  • http://lwn.net/Articles/231973/

  • It has bound latency. CFS can't guarantee either as well as SD can. SD allows one to set the exact scheduling priority of everything and it is always respected, as there is no interactive renicing: it is very predictable.

Brain Fuck Scheduler

http://ck.kolivas.org/patches/bfs/bfs-faq.txt

  • Testing this scheduler vs CFS with the test app "forks" which forks 1000 tasks that do simple work, shows no difference in time to completion compared to CFS.  That's a load of 1000 on a quad core machine.

Scheduler Benchmarking

  • http://kerneltrap.org/mailarchive/linux-kernel/2007/9/17/261647
  • http://lkml.org/lkml/2007/9/13/385
  • http://devresources.linux-foundation.org/craiger/hackbench/

  • Hackbench benchmarking program

• Testing this scheduler vs CFS with the test app "forks" which forks 1000 tasks that do simple work, shows no difference in time to completion compared to CFS.  That's a load of 1000 on a quad core machine.
• The 'latt' test app recently written by Jens Axboe is a better place for simpler to understand and useful numbers.

• 3D Smoothness testing

other schedulers to implement

lottery scheduler

lottery-scheduling

seems nice, nice math/stat background

GA scheduler

somehow evolve different scheduling algorithms

testing suites

Scheduler Benchmarking

project tasks [1/1]

DONE project proposal

DEADLINE: 2009-09-25 Fri

  • 2-page proposal/description
    • motivation
      • novel
      • solving problem
      • test (conventional wisdom)
      • measuring
      • comparing
    • objective
    • background
      • related work
      • literature
    • methodology
      • approach
      • hypothesis
      • validation
      • challenges
        • make sure reasonable for time span
        • make sure we have resources
    • expected results / impact
  • 1-3 people group
  • would prefer hardcopy, but a PDF is fine
  • project need not be completely defined, but should touch on potential sticking points

outline / topic

CPU scheduling http://kerneltrap.org/node/14008

Motivation - learn about kernels, proper testing environments and scheduling polices and mechanisms.

Objective - compare the CFS and SD schedulers from 2007. Identify and quantify the differences between them.

Hypothesis - As indicated in the discussion between Linus Torvalds and Kasper Sandberg, we expect the CFS and SD schedulers to each perform better in certain niches.

Methodology - Use existing methodology to test responsiveness and throughput (read pg. 704)

Challenges - Setting up a valid testing and development environment. Development and testing will most likely be different (VM vs. Physical Machine). Putting together a good test suite to test different types of usage. How to evaluate performance as it's running. How does our choice in hardware affect the outcome of the results (choosing the hardware model that best)?

composition (challenges)

Challenges
Setting up a valid testing and development environment. Development and testing will most likely be different (VM vs. Physical Machine). Putting together a good test suite to test different types of usage. How to evaluate performance as it's running. How does our choice in hardware affect the outcome of the results (choosing the hardware model that best)?
  • testing and development environment
    • most likely different environments for development and for testing
    • VM, kernel module, algorithmic simulation
  • test suite
    • define what is meant by "interactive" use
    • tailored to the particular aims of our investigation
    • popular (so our results can be compared to others)
    • how to perform a "live" evaluation of the performance
      • Heisenberg uncertainty principle
  • impacts of hardware on results
  • resources
    • hardware
    • test suite

note

Also, something to note about the history of linux schedulers is that the SD scheduler was never merged into the mainline kernel. The predecessor to CFS was the "O(1) Scheduler." The SD scheduler was more of a contemporary competitor to the CFS that lost out.

final

intro

The release of the Completely Fair Scheduler in 2007 sparked significant debate on various Linux kernel mailing lists and forums. Compared to its rival (SD), which used run-queues, CFS utilizes a time-ordered red-black tree. While the CFS design implemented a "radical" shift in data structures, the benefits are not immediately visible. In several instances the SD scheduler was reported to handle 3D gaming better, providing a smoother display to the user. SD was viewed as the reference in the development of CFS, yet it seems the decision to include CFS in the mainline was partially political. As Linus Torvalds was quoted, "[A] person [Ingo] who can actually be bothered to follow up on problem reports is a hell of a lot more important than one who just argues with reporters [Con]".
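
a toy illustration of the CFS idea (mine, not kernel code): runnable tasks are ordered by virtual runtime and the scheduler always picks the minimum. the real implementation keeps tasks in a red-black tree so this pick is O(log n); a flat array stands in here:

struct task { unsigned long long vruntime; /* ... */ };

/* pick the runnable task that has received the least virtual runtime */
struct task *pick_next(struct task rq[], int n)
{
    struct task *next = &rq[0];
    for (int i = 1; i < n; i++)
        if (rq[i].vruntime < next->vruntime)
            next = &rq[i];
    return next;
}
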

Our objective is to analyze the differences between the two methods of scheduling (including patched versions) and to determine the possible benefits of using one system over the other. This implies a wide range of testing procedures in order to provide a balanced perspective on the debate. A secondary goal is to gain first hand experience with kernels, proper testing environments, scheduler policies and mechanisms.

We hypothesize that the performance of early versions of the CFS scheduler does not match that of SD, but that through tweaking and applied patches, CFS surpasses SD in performance.

methodology

Testing the schedulers will require modifying the Linux kernel. We will investigate modifying the kernel on two different levels:

  • The first is to implement schedulers as individual kernel modules. This way is preferred as we would not have to recompile and maintain independent kernels but instead have individual scheduling modules compiled for the same kernel. We could specify which scheduler to use as a boot flag or, ideally, on the fly–if possible.
  • If using kernel modules is not possible, then we will be required to compile and install independent kernels for each of the schedulers that we want to test. These will be chosen from at boot time.

The CFS scheduler is presently in the mainline kernel (true as of 2.6.23). Implementing the SD scheduler will require applying patches against the mainline kernel. If we desire to separate the schedulers into individual kernel modules, this will require adaptation of the patches.

After our schedulers are implemented and ready for testing, we will concentrate on devising effective tests and benchmarks with which to evaluate them. We will be evaluating the schedulers according to the following criteria:

CPU utilization
how effectively can the scheduler utilize the CPU
Throughput
the rate at which jobs are completed
Turnaround time
the time it takes to finish a job
Waiting time
the time a job spends in a waiting queue
Response time
the interval between activations on the waiting queue

We will research existing benchmarks for testing schedulers and only write our own as a last resort when no other appropriate benchmarks can be found. In addition to artificial benchmarks, we will also perform real world tests, such as listening to music when other processes are hogging the processor and benchmarking games such as Unreal Tournament 2004.

In addition to the above, we are also interested in exploring the following optional paths:

  • Testing Kolivas's Brain Fuck Scheduler (BFS)–this is a recent (August 2009) successor to the SD scheduler
  • Implementing control group schedulers such as round-robin to become more comfortable with writing our own schedulers
  • Experimenting with possible improvements to the schedulers, such as by tweaking parameters
challenges

There will be a number of challenges inherent in carrying out our methodology. The first is establishing appropriate kernel development and testing environments. Each of these environments has different requirements

development
A good development environment should allow for a reasonably quick closed testing loop for new code, and should be well protected from the unpredictable and likely harmful side effects of experimental code. Given these restrictions a good development environment will likely be contained inside of a VM, or on an expendable piece of hardware.
testing
A good testing environment should resemble as closely as possible the actual production environment of the kernel. For this reason we will probably test directly on a physical machine, rather than through a virtual machine. If a wider variety of hardware is desired than is available, some sort of "simulated" test environment may be required. Such a simulated scheduling environment would allow more flexibility in varying simulated hardware components and the related performance-determining constants, but may yield less veracious results.

Once we have established an acceptable development and testing framework the next challenge will be the acquisition of a suitable testing suite. Two issues related to the availability of a test suite are the possibly prohibitive cost of high quality "standard" test suites and the potential lack of any widely accepted test suites directed at the particular aims of our study (specifically scheduler performance over different "types" of load including interactive use and batch use).

Some tradeoff will have to be made between the amount of information returned by a test suite $\Delta P$, and the suite's impact on the load $\Delta L$ on the system. A situation similar to the Heisenberg uncertainty principle is expected, where increasing the precision of our knowledge of the system at any point decreases our knowledge of the load, such that the two are only knowable up to some hardware constant $\hbar$.

$$ \Delta P \times \Delta L \geq \frac{\hbar}{2} $$

If this tradeoff proves untenable then we may be required to resort to a simulated test environment, or a scheme of partitioning the running system inside of a virtual machine and collecting our metrics from outside of the machine.

implementation

kernel
2.6.31 (this is what the current BFS patch is against)

exams [1/3]

TODO final exam

DEADLINE: 2009-12-15 Tue 07:30
in classroom

TODO final review

DEADLINE: 2009-12-11 Fri 09:00
in CS141

DONE midterm

DEADLINE: 2009-10-13 Tue

  • format
    • questions like the reading response questions
    • essay questions
  • topics
    • kernel design
    • memory management
    • virtualization
  • test general OS concepts
  • care less about specifics, and more about the effects of the mechanisms
    • not how did x solve y, rather, how could one solve y

topics

OS structure
standard monolithic
entire OS is in kernel space
pros
faster (less context switching)
cons
  • complexity, size
  • less flexible/extensible: can't customize w/o changing kernel-space code
  • harder to move to new hardware
  • less secure/stable (more low-level components to keep track of)
µ-kernel
only supports basic structures (in L4: address spaces, threads, scheduling, and IPC) and pushes the rest of the OS out into user-space servers
pros
  • simpler
  • easier to move to new hardware
  • flexible
  • more secure/reliable because of the simplicity of the low-level interface
cons
  • slower
exokernel
only does multiplexing of HW resources, rest of OS is in user-space libraries. end-to-end argument: the application knows best how to handle its own resources.
pros
  • direct access to hardware
  • flexible
cons
  • no security gains like in µ-kernel
  • cooperation
virtualization structures
as example of general system management structure
  • fault containment
  • porting old OS to new hardware
  • slower

understand

  • implication of these structures to the performance of the OS
    • micro-benchmarks
    • macro-benchmarks (applications)
  • implication for extensibility of the OS
  • separation of protection of resources, mechanisms, policy
processes and threads
address spaces
virtual memory of a process
process state
multi/batch/time-sharing programming
multi-programming
multiple tasks, can be single user
time-sharing
multiple tasks, normally multiple users
batch
space sharing rather than time sharing, but the CPU is generally only given to one task at a time. queue of processes
context switch
swap

models of communication

  • message passing
  • shared memory (on different architectures NUMA, UMA, etc…)

synchronization

monitors
software constructs surrounding critical sections
semaphores (less)
counting or binary, primitive locks; a counting semaphore counts how many processes can be in the critical section simultaneously, decrementing as each one enters (see the sketch after this list)
mutex
simple binary semaphore (potentially has additional features protecting against priority-inversion)
critical section
section of code which is run inside of a lock, semaphore etc…
deadlock
(see deadlock)
condition variables, their semantics
used for IPC; they avoid spinning.
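
a minimal sketch of a counting semaphore guarding a critical section that admits up to K entrants (POSIX unnamed semaphores):

#include <semaphore.h>

#define K 3
static sem_t slots;

void init(void)  { sem_init(&slots, 0, K); }  /* K may enter at once */
void enter(void) { sem_wait(&slots); }        /* decrements; blocks at zero */
void leave(void) { sem_post(&slots); }        /* increments; wakes a waiter */
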

Models (see notes above)

  • kernel threads
  • user threads
  • hybridization
  • scheduler activations paper
process scheduling
metrics to consider
responsiveness (time from submission to first response), submission to completion, wait time (sum of the time spent on the ready queue), throughput (jobs completed per unit time), turnaround (from start to finish)
user-centric
response, wait, turnaround
system-centric
throughput, utilization
preemptive vs. non-preemptive
non-preemptive scheduling can't knock a process off of the CPU until the process yields; preemptive scheduling can
fair scheduling
CPU is equally distributed between users or groups rather than among processes
memory management
working set
set of pages needed while running
thrashing
when the working set doesn't fit in memory, when the OS spends more time paging than executing
allocating memory (contiguous vs. non-contiguous)
contiguous maps the address space directly to physical memory through a base and offset; non-contiguous (like paging) allows individual pages to be loaded w/o loading the entire address space at once.
address space protection
gained through paging or segmentation
segmentation
like in the Multics paper
  • variable length
  • semantics (program or data)
  • permissions like on files
  • potentially with a directory structure
paging
allocation, selection, levels of caches, replacement
  • fixed size
  • less semantics than segments
  • mapping pages to disk
  • page faults are resolved as high up the cache hierarchy as possible
  • LRU, stuff like that
copy-on-write
p.325 Dinosaur book
memory-mapped IO
map a section of memory to a place on disk; then all you have to do is write to memory. copies part of the disk into RAM (see the sketch after this list)
  • this requires explicit handling in the user-level application. initial system call to set it up (open/close)
  • faster, only have to write to memory (and it will later be written to the mapped portion of the disk)
  • there is an explicit system call to sync to disk
  • might be asynchronous
  • slower for changes to propagate to disk
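
a minimal sketch of the flow described above, assuming a pre-existing file data.bin of at least one page:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);      /* explicit setup call */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    memcpy(p, "hello", 5);                  /* "writing to disk" is just a store */
    msync(p, 4096, MS_SYNC);                /* explicit system call to sync to disk */
    munmap(p, 4096);
    close(fd);
    return 0;
}
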
miscellaneous
  • reliability
  • scalability (clients, processors, resources, etc…)
    weak scaling
    increase workload as increase resources (constant time)
    strong scaling
    decrease time as increase resources (constant workload)

question

difference between grafting and co-location?

or between co-location and threads

how can secure binding ensure good behavior after binding?

concepts / terms

MIPS

http://en.wikipedia.org/wiki/Search?search=MIPSarchitecture

originally stood for "Microprocessor without Interlocked Pipeline Stages"

it is a RISC instruction set architecture.

proportional share scheduling

each entry is given some portion of the system relative to other entries' proportions, or relative to the total amount of the resource

each process is assigned some portion relative to the portions of other applications (see the sketch below)
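
a toy sketch of one way to realize this, lottery style (cf. the lottery scheduling paper mentioned in the project notes): each process holds tickets in proportion to its share and a random ticket is drawn each quantum:

#include <stdlib.h>

struct proc { int tickets; /* ... */ };

/* draw a random ticket; the holder of that ticket runs next */
struct proc *pick(struct proc procs[], int n, int total_tickets)
{
    int winner = rand() % total_tickets;
    for (int i = 0; i < n; i++) {
        winner -= procs[i].tickets;
        if (winner < 0)
            return &procs[i];
    }
    return NULL;   /* unreachable if total_tickets is consistent */
}
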

interrupts

p.499 in dinosaur

kernel traps

remote procedure call RPC

translation look-aside buffer TLB

quick overview of the mach µ-kernel

overview

System Design Principles

Unix

µ-kernel

Exokernel

Hoare and Mesa Monitors

Scheduler Activations

Lottery Scheduling

Resource Containers

Disco

Multics and VAX Memory Management

Mach Memory Management