Operating Systems with Linux
Computer Science from the Bottom Up
This book aims to work from operating system fundamentals through to how user applications are compiled and executed.
1. General Unix and Advanced C
Everything is a file!
Implementing abstraction
Abstraction is implemented by what is generically termed an Application Programming Interface (API).
Implementing abstraction with C
A common method used in the Linux kernel and other large C code bases is function pointers.

#include <stdio.h>
#include <stdlib.h>

/* The API to implement */
struct greet_api
{
    int (*say_hello)(char *name);
    int (*say_goodbye)(void);
};

/* Our implementation of the hello function */
int say_hello_fn(char *name)
{
    printf("Hello %s\n", name);
    return 0;
}

/* Our implementation of the goodbye function */
int say_goodbye_fn(void)
{
    printf("Goodbye\n");
    return 0;
}

/* A struct implementing the API */
struct greet_api greet_api = {
    .say_hello = say_hello_fn,
    .say_goodbye = say_goodbye_fn
};

/* main() doesn't need to know anything about how
 * say_hello/say_goodbye work, it just knows that they do */
int main(int argc, char *argv[])
{
    greet_api.say_hello(argv[1]);
    greet_api.say_goodbye();
    printf("%p, %p, %p\n", greet_api.say_hello, say_hello_fn, &say_hello_fn);
    exit(0);
}
Libraries
File Descriptors
The Shell
2. Binary and Number Representation
Binary — the basis of computing: Binary Theory, Hexadecimal, Practical Implications; Types and Number Representation: C Standards, Types, Number Representation
3. Computer Architecture
The CPU: Branching, Cycles, Fetch/Decode/Execute/Store, CISC v RISC; Memory: Memory Hierarchy, Cache in depth; Peripherals and buses: Peripheral Bus concepts, DMA, Other Buses; Small to big systems: Symmetric Multi-Processing, Clusters, Non-Uniform Memory Access, Memory ordering, locking and atomic operations
4. The Operating System
The role of the operating system
Abstraction of hardware
Multitasking
Standardised Interfaces
Security
Performance
Operating System Organisation
The Kernel
Userspace
Each program runs in userspace, talking to the kernel through system calls.
System Calls
Overview
Every system call has a system call number which is known by both userspace and the kernel. For example, both know that system call number 10 is open(), system call number 11 is read(), etc.
The Application Binary Interface (ABI) is very similar to an API, but for the hardware. The ABI will define which registers
- the system call number should be put in
- the arguments for the system call should be put in
To actually perform the system call, all architectures define an instruction that signals to the hardware we wish to make a system call.
- This instruction will tell the hardware to modify the instruction pointer to point to the kernel's system call handler (set up by the OS). So once the userspace program issues the instruction (via libc), it has lost control of the program and passed it over to the kernel.
- Then, the kernel looks in the predefined register for the system call number, and looks it up in a table to see which function it should call. This function is called, does what it needs to do, and places its return value into another register defined by the ABI as the return register.
- The final step is for the kernel to make a jump instruction back to the userspace program, so it can continue from where it left off. The userspace program gets the data it needs from the return register, and continues happily on its way!
Analysing a system call
To illustrate the mechanism behind system calls:

#include <stdio.h>
/* for syscall() */
#include <sys/syscall.h>
#include <unistd.h>
/* system call numbers */
#include <asm/unistd.h>

void function(void)
{
    int pid;
    pid = syscall(__NR_getpid);
}

By convention under Linux, system call numbers are defined in the asm/unistd.h file from the kernel source, and each system call number is given a #define name consisting of __NR_ prefixed to the call name (e.g. __NR_getpid).
- PowerPC system call example

#define _syscall0(type,name) \
type name(void) \
{ \
    __syscall_nr(0, type, name); \
}

- x86 (i386) system call example

#define _syscall0(type,name) \
type name(void) \
{ \
    long __res; \
    __asm__ volatile ("int $0x80" \
        : "=a" (__res) \
        : "0" (__NR_##name)); \
    __syscall_return(type,__res); \
}
Privileges
Applications should not be able to overwrite each other's memory or files, and should only access system resources as dictated by system policy.
Hardware
By forcing the application to request resources through a system call into the kernel, the kernel can enforce rules about what sort of access can be provided.
Other ways of communicating with the kernel
- ioctl()
- File systems: proc, sysfs, debugfs, etc.
5. The Process
What is a process?
We can think of each process as a bundle of elements kept by the kernel to keep track of all these running tasks.
Elements of a process
Process ID
Memory
Stacks are fundamental to function calls. Each time a function is called it gets a new stack frame.
This is an area of memory which usually contains, at a minimum,
- the address to return to when complete
- the input arguments to the function
- space for local variables.
- Each function is allocated a new stack frame with its arguments in a fresh area of memory.
- Each frame contains the address to return to
- Because each frame has a reference to the one before it, a debugger can "walk" backwards, following the pointers up the stack (a stack trace).
The heap is an area of memory for variables whose memory requirements are not known at compile time.
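As a small illustration (the buffer names and sizes below are made up), a C function can mix both kinds of storage:

#include <stdio.h>
#include <stdlib.h>

/* local_buf lives in the function's stack frame and disappears when the
 * function returns; heap_buf lives on the heap because its size is only
 * known at run time. */
void demo(int n)
{
    char local_buf[64];          /* stack: size fixed at compile time */
    char *heap_buf = malloc(n);  /* heap: size decided at run time */

    if (heap_buf == NULL)
        return;

    snprintf(local_buf, sizeof(local_buf), "heap buffer of %d bytes", n);
    printf("%s at %p\n", local_buf, (void *)heap_buf);

    free(heap_buf);              /* heap memory must be released explicitly */
}

int main(void)
{
    demo(128);
    return 0;
}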
File Descriptors
File descriptors are kept by the kernel individually for each process.
Registers
The operating system needs to store a copy of the CPU registers to memory. When it is time for the process to run again, the operating system will copy the register values back from memory to the CPU registers and the process will be right back where it left off.
Kernel State
The kernel needs to keep track of a number of elements for each process.
- Process state: run, wait
- Priority
- Statistics: I/O, CPU usage
Process Hierarchy
The kernel only ever directly starts one process, called init. Its PID is always 1. All other processes can be considered children of this initial process.
PID TTY      STAT   TIME COMMAND
  1 ?        Ss     0:03 /sbin/init splash
  2 ?        S      0:00 [kthreadd]
  3 ?        I<     0:00 [rcu_gp]
  4 ?        I<     0:00 [rcu_par_gp]
  6 ?        I<     0:00 [kworker/0:0H-kblockd]
  9 ?        I<     0:00 [mm_percpu_wq]
 10 ?        S      0:00 [rcu_tasks_rude_]
 11 ?        S      0:00 [rcu_tasks_trace]
 12 ?        S      0:00 [ksoftirqd/0]
 13 ?        I      1:00 [rcu_sched]
 14 ?        S      0:00 [migration/0]
 15 ?        S      0:00 [idle_inject/0]
 16 ?        S      0:00 [cpuhp/0]
 17 ?        S      0:00 [cpuhp/1]
 18 ?        S      0:00 [idle_inject/1]
 19 ?        S      0:00 [migration/1]
 20 ?        S      0:00 [ksoftirqd/1]
 22 ?        I<     0:00 [kworker/1:0H-events_highpri]
 23 ?        S      0:00 [cpuhp/2]
 24 ?        S      0:00 [idle_inject/2]
 25 ?        S      0:00 [migration/2]
 26 ?        S      0:00 [ksoftirqd/2]
 28 ?        I<     0:00 [kworker/2:0H-events_highpri]
 29 ?        S      0:00 [cpuhp/3]
 30 ?        S      0:00 [idle_inject/3]
 31 ?        S      0:00 [migration/3]
 32 ?        S      0:00 [ksoftirqd/3]
 34 ?        I<     0:00 [kworker/3:0H-events_highpri]
 35 ?        S      0:00 [cpuhp/4]
 36 ?        S      0:00 [idle_inject/4]
 37 ?        S      0:00 [migration/4]
 38 ?        S      0:00 [ksoftirqd/4]
...
Fork and Exec
New processes are created by the two related interfaces fork and exec.
Fork
The fork() system call will create a new process that is exactly the same as the parent process. This means all the state that was talked about previously is copied, including open files, register state and all memory allocations, which includes the program code.
The return value from the system call is the only way the process can determine if it was the existing process or a new one:
- the return value to the parent process will be the Process ID (PID) of the child
- the child will get a return value of 0
Exec
The exec() system call is used to run a different program, but it does not create a new process: exec() will replace the contents of the currently running process with the information from a program binary.
Thus, the usual way for a process to launch a new program is to first fork, creating a new process, and then exec (i.e. load into memory and execute) the program binary it is supposed to run.
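A minimal sketch of this fork-then-exec pattern (the program being launched, /bin/ls, is just an illustrative choice):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == -1) {
        perror("fork");
        exit(1);
    }

    if (pid == 0) {
        /* Child: fork() returned 0. Replace ourselves with a new program. */
        execl("/bin/ls", "ls", "-l", (char *)NULL);
        perror("execl");   /* only reached if exec fails */
        exit(1);
    }

    /* Parent: fork() returned the child's PID; wait for it to finish. */
    printf("parent: started child with pid %d\n", pid);
    waitpid(pid, NULL, 0);
    return 0;
}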
How Linux actually handles fork and exec
In the kernel, fork is actually implemented by a clone system call. This clone interface effectively provides a level of abstraction in how the Linux kernel can create processes.
clone allows you to explicitly specify which parts of the new process are copied into the new process, and which parts are shared between the two processes.
For the Linux kernel itself, there's absolutely no difference between what userspace sees as processes (the result of fork) and as threads (the result of pthread_create).
Both are represented by the same data structures (task_struct) and scheduled similarly.
Both processes and threads come into existence through a single Linux system call - clone.
Whatever perceived difference there is between processes and threads on Linux is achieved through passing different flags to clone.
The differences are mostly about what is shared between this new task and the task that started it.
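As a sketch of what that looks like from userspace (a simplified illustration using the glibc clone() wrapper; the flag set chosen here is just an example, not the exact set pthread_create passes):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Entry point for the new task created by clone() */
static int child_fn(void *arg)
{
    printf("child task: pid=%d\n", getpid());
    return 0;
}

int main(void)
{
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack)
        return 1;

    /* Thread-like flags share memory, filesystem info and open files;
     * dropping them gives fork-like behaviour (the child gets copies). */
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD;

    /* The stack grows down on most architectures, so pass its top. */
    pid_t pid = clone(child_fn, stack + stack_size, flags, NULL);
    if (pid == -1) {
        perror("clone");
        return 1;
    }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}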
The init process
On boot, the kernel starts the init process, which then forks and execs the system's boot scripts. These fork and exec more programs, eventually ending up forking a login process. If a process has exited but its parent has not yet collected its return code, the "dead" (no longer running) process still needs to stay around until the return code is collected.
A process in this state is called a zombie .
Context Switching
Context switching refers to the process the kernel undertakes to switch from one process to another.
Scheduling
The part of the kernel that keeps track of all these processes is called the scheduler because it schedules which process should be run next.
Preemptive v co-operative scheduling
Scheduling strategies broadly fall into two categories:
- Co-operative scheduling: the currently running process voluntarily gives up executing to allow another process to run.
- Preemptive scheduling: the running process is interrupted to allow another process to run; the hardware handles the interrupt independently of the running process.
The scheduler can decide which process to run next.
Each process gets a time-slice to run in.
Realtime
Hard realtime systems make guarantees about scheduling decisions, such as the maximum amount of time a process will be interrupted before it can run again.
Nice value
UNIX systems assign each process a nice value. The scheduler looks at the nice value: a process with a higher nice value is being "nicer" to other processes, and so it receives a lower priority.
A brief look at the Linux Scheduler
The Linux scheduler described here is known as the O(1) scheduler. The O(1) scheduler uses a run queue structure as described below.
- The run queue has a number of buckets in priority order and a bitmap that flags which buckets have processes available.
- Picking the next process to run involves reading the bitmap to find the first bucket with processes available, then taking the first process off that bucket's queue (a sketch of this follows the list).
- The scheduler keeps two such structures, an active and expired array for processes that are runnable and those which have utilised their entire time slice respectively.
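As a rough sketch of that structure (a heavily simplified model with assumed sizes, not the kernel's actual code):

#include <stddef.h>
#include <stdio.h>

#define NUM_PRIOS 140
#define WORD_BITS 32
#define NUM_WORDS ((NUM_PRIOS + WORD_BITS - 1) / WORD_BITS)

/* One FIFO list per priority plus a bitmap marking which priority
 * buckets currently hold runnable tasks. */
struct task {
    struct task *next;
    int prio;
};

struct prio_array {
    unsigned int bitmap[NUM_WORDS];  /* bit set => bucket non-empty */
    struct task *queue[NUM_PRIOS];   /* head of each priority bucket */
};

/* Picking the next task is constant-time: scan the (fixed-size) bitmap
 * for the first set bit, then take the head of that bucket's list. */
static struct task *pick_next(struct prio_array *a)
{
    for (size_t w = 0; w < NUM_WORDS; w++) {
        if (a->bitmap[w] == 0)
            continue;
        unsigned int word = a->bitmap[w];
        int bit = 0;
        while (!(word & 1u)) {   /* find the lowest set bit */
            word >>= 1;
            bit++;
        }
        return a->queue[w * WORD_BITS + bit];
    }
    return NULL;  /* no runnable tasks */
}

int main(void)
{
    static struct task t = { .next = NULL, .prio = 10 };
    struct prio_array rq = { 0 };

    rq.queue[t.prio] = &t;
    rq.bitmap[t.prio / WORD_BITS] |= 1u << (t.prio % WORD_BITS);

    struct task *next = pick_next(&rq);
    printf("next task has priority %d\n", next ? next->prio : -1);
    return 0;
}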
The Shell
The primary job of the shell is to help the user handle starting, stopping and otherwise controlling processes running in the system. When you type a command at the prompt of the shell, it will fork a copy of itself and exec the command that you have specified.
Signals
On UNIX there is infrastructure between the kernel and processes called signals which allows a process to receive notification about events important to it.
When the kernel notices that you are touching memory outside your allocation, it will send you the segmentation fault signal.
Usually the process will not have a handler installed for this, and so the default action to terminate the program ensues (hence your program "crashes").
Example
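A minimal sketch of installing a handler for SIGINT, the signal sent when you press Ctrl-C (the choice of signal here is just an illustrative example):

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigint = 0;

/* The handler just records that the signal arrived; doing real work in a
 * signal handler is unsafe, so the main loop acts on the flag instead. */
static void handle_sigint(int signum)
{
    (void)signum;
    got_sigint = 1;
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_sigint;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    printf("press Ctrl-C to send SIGINT\n");
    while (!got_sigint)
        pause();   /* sleep until a signal is delivered */

    printf("caught SIGINT, exiting cleanly\n");
    return 0;
}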
6. Virtual Memory
What Virtual Memory isn't
Swap space is disk space used to hold unused parts of memory that have been swapped out of main memory to free it up (remember, programs can only execute from main memory).
What virtual memory is
The address space of a processor refers to the range of possible addresses that it can access when loading and storing to memory. The address space is limited by the width of the registers, since to load from an address we need to issue a load instruction with that address stored in a register.
For example, registers that are 32 bits wide can hold addresses from 0x00000000 to 0xFFFFFFFF. 2^32 bytes is equal to 4GB, so a 32-bit processor can load or store to up to 4GB of memory.
64 bit computing
Every program compiled in 64-bit mode requires 8-byte pointers, which can increase code and data size, and hence impact both instruction and data cache performance.
Using the address space
While 64-bit processors have 64-bit wide registers, systems generally do not implement all 64 bits for addressing. Thus most architectures define an unimplemented region of the address space which the processor will consider invalid for use.
The result of this is that the total address space is effectively divided into 3 parts: an upper, an invalid, and a lower portion.
virtual memory acts as an abstraction between the address space and the physical memory available in the system.
When a program uses an address, that address does not refer to the bits in an actual physical location in memory.
Therefore, all addresses a program uses are virtual.
The operating system keeps track of virtual addresses and how they are mapped to physical addresses.
When a program does a load or store from an address, the processor and operating system work together to convert this virtual address to the actual address in the system memory chips.
Pages
The total address space is divided into individual pages. Each page has a number of attributes set by the operating system.
Generally, these include read, write and execute permissions for the current page.
For example, the operating system can generally mark the code pages of a process with an executable flag and the processor can choose to not execute any code from pages without this bit set.
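A sketch of how a userspace program can observe these per-page permissions, using mmap() and mprotect() (a minimal example; the page size is queried rather than assumed):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);

    /* Ask the OS for one page, initially readable and writable. */
    char *page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    strcpy(page, "hello");           /* allowed: the page is writable */

    /* Drop the write permission; the kernel updates the page attributes. */
    if (mprotect(page, page_size, PROT_READ) == -1) {
        perror("mprotect");
        return 1;
    }

    printf("still readable: %s\n", page);
    /* page[0] = 'H'; would now fault with SIGSEGV */

    munmap(page, page_size);
    return 0;
}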
Physical Memory
Just as the OS divides the possible address space into pages, it divides the available physical memory up into frames.
The operating system keeps a frame-table, which is a list of all the frames of physical memory.
When memory is allocated to a process, it is marked as used in the frame-table.
How does the operating system know what memory is available?
This information is passed to the operating system by the BIOS during initialisation.
Pages + Frames = Page Tables
It is the job of the operating system to keep track of which virtual page points to which physical frame. This information is kept in a page-table.
Virtual Addresses
The userspace program accesses memory using virtual addresses. A virtual address consists of two parts: the page and an offset into that page.
- Page The page component of the virtual address acts as an index into the page table.
- Offset The last bits of the virtual address are called the offset.
- Virtual Address Translation Virtual address translation refers to the process of finding out which physical frame a virtual page maps to.
For a 4K page you require 12 bits of offset (4K == 4 * 1024 == 4096 == 2^12).
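A sketch of that split for a 32-bit virtual address with 4K pages (the example address is arbitrary):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                      /* 4K pages: 2^12 bytes */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define PAGE_MASK  (PAGE_SIZE - 1)

int main(void)
{
    uint32_t vaddr = 0x12345678;

    uint32_t page   = vaddr >> PAGE_SHIFT;  /* index into the page table */
    uint32_t offset = vaddr & PAGE_MASK;    /* byte within the page */

    printf("vaddr 0x%08x -> page 0x%05x, offset 0x%03x\n",
           vaddr, page, offset);
    return 0;
}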
Consequences of virtual addresses, pages and page tables
Individual address spaces
As each process has its own page table, every process can pretend that it has access to the entire address space available from the processor. Two processes may well use the same virtual address, but each process's page-table maps that address to a different frame of physical memory, so they do not interfere with each other.
Over time, physical memory becomes fragmented, meaning that there are "holes" of free space in the physical memory.
By assigning a virtual address space to each process, the programmer can leave working around fragmentation up to the operating system.
Protection
The operating system is now the layer of abstraction between the process and physical memory access. If a process accesses a virtual address that is not covered by its page-table, then the operating system knows that that process is doing something wrong and can inform the process it has stepped out of its bounds.
Swap
Instead of pointing to an area of system memory, a page-table entry can be changed to point to a location on a disk. When this page is referenced, the operating system needs to move it from the disk back into system memory (remember, program code can only execute from system memory).
If system memory is full, then another page needs to be kicked out of system memory and put into the swap disk before the required page can be put in memory.
The major issue for swap memory is that loading from the hard disk is very slow (compared to operations done in memory).
Most people will be familiar with sitting in front of the computer whilst the hard disk churns and churns whilst the system remains unresponsive.
mmap
mmap() creates a new mapping in the virtual address space of the calling process. When a file is mmaped it can be accessed just like system RAM.
The address of the new mapping is returned as the result of the call.
The contents of a file mapping are initialized using length bytes starting at offset offset in the file (or other object) referred to by the file descriptor fd.
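A minimal sketch of mapping a file and reading it like ordinary memory (the file name is just an example):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);   /* example file */
    if (fd == -1) {
        perror("open");
        return 1;
    }

    struct stat st;
    if (fstat(fd, &st) == -1) {
        perror("fstat");
        return 1;
    }

    /* Map the whole file read-only; the returned pointer can be used
     * just like ordinary memory. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    fwrite(data, 1, st.st_size, stdout);   /* print the file contents */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}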
Shared memory
If the operating system points two page-table entries to the same frame, that frame will be shared, and any changes that one process makes will be visible to the other.
Disk Cache
When the operating system has to read or write to a file, it first checks if the file is in its memory cache.
Hardware Support
Virtual memory is necessarily quite dependent on the hardware architecture.
Physical vs Virtual Addressing Mode
All processors have some concept of either operating in physical or virtual mode.
- In physical mode, the hardware expects that any address will refer to an address in actual system memory.
- In virtual mode, the hardware knows that addresses will need to be translated to find their physical address.
The TLB
The Translation Lookaside Buffer (or TLB for short) is the main component of the processor responsible for virtual memory. It is a cache of virtual-page to physical-frame translations inside the processor.
The operating system and hardware work together to manage the TLB as the system runs.
When a virtual address is requested of the hardware, the processor looks for the virtual-address to physical-address translation in its TLB.
- If it has a valid translation it can then combine this with the offset portion to go straight to the physical address and complete the load.
- If the processor cannot find a translation in the TLB, it must raise a page fault.
This is similar to an interrupt (as discussed before) which the operating system must handle.
When the operating system gets a page fault, it needs to go through its page-table to find the correct translation and insert it into the TLB.
If you have ever seen a segmentation fault (or a segfault) this is the operating system killing a process that has overstepped its bounds.
TLBs usually use something like a Least Recently Used or LRU algorithm, where the oldest translation that has not been used is ejected in favour of the new one.
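A rough software model of that lookup (the tiny, fully-associative TLB below is an assumed illustration; real TLBs are implemented in hardware):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 8
#define PAGE_SHIFT  12

/* One cached virtual-page to physical-frame translation. */
struct tlb_entry {
    bool     valid;
    uint32_t vpage;      /* virtual page number */
    uint32_t pframe;     /* physical frame number */
    uint32_t last_used;  /* for an LRU replacement policy */
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint32_t tick;

/* Translate a virtual address. Returns true on a hit and writes the
 * physical address to *paddr; a miss would be a page fault the OS handles. */
static bool tlb_lookup(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpage  = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpage == vpage) {
            tlb[i].last_used = ++tick;   /* mark as recently used */
            *paddr = (tlb[i].pframe << PAGE_SHIFT) | offset;
            return true;
        }
    }
    return false;   /* miss: the OS must walk the page-table and refill */
}

int main(void)
{
    /* Pretend the OS inserted one translation: virtual page 0x12345
     * maps to physical frame 0x00042. */
    tlb[0] = (struct tlb_entry){ .valid = true, .vpage = 0x12345,
                                 .pframe = 0x00042, .last_used = 0 };

    uint32_t paddr;
    if (tlb_lookup(0x12345678, &paddr))
        printf("hit: physical address 0x%08x\n", paddr);
    else
        printf("miss: page fault\n");
    return 0;
}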
TLB Management
Linux Specifics
Address Space Layout
Three Level Page Table
Hardware support for virtual memory
x86-64
Itanium
CMA
CMA (the Contiguous Memory Allocator) works by reserving a large memory area at boot time and immediately giving the memory back to the kernel memory subsystem, with the constraint that it can only be handed out either for CMA use or for movable pages.
There is one CMA area for common use.
Subsystems could create further CMA areas for their own use.
CMA must be enabled in the kernel config.
Configure the size of the CMA area
There are three ways to configure the CMA area size: the kernel configuration, the kernel cmdline, and the device tree.
- the device tree overrules what’s on the kernel cmdline
- the kernel cmdline overrules the kernel configuration
linux,cma {
    compatible = "shared-dma-pool";
    reusable;
    size = <0 0x3c000000>;
    alloc-ranges = <0 0x96000000 0 0x3c000000>;
    linux,cma-default;
};
cma=256M
7. The Toolchain
Compiled v Interpreted Programs: Compiled Programs, Interpreted programs; Building an executable; Compiling: The process of compiling, Syntax, Assembly Generation, Optimisation; Assembler; Linker: Symbols, The linking process; A practical example: Compiling, Assembly, Linking, The Executable
8. Behind the process
Review of executable files; Representing executable files: Three Standard Sections, Binary Format, Binary Format History; ELF: ELF File Header, Symbols and Relocations, Sections and Segments, ELF Executables, Libraries, Static Libraries, Shared Libraries, Extending ELF concepts, Debugging, Custom sections, Linker Scripts; ABIs: Byte Order, Calling Conventions; Starting a process: Kernel communication to programs, Starting the program
9. Dynamic Linking
Code Sharing: Dynamic Library Details, Including libraries in an executable; The Dynamic Linker: Relocations, Position Independence; Global Offset Tables: The Global Offset Table, Libraries, The Procedure Lookup Table; Working with libraries and the linker: Library versions, Finding symbols

Operating Systems with Linux
by
John O’Gorman
1 Introduction
1.1 What is an operating system
Layers, from top to bottom: user, application program, operating system, hardware.
1.2 What does an operating system do?
- Ease of use
- Management and sharing of resources
- System compatibility
1.3 Interfaces to operating systems
Layers, from top to bottom: user; application program, GUI, command interpreter; operating system.
1.4 Historical development of operating systems
1.5 Types of operating systems
Distributed systems
A distributed system consists of a group of machines acting together as one. If one machine is heavily loaded, jobs can be transferred to another, idle machine.
1.6 Design of operating systems
Modular design
An operating system can be decomposed into a collection of modules:
- Process Management
- Memory Management
- I/O Management
- File Storage Management
- Network Management
Microkernel
An operating system can be restructured into 2 layers:
- microkernel layer: provides the absolute minimum facilities needed to deal with the hardware.
- emulation layer: emulates a particular operating system interface, using the private interfaces provided by the microkernel layer.
The design philosophy of a microkernel is to separate the implementation of system services from the basic operating rules of the system. It does this by modularising kernel functionality into several independent processes, each running on its own; these processes are called services. All service processes run in separate address spaces, and only the processes that absolutely require privileges run in a privileged execution mode; the rest run in userspace.
This design makes the most essential parts of the kernel simpler to design. The privileged part provides only basic thread management, memory management and inter-process communication, and consists of a simple hardware abstraction layer and the key system calls. All other service processes are moved out to userspace.
Keeping the services independent of each other reduces coupling within the system, makes implementation and debugging easier, and improves portability. It also prevents the failure of a single component from crashing the whole system: the kernel only needs to restart that component, without affecting the other servers, which increases stability. At the same time, the operating system can replace or add service processes as needed, making the functionality more flexible.
Because all service processes run in separate address spaces, a microkernel cannot make direct function calls the way a monolithic kernel does. Instead, an inter-process communication mechanism is established: message passing lets service processes exchange messages, invoke each other's services, and synchronise. This client-server architecture gives microkernels a particular advantage in distributed systems, because the same inter-process communication mechanism can be used between remote systems and local processes.
However, inter-process communication costs more resources and time than a simple function call, and usually involves a context switch between kernel space and user space. This gives message passing problems of latency and limited throughput, so microkernels can suffer from poor performance.
Linux
Linux is a monolithic kernel.
2 Interfaces to an operating system
2.1 The system service interface
Each system service is requested by name (read, write, ...), but internally each service is identified by a number.
Interfaces with hardware
The interface between the OS and the hardware is unique.
Synchronization
The data register is used together with a status register: the hardware updates the status register, and the OS checks it before reading the data register.
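A sketch of that polling pattern (the register addresses and the status bit below are invented for illustration, not taken from any real device):

#include <stdint.h>

/* Hypothetical memory-mapped device registers; the addresses and the
 * "data ready" bit are made up for this example. */
#define DEV_STATUS  ((volatile uint32_t *)0x80001000)
#define DEV_DATA    ((volatile uint32_t *)0x80001004)
#define STATUS_DATA_READY  0x1u

/* Busy-wait until the hardware flags new data, then read the data register. */
uint32_t read_device(void)
{
    while ((*DEV_STATUS & STATUS_DATA_READY) == 0)
        ;   /* poll the status register */

    return *DEV_DATA;
}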