Operating Systems with Linux
Computer Science from the Bottom Up
This book aims to work from operating system fundamentals through to how user applications are compiled and executed.
1. General Unix and Advanced C
Everything is a file!
Implementing abstraction
Abstraction is implemented by what is generically termed an Application Programming Interface (API).
Implementing abstraction with C
A common method used in the Linux kernel and other large C code bases is function pointers.

#include <stdio.h>
#include <stdlib.h>

/* The API to implement */
struct greet_api
{
    int (*say_hello)(char *name);
    int (*say_goodbye)(void);
};

/* Our implementation of the hello function */
int say_hello_fn(char *name)
{
    printf("Hello %s\n", name);
    return 0;
}

/* Our implementation of the goodbye function */
int say_goodbye_fn(void)
{
    printf("Goodbye\n");
    return 0;
}

/* A struct implementing the API */
struct greet_api greet_api = {
    .say_hello = say_hello_fn,
    .say_goodbye = say_goodbye_fn
};

/* main() doesn't need to know anything about how
 * say_hello/say_goodbye work, it just knows that they do */
int main(int argc, char *argv[])
{
    greet_api.say_hello(argv[1]);
    greet_api.say_goodbye();
    printf("%p, %p, %p\n", greet_api.say_hello, say_hello_fn, &say_hello_fn);
    exit(0);
}
Libraries
File Descriptors
The Shell
2. Binary and Number Representation
Binary — the basis of computing: Binary Theory, Hexadecimal, Practical Implications; Types and Number Representation: C Standards, Types, Number Representation
3. Computer Architecture
The CPU: Branching, Cycles, Fetch/Decode/Execute/Store, CISC v RISC; Memory: Memory Hierarchy, Cache in depth; Peripherals and buses: Peripheral Bus concepts, DMA, Other Buses; Small to big systems: Symmetric Multi-Processing, Clusters, Non-Uniform Memory Access, Memory ordering, locking and atomic operations
4. The Operating System
The role of the operating system
Abstraction of hardware
Multitasking
Standardised Interfaces
Security
Performance
Operating System Organisation
The Kernel
Userspace
Each program runs in userspace, talking to the kernel through system calls.
System Calls
Overview
Every system call has a system call number which is known by both userspace and the kernel. For example, both know that system call number 10 is open(), system call number 11 is read(), etc.
The Application Binary Interface (ABI) is very similar to an API, but for the hardware. The ABI will define which registers
- the system call number should be put in
- the arguments for the system call should be put in
To actually perform the system call, all architectures define an instruction that signals to the hardware we wish to make a system call.
- This instruction will tell the hardware to modify the instruction pointer to point to the kernel's system call handler (set up by the OS). So once the userspace program issues the instruction (via libc), it has lost control of the program and passed it over to the kernel.
- Then, the kernel looks in the predefined register for the system call number, and looks it up in a table to see which function it should call. This function is called, does what it needs to do, and places its return value into another register defined by the ABI as the return register.
- The final step is for the kernel to make a jump instruction back to the userspace program, so it can continue from where it left off. The userspace program gets the data it needs from the return register, and continues happily on its way!
Analysing a system call
To illustrate the mechanism behind system calls:

#include <stdio.h>
/* for syscall() */
#include <sys/syscall.h>
#include <unistd.h>
/* system call numbers */
#include <asm/unistd.h>

void function(void)
{
    int pid;
    pid = syscall(__NR_getpid);
}

By convention under Linux, system call numbers are defined in the asm/unistd.h file from the kernel source, and each system call number is given a #define name consisting of __NR_ prefixed to the call name (e.g. __NR_getpid).
- PowerPC system call example

#define _syscall0(type,name) \
type name(void) \
{ \
    __syscall_nr(0, type, name); \
}

- x86 (i386) system call example

#define _syscall0(type,name) \
type name(void) \
{ \
    long __res; \
    __asm__ volatile ("int $0x80" \
        : "=a" (__res) \
        : "0" (__NR_##name)); \
    __syscall_return(type,__res); \
}
Privileges
Applications should not be able to overwrite each other's memory or files, and should only access system resources as dictated by system policy.
Hardware
By forcing the application to request resources through a system call into the kernel, the kernel can enforce rules about what sort of access can be provided.
Other ways of communicating with the kernel
- ioctl()
- File systems: proc, sysfs, debugfs, etc.
5. The Process
What is a process?
We can think of each process as a bundle of elements kept by the kernel to keep track of all these running tasks.
Elements of a process
Process ID
Memory
Stacks are fundamental to function calls. Each time a function is called it gets a new stack frame.
This is an area of memory which usually contains, at a minimum,
- the address to return to when complete
- the input arguments to the function
- space for local variables.
- Each function is allocated a new stack frame with its arguments in a fresh area of memory.
- Each frame contains the address to return to
- Because each frame has a reference to the one before it, a debugger can "walk" backwards, following the pointers up the stack (a stack trace).
The heap is an area of memory for variables whose memory requirements are not known at compile time.
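As a small illustration (the buffer names and sizes below are made up), a C function can mix both kinds of storage:

#include <stdio.h>
#include <stdlib.h>

/* local_buf lives in the function's stack frame and disappears when the
 * function returns; heap_buf lives on the heap because its size is only
 * known at run time. */
void demo(int n)
{
    char local_buf[64];          /* stack: size fixed at compile time */
    char *heap_buf = malloc(n);  /* heap: size decided at run time */

    if (heap_buf == NULL)
        return;

    snprintf(local_buf, sizeof(local_buf), "heap buffer of %d bytes", n);
    printf("%s at %p\n", local_buf, (void *)heap_buf);

    free(heap_buf);              /* heap memory must be released explicitly */
}

int main(void)
{
    demo(128);
    return 0;
}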
File Descriptors
File descriptors are kept by the kernel individually for each process.
Registers
The operating system needs to store a copy of the CPU registers to memory. When it is time for the process to run again, the operating system will copy the register values back from memory to the CPU registers and the process will be right back where it left off.
Kernel State
The kernel needs to keep track of a number of elements for each process.
- Process state: run, wait
- Priority
- Statistics: I/O, CPU usage
Process Hierarchy
The kernel only ever directly starts one process, called init. Its PID is always 1. All other processes can be considered children of this initial process.
PID TTY      STAT   TIME COMMAND
  1 ?        Ss     0:03 /sbin/init splash
  2 ?        S      0:00 [kthreadd]
  3 ?        I<     0:00 [rcu_gp]
  4 ?        I<     0:00 [rcu_par_gp]
  6 ?        I<     0:00 [kworker/0:0H-kblockd]
  9 ?        I<     0:00 [mm_percpu_wq]
 10 ?        S      0:00 [rcu_tasks_rude_]
 11 ?        S      0:00 [rcu_tasks_trace]
 12 ?        S      0:00 [ksoftirqd/0]
 13 ?        I      1:00 [rcu_sched]
 14 ?        S      0:00 [migration/0]
 15 ?        S      0:00 [idle_inject/0]
 16 ?        S      0:00 [cpuhp/0]
 17 ?        S      0:00 [cpuhp/1]
 18 ?        S      0:00 [idle_inject/1]
 19 ?        S      0:00 [migration/1]
 20 ?        S      0:00 [ksoftirqd/1]
 22 ?        I<     0:00 [kworker/1:0H-events_highpri]
 23 ?        S      0:00 [cpuhp/2]
 24 ?        S      0:00 [idle_inject/2]
 25 ?        S      0:00 [migration/2]
 26 ?        S      0:00 [ksoftirqd/2]
 28 ?        I<     0:00 [kworker/2:0H-events_highpri]
 29 ?        S      0:00 [cpuhp/3]
 30 ?        S      0:00 [idle_inject/3]
 31 ?        S      0:00 [migration/3]
 32 ?        S      0:00 [ksoftirqd/3]
 34 ?        I<     0:00 [kworker/3:0H-events_highpri]
 35 ?        S      0:00 [cpuhp/4]
 36 ?        S      0:00 [idle_inject/4]
 37 ?        S      0:00 [migration/4]
 38 ?        S      0:00 [ksoftirqd/4]
...
Fork and Exec
New processes are created by the two related interfaces fork and exec.
Fork
The fork() system call will create a new process that is exactly the same as the parent process. This means all the state that was talked about previously is copied, including open files, register state and all memory allocations, which includes the program code.
The return value from the system call is the only way the process can determine if it was the existing process or a new one:
- the return value to the parent process will be the Process ID (PID) of the child
- the child will get a return value of 0
Exec
The exec() system call is used to run a different program, but it does not create a new process: exec() will replace the contents of the currently running process with the information from a program binary.
Thus, the usual way for a process to launch a new program is to first fork, creating a new process, and then exec (i.e. load into memory and execute) the program binary it is supposed to run.
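A minimal sketch of this fork-then-exec pattern (the program being launched, /bin/ls, is just an illustrative choice):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == -1) {
        perror("fork");
        exit(1);
    }

    if (pid == 0) {
        /* Child: fork() returned 0. Replace ourselves with a new program. */
        execl("/bin/ls", "ls", "-l", (char *)NULL);
        perror("execl");   /* only reached if exec fails */
        exit(1);
    }

    /* Parent: fork() returned the child's PID; wait for it to finish. */
    printf("parent: started child with pid %d\n", pid);
    waitpid(pid, NULL, 0);
    return 0;
}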
How Linux actually handles fork and exec
In the kernel, fork is actually implemented by a clone system call. This clone interface effectively provides a level of abstraction in how the Linux kernel can create processes.
clone allows you to explicitly specify which parts of the new process are copied into the new process, and which parts are shared between the two processes.
For the Linux kernel itself, there's absolutely no difference between what userspace sees as processes (the result of fork) and as threads (the result of pthread_create).
Both are represented by the same data structures (task_struct) and scheduled similarly.
Both processes and threads come into existence through a single Linux system call - clone.
Whatever perceived difference there is between processes and threads on Linux is achieved through passing different flags to clone.
The differences are mostly about what is shared between this new task and the task that started it.
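As a sketch of what that looks like from userspace (a simplified illustration using the glibc clone() wrapper; the flag set chosen here is just an example, not the exact set pthread_create passes):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Entry point for the new task created by clone() */
static int child_fn(void *arg)
{
    printf("child task: pid=%d\n", getpid());
    return 0;
}

int main(void)
{
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack)
        return 1;

    /* Thread-like flags share memory, filesystem info and open files;
     * dropping them gives fork-like behaviour (the child gets copies). */
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD;

    /* The stack grows down on most architectures, so pass its top. */
    pid_t pid = clone(child_fn, stack + stack_size, flags, NULL);
    if (pid == -1) {
        perror("clone");
        return 1;
    }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}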
The init process
On boot, the kernel starts the init process, which then forks and execs the system's boot scripts. These fork and exec more programs, eventually ending up forking a login process. If a process has exited but its parent has not yet collected its return code, the "dead" (no longer running) process still needs to stay around until the return code is collected.
A process in this state is called a zombie .
Context Switching
Context switching refers to the process the kernel undertakes to switch from one process to another.
Scheduling
The part of the kernel that keeps track of all these processes is called the scheduler because it schedules which process should be run next.
Preemptive v co-operative scheduling
Scheduling strategies broadly fall into two categories:
- Co-operative scheduling: the currently running process voluntarily gives up executing to allow another process to run.
- Preemptive scheduling: the running process is interrupted to allow another process to run; the hardware handles the interrupt independently of the running process.
The scheduler can decide which process to run next.
Each process gets a time-slice to run in.
Realtime
Hard realtime systems make guarantees about scheduling decisions, such as the maximum amount of time a process will be interrupted before it can run again.
Nice value
UNIX systems assign each process a nice value. The scheduler looks at the nice value: a process with a higher nice value is being "nicer" to other processes, and so it receives a lower priority.
A brief look at the Linux Scheduler
The Linux scheduler described here is known as the O(1) scheduler. The O(1) scheduler uses a run queue structure as described below.
- The run queue has a number of buckets in priority order and a bitmap that flags which buckets have processes available.
- Picking the next process to run involves reading the bitmap to find the first bucket with processes available, then taking the first process off that bucket's queue (a sketch of this follows the list).
- The scheduler keeps two such structures, an active and expired array for processes that are runnable and those which have utilised their entire time slice respectively.
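As a rough sketch of that structure (a heavily simplified model with assumed sizes, not the kernel's actual code):

#include <stddef.h>
#include <stdio.h>

#define NUM_PRIOS 140
#define WORD_BITS 32
#define NUM_WORDS ((NUM_PRIOS + WORD_BITS - 1) / WORD_BITS)

/* One FIFO list per priority plus a bitmap marking which priority
 * buckets currently hold runnable tasks. */
struct task {
    struct task *next;
    int prio;
};

struct prio_array {
    unsigned int bitmap[NUM_WORDS];  /* bit set => bucket non-empty */
    struct task *queue[NUM_PRIOS];   /* head of each priority bucket */
};

/* Picking the next task is constant-time: scan the (fixed-size) bitmap
 * for the first set bit, then take the head of that bucket's list. */
static struct task *pick_next(struct prio_array *a)
{
    for (size_t w = 0; w < NUM_WORDS; w++) {
        if (a->bitmap[w] == 0)
            continue;
        unsigned int word = a->bitmap[w];
        int bit = 0;
        while (!(word & 1u)) {   /* find the lowest set bit */
            word >>= 1;
            bit++;
        }
        return a->queue[w * WORD_BITS + bit];
    }
    return NULL;  /* no runnable tasks */
}

int main(void)
{
    static struct task t = { .next = NULL, .prio = 10 };
    struct prio_array rq = { 0 };

    rq.queue[t.prio] = &t;
    rq.bitmap[t.prio / WORD_BITS] |= 1u << (t.prio % WORD_BITS);

    struct task *next = pick_next(&rq);
    printf("next task has priority %d\n", next ? next->prio : -1);
    return 0;
}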
The Shell
The primary job of the shell is to help the user handle starting, stopping and otherwise controlling processes running in the system. When you type a command at the prompt of the shell, it will fork a copy of itself and exec the command that you have specified.
Signals
On UNIX there is infrastructure between the kernel and processes called signals which allows a process to receive notification about events important to it.
When the kernel notices that you are touching memory outside your allocation, it will send you the segmentation fault signal.
Usually the process will not have a handler installed for this, and so the default action to terminate the program ensues (hence your program "crashes").
Example
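A minimal sketch of installing a handler for SIGINT, the signal sent when you press Ctrl-C (the choice of signal here is just an illustrative example):

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigint = 0;

/* The handler just records that the signal arrived; doing real work in a
 * signal handler is unsafe, so the main loop acts on the flag instead. */
static void handle_sigint(int signum)
{
    (void)signum;
    got_sigint = 1;
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_sigint;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    printf("press Ctrl-C to send SIGINT\n");
    while (!got_sigint)
        pause();   /* sleep until a signal is delivered */

    printf("caught SIGINT, exiting cleanly\n");
    return 0;
}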
6. Virtual Memory
What Virtual Memory isn't
Swap space is disk space used to hold unused parts of memory that have been swapped out of main memory to free it up (remember, programs can only execute from main memory).
What virtual memory is
The address space of a processor refers to the range of possible addresses that it can access when loading and storing to memory. The address space is limited by the width of the registers, since to load from an address we need to issue a load instruction with that address stored in a register.
For example, registers that are 32 bits wide can hold addresses from 0x00000000 to 0xFFFFFFFF. 2^32 bytes is equal to 4GB, so a 32-bit processor can load or store to up to 4GB of memory.
64 bit computing
Every program compiled in 64-bit mode requires 8-byte pointers, which can increase code and data size, and hence impact both instruction and data cache performance.
Using the address space
While 64-bit processors have 64-bit wide registers, systems generally do not implement all 64 bits for addressing. Thus most architectures define an unimplemented region of the address space which the processor will consider invalid for use.
The result of this is that the total address space is effectively divided into 3 parts: an upper, an invalid, and a lower portion.
virtual memory acts as an abstraction between the address space and the physical memory available in the system.
When a program uses an address, that address does not refer to the bits in an actual physical location in memory.
Therefore, all addresses a program uses are virtual.
The operating system keeps track of virtual addresses and how they are mapped to physical addresses.
When a program does a load or store from an address, the processor and operating system work together to convert this virtual address to the actual address in the system memory chips.
Pages
The total address space is divided into individual pages. Each page has a number of attributes set by the operating system.
Generally, these include read, write and execute permissions for the current page.
For example, the operating system can generally mark the code pages of a process with an executable flag and the processor can choose to not execute any code from pages without this bit set.
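A sketch of how a userspace program can observe these per-page permissions, using mmap() and mprotect() (a minimal example; the page size is queried rather than assumed):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);

    /* Ask the OS for one page, initially readable and writable. */
    char *page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    strcpy(page, "hello");           /* allowed: the page is writable */

    /* Drop the write permission; the kernel updates the page attributes. */
    if (mprotect(page, page_size, PROT_READ) == -1) {
        perror("mprotect");
        return 1;
    }

    printf("still readable: %s\n", page);
    /* page[0] = 'H'; would now fault with SIGSEGV */

    munmap(page, page_size);
    return 0;
}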
Physical Memory
Just as the OS divides the possible address space into pages, it divides the available physical memory up into frames.
The operating system keeps a frame-table, which is a list of all the frames of physical memory.
When memory is allocated to a process, it is marked as used in the frame-table.
How does the operating system know what memory is available?
This information is passed to the operating system by the BIOS during initialisation.
Pages + Frames = Page Tables
It is the job of the operating system to keep track of which virtual page points to which physical frame. This information is kept in a page-table.
Virtual Addresses
The userspace program accesses memory using virtual addresses. A virtual address consists of two parts: the page and an offset into that page.
- Page The page component of the virtual address acts as an index into the page table.
- Offset The last bits of the virtual address are called the offset.
- Virtual Address Translation Virtual address translation refers to the process of finding out which physical frame a virtual page maps to.
For a 4K page you require 12 bits of offset (4K == 4 * 1024 == 4096 == 2^12).
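A sketch of that split for a 32-bit virtual address with 4K pages (the example address is arbitrary):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                      /* 4K pages: 2^12 bytes */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define PAGE_MASK  (PAGE_SIZE - 1)

int main(void)
{
    uint32_t vaddr = 0x12345678;

    uint32_t page   = vaddr >> PAGE_SHIFT;  /* index into the page table */
    uint32_t offset = vaddr & PAGE_MASK;    /* byte within the page */

    printf("vaddr 0x%08x -> page 0x%05x, offset 0x%03x\n",
           vaddr, page, offset);
    return 0;
}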
Consequences of virtual addresses, pages and page tables
Individual address spaces
As each process has its own page table, every process can pretend that it has access to the entire address space available from the processor. Two processes may well use the same virtual address, but each process's page-table maps that address to a different frame of physical memory, so they do not interfere with each other.
Over time, physical memory becomes fragmented, meaning that there are "holes" of free space in the physical memory.
By assigning a virtual address space to each process, the programmer can leave working around fragmentation up to the operating system.
Protection
The operating system is now the layer of abstraction between the process and physical memory access. If a process accesses a virtual address that is not covered by its page-table, then the operating system knows that that process is doing something wrong and can inform the process it has stepped out of its bounds.
Swap
Instead of pointing to an area of system memory, a page-table entry can be changed to point to a location on a disk. When this page is referenced, the operating system needs to move it from the disk back into system memory (remember, program code can only execute from system memory).
If system memory is full, then another page needs to be kicked out of system memory and put into the swap disk before the required page can be put in memory.
The major issue for swap memory is that loading from the hard disk is very slow (compared to operations done in memory).
Most people will be familiar with sitting in front of the computer whilst the hard disk churns and churns whilst the system remains unresponsive.
mmap
mmap() creates a new mapping in the virtual address space of the calling process. When a file is mmaped it can be accessed just like system RAM.
The address of the new mapping is returned as the result of the call.
The contents of a file mapping are initialized using length bytes starting at offset offset in the file (or other object) referred to by the file descriptor fd.
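A minimal sketch of mapping a file and reading it like ordinary memory (the file name is just an example):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);   /* example file */
    if (fd == -1) {
        perror("open");
        return 1;
    }

    struct stat st;
    if (fstat(fd, &st) == -1) {
        perror("fstat");
        return 1;
    }

    /* Map the whole file read-only; the returned pointer can be used
     * just like ordinary memory. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    fwrite(data, 1, st.st_size, stdout);   /* print the file contents */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}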
Shared memory
If the operating system points two page-table entries to the same frame, that frame will be shared, and any changes that one process makes will be visible to the other.
Disk Cache
When the operating system has to read or write to a file, it first checks if the file is in its memory cache.
Hardware Support
Virtual memory is necessarily quite dependent on the hardware architecture.
Physical vs Virtual Addressing Mode
All processors have some concept of either operating in physical or virtual mode.
- In physical mode, the hardware expects that any address will refer to an address in actual system memory.
- In virtual mode, the hardware knows that addresses will need to be translated to find their physical address.
The TLB
The Translation Lookaside Buffer (or TLB for short) is the main component of the processor responsible for virtual memory. It is a cache of virtual-page to physical-frame translations inside the processor.
The operating system and hardware work together to manage the TLB as the system runs.
When a virtual address is requested of the hardware, the processor looks for the virtual-address to physical-address translation in its TLB.
- If it has a valid translation it can then combine this with the offset portion to go straight to the physical address and complete the load.
- If the processor cannot find a translation in the TLB, it must raise a page fault.
This is similar to an interrupt (as discussed before) which the operating system must handle.
When the operating system gets a page fault, it needs to go through its page-table to find the correct translation and insert it into the TLB.
If you have ever seen a segmentation fault (or a segfault) this is the operating system killing a process that has overstepped its bounds.
TLBs usually use something like a Least Recently Used or LRU algorithm, where the oldest translation that has not been used is ejected in favour of the new one.
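A rough software model of that lookup (the tiny, fully-associative TLB below is an assumed illustration; real TLBs are implemented in hardware):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 8
#define PAGE_SHIFT  12

/* One cached virtual-page to physical-frame translation. */
struct tlb_entry {
    bool     valid;
    uint32_t vpage;      /* virtual page number */
    uint32_t pframe;     /* physical frame number */
    uint32_t last_used;  /* for an LRU replacement policy */
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint32_t tick;

/* Translate a virtual address. Returns true on a hit and writes the
 * physical address to *paddr; a miss would be a page fault the OS handles. */
static bool tlb_lookup(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpage  = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpage == vpage) {
            tlb[i].last_used = ++tick;   /* mark as recently used */
            *paddr = (tlb[i].pframe << PAGE_SHIFT) | offset;
            return true;
        }
    }
    return false;   /* miss: the OS must walk the page-table and refill */
}

int main(void)
{
    /* Pretend the OS inserted one translation: virtual page 0x12345
     * maps to physical frame 0x00042. */
    tlb[0] = (struct tlb_entry){ .valid = true, .vpage = 0x12345,
                                 .pframe = 0x00042, .last_used = 0 };

    uint32_t paddr;
    if (tlb_lookup(0x12345678, &paddr))
        printf("hit: physical address 0x%08x\n", paddr);
    else
        printf("miss: page fault\n");
    return 0;
}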
TLB Management
Linux Specifics
Address Space Layout
Three Level Page Table
Hardware support for virtual memory
x86-64
Itanium
CMA
CMA (the Contiguous Memory Allocator) works by reserving a large memory area at boot time and immediately giving the memory back to the kernel memory subsystem, with the constraint that it can only be handed out either for CMA use or for movable pages.
There is one CMA area for common use.
Subsystems could create further CMA areas for their own use.
CMA must be enabled in the kernel config.
Configure the size of the CMA area
There are three ways to configure the CMA area size: the kernel configuration, the kernel cmdline, and the device tree.
- the device tree overrules what’s on the kernel cmdline
- the kernel cmdline overrules the kernel configuration
linux,cma {
    compatible = "shared-dma-pool";
    reusable;
    size = <0 0x3c000000>;
    alloc-ranges = <0 0x96000000 0 0x3c000000>;
    linux,cma-default;
};
cma=256M
7. The Toolchain
Compiled v Interpreted Programs: Compiled Programs, Interpreted programs; Building an executable; Compiling: The process of compiling, Syntax, Assembly Generation, Optimisation; Assembler; Linker: Symbols, The linking process; A practical example: Compiling, Assembly, Linking, The Executable
8. Behind the process
Review of executable files; Representing executable files: Three Standard Sections, Binary Format, Binary Format History; ELF: ELF File Header, Symbols and Relocations, Sections and Segments, ELF Executables, Libraries, Static Libraries, Shared Libraries, Extending ELF concepts, Debugging, Custom sections, Linker Scripts; ABIs: Byte Order, Calling Conventions; Starting a process: Kernel communication to programs, Starting the program
9. Dynamic Linking
Code Sharing: Dynamic Library Details, Including libraries in an executable; The Dynamic Linker: Relocations, Position Independence; Global Offset Tables: The Global Offset Table, Libraries, The Procedure Lookup Table; Working with libraries and the linker: Library versions, Finding symbols

Operating Systems with Linux
by
John O’Gorman
1 Introduction
1.1 What is an operating system
Layers, from top to bottom: user, application program, operating system, hardware.
1.2 What does an operating system do?
- Ease of use
- Management and sharing of resources
- System compatibility
1.3 Interfaces to operating systems
Layers, from top to bottom: user; application program, GUI, command interpreter; operating system.
1.4 Historical development of operating systems
1.5 Types of operating systems
Distributed systems
A distributed system consists of a group of machines acting together as one. If one machine is heavily loaded, jobs can be transferred to another, idle machine.
1.6 Design of operating systems
Modular design
An operating system can be decomposed into a collection of modules:
- Process Management
- Memory Management
- I/O Management
- File Storage Management
- Network Management
Microkernel
An operating system can be restructured into 2 layers:
- microkernel layer: provides the absolute minimum facilities needed to deal with the hardware.
- emulation layer: emulates a particular operating system interface, using the private interfaces provided by the microkernel layer.
The design philosophy of a microkernel is to separate the implementation of system services from the basic operating rules of the system. It does this by modularising kernel functionality into several independent processes, each running on its own; these processes are called services. All service processes run in separate address spaces, and only the processes that absolutely require privileges run in a privileged execution mode; the rest run in userspace.
This design makes the most essential parts of the kernel simpler to design. The privileged part provides only basic thread management, memory management and inter-process communication, and consists of a simple hardware abstraction layer and the key system calls. All other service processes are moved out to userspace.
Keeping the services independent of each other reduces coupling within the system, makes implementation and debugging easier, and improves portability. It also prevents the failure of a single component from crashing the whole system: the kernel only needs to restart that component, without affecting the other servers, which increases stability. At the same time, the operating system can replace or add service processes as needed, making the functionality more flexible.
Because all service processes run in separate address spaces, a microkernel cannot make direct function calls the way a monolithic kernel does. Instead, an inter-process communication mechanism is established: message passing lets service processes exchange messages, invoke each other's services, and synchronise. This client-server architecture gives microkernels a particular advantage in distributed systems, because the same inter-process communication mechanism can be used between remote systems and local processes.
However, inter-process communication costs more resources and time than a simple function call, and usually involves a context switch between kernel space and user space. This gives message passing problems of latency and limited throughput, so microkernels can suffer from poor performance.
Linux
Linux is a monolithic kernel.
2 Interfaces to an operating system
2.1 The system service interface
Each system service is requested by name (read, write, ...), but internally each service is identified by a number.
Interfaces with hardware
The interface between the OS and the hardware is unique.
Synchronization
The data register is used together with a status register: the hardware updates the status register, and the OS checks it before reading the data register.
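A sketch of that polling pattern (the register addresses and the status bit below are invented for illustration, not taken from any real device):

#include <stdint.h>

/* Hypothetical memory-mapped device registers; the addresses and the
 * "data ready" bit are made up for this example. */
#define DEV_STATUS  ((volatile uint32_t *)0x80001000)
#define DEV_DATA    ((volatile uint32_t *)0x80001004)
#define STATUS_DATA_READY  0x1u

/* Busy-wait until the hardware flags new data, then read the data register. */
uint32_t read_device(void)
{
    while ((*DEV_STATUS & STATUS_DATA_READY) == 0)
        ;   /* poll the status register */

    return *DEV_DATA;
}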