Linux Device Drivers -II

Content:
  • 4. Debugging Techniques
  • 5. Concurrency and Race Conditions
  • 6. Advanced Char Driver Operations




Chapter 4: Debugging Techniques


Debugging Support in the Kernel


The kernel provides several debugging features; all of these configuration options are found under the “Kernel hacking” menu.
  • CONFIG_DEBUG_KERNEL
  • This option just makes other debugging options available.
  • CONFIG_DEBUG_SLAB
  • This crucial option turns on several types of checks in the kernel memory allocation functions.
  • CONFIG_DEBUG_PAGEALLOC
  • This can quickly point out certain kinds of memory corruption errors.
  • CONFIG_DEBUG_SPINLOCK
  • The kernel catches operations on uninitialized spin-locks and various other errors (such as unlocking a lock twice).
  • CONFIG_DEBUG_SPINLOCK_SLEEP
  • This option enables a check for attempts to sleep while holding a spinlock.
  • CONFIG_INIT_DEBUG
  • This option enables checks for code that attempts to access initialization-time memory after initialization is complete.
  • CONFIG_DEBUG_INFO
  • To debug the kernel with gdb.
  • CONFIG_MAGIC_SYSRQ
  • Enable the “magic SysRq key” functions (see the “System Hangs” section).
  • CONFIG_DEBUG_STACKOVERFLOW
  • Add explicit overflow checks to the kernel.
  • CONFIG_DEBUG_STACK_USAGE
  • Monitor stack usage and make some statistics available via the magic SysRq key.
  • CONFIG_KALLSYMS(under “General setup/Standard features”)
  • Cause kernel symbol information to be built into the kernel. With this, an oops listing can give you a kernel traceback in context.
  • CONFIG_IKCONFIG, CONFIG_IKCONFIG_PROC(in the “General setup” menu)
  • The full kernel configuration state to be built into the kernel and to be made available via /proc.
  • CONFIG_ACPI_DEBUG(Under “Power management/ACPI.”)
  • This option turns on verbose ACPI (Advanced Configuration and Power Interface) debugging information
  • CONFIG_DEBUG_DRIVER(Under “Device drivers.”)
  • useful for tracking down problems in the low-level support code
  • CONFIG_SCSI_CONSTANTS
  • Build in verbose, symbolic reporting of SCSI error information.
  • CONFIG_INPUT_EVBUG(under “Device drivers/Input device support”)
  • Turn on verbose logging of input events
  • CONFIG_PROFILING(under “Profiling support.”)
  • It's useful for tracking down some kernel hangs and related problems.

Debugging by Printing


printk


printk() is similar to printf(); its format string, while largely compatible with C99, doesn’t follow the exact same specification.
All printk() messages are printed to the kernel log buffer, which is a ring buffer exported to userspace through /dev/kmsg. The usual way to read it is using dmesg.
printk() is typically used like this:

printk(KERN_INFO "Message: %s\n", arg);
where KERN_INFO is the log level.
The log level specifies the importance of a message. The kernel decides whether to show the message immediately (printing it to the current console) depending on the message's log level and the current console_loglevel (a kernel variable).
Name          String  Alias function
KERN_EMERG    "0"     pr_emerg()
KERN_ALERT    "1"     pr_alert()
KERN_CRIT     "2"     pr_crit()
KERN_ERR      "3"     pr_err()
KERN_WARNING  "4"     pr_warn()
KERN_NOTICE   "5"     pr_notice()
KERN_INFO     "6"     pr_info()
KERN_DEBUG    "7"     pr_debug() and pr_devel() if DEBUG is defined
KERN_DEFAULT  ""
KERN_CONT     "c"     pr_cont()

If the log level is omitted, the message is printed with KERN_DEFAULT level.
There are eight possible loglevel strings, defined in the header <linux/kern_levels.h> :


#define KERN_SOH "\001"  /* ASCII Start Of Header */
#define KERN_SOH_ASCII '\001'

#define KERN_EMERG KERN_SOH "0" /* system is unusable */
#define KERN_ALERT KERN_SOH "1" /* action must be taken immediately */
#define KERN_CRIT KERN_SOH "2" /* critical conditions */
#define KERN_ERR KERN_SOH "3" /* error conditions */
#define KERN_WARNING KERN_SOH "4" /* warning conditions */
#define KERN_NOTICE KERN_SOH "5" /* normal but significant condition */
#define KERN_INFO KERN_SOH "6" /* informational */
#define KERN_DEBUG KERN_SOH "7" /* debug-level messages */

#define KERN_DEFAULT KERN_SOH "d" /* the default kernel loglevel */


In the macro expansion, each string prefixes the message with the ASCII SOH character followed by the loglevel digit.
Loglevels range from 0 to 7, with smaller values representing higher priorities.
If the message’s priority value is less than the integer variable console_loglevel, the message is delivered to the console one line at a time (nothing is sent until a trailing newline is provided).

/*
 * Default used to be hard-coded at 7, quiet used to be hardcoded at 4,
 * we're now allowing both to be set from kernel config.
 */
#define CONSOLE_LOGLEVEL_DEFAULT CONFIG_CONSOLE_LOGLEVEL_DEFAULT

int console_printk[4] = {
 CONSOLE_LOGLEVEL_DEFAULT, /* console_loglevel */
 MESSAGE_LOGLEVEL_DEFAULT, /* default_message_loglevel */
 CONSOLE_LOGLEVEL_MIN,  /* minimum_console_loglevel */
 CONSOLE_LOGLEVEL_DEFAULT, /* default_console_loglevel */
};

#define console_loglevel (console_printk[0])
If both klogd and syslogd are running on the system, kernel messages are appended to files under /var/log (typically /var/log/messages, depending on the syslogd configuration). If klogd is not running, the messages can be read via /proc/kmsg.
The text file /proc/sys/kernel/printk hosts four integer values:
  • the current loglevel
  • the default level for messages that lack an explicit loglevel
  • the minimum allowed loglevel
  • the boot-time default loglevel
You can cause all kernel messages to appear at the console by simply entering:

# echo 8 > /proc/sys/kernel/printk

Redirecting Console Messages


By default, the “console” is the current virtual terminal. To select a different virtual terminal to receive messages, you can issue ioctl(TIOCLINUX) on any console device.

How Messages Get Logged


The printk() function writes messages into a circular buffer that is __LOG_BUF_LEN bytes long: a value from 4 KB to 1 MB chosen while configuring the kernel. The function then wakes any process that is sleeping in the syslog() system call or that is reading /proc/kmsg.
The dmesg command can be used to look at the content of the buffer without flushing it; actually, the command returns to stdout the whole content of the buffer, whether or not it has already been read.

The klogd process retrieves kernel messages and dispatches them to syslogd.
In user space, the syslog() library call generates a log message, which is then distributed by syslogd.

Rate Limiting


When using a slow console device (e.g., a serial port), an excessive message rate can also slow down the system or just make it unresponsive.

Printing Device Numbers


The kernel provides a couple of utility macros (defined in <linux/kdev_t.h>) for this purpose:

     int print_dev_t(char *buffer, dev_t dev);
     char *format_dev_t(char *buffer, dev_t dev);

Debugging by Querying


Using the /proc Filesystem


The /proc filesystem is a special, software-created filesystem that is used by the kernel to export information to the world. Each file under /proc is tied to a kernel function that generates the file’s “contents” on the fly when the file is read.
The advantage of a /proc file is that there is no overhead until you actually ask for the data.

Adding files under /proc is discouraged; the recommended way of making information available in new code is via sysfs.

Implementing files in /proc


All modules that work with /proc should include <linux/proc_fs.h> to define the proper functions.
When a process reads from your /proc file, the kernel allocates a page of memory (i.e., PAGE_SIZE bytes) where the driver can write data to be returned to user space. In current kernels, /proc entries are created and removed with this interface:

struct proc_dir_entry *proc_create(const char *name, umode_t mode,
       struct proc_dir_entry *parent,
       const struct proc_ops *proc_ops)

void remove_proc_entry(const char *name, struct proc_dir_entry *parent)



struct proc_dir_entry {
...
 union {
  const struct proc_ops *proc_ops;
  const struct file_operations *proc_dir_ops;
 };
...
};


==== the following interface is deprecated ====

     int (*read_proc)(char *page, char **start, off_t offset, int count, int *eof, void *data);
  • page
  • A pointer to the page of memory (PAGE_SIZE bytes) allocated by the kernel.
  • start
  • Used to indicate where in page the interesting data has been written, so that reads at an arbitrary offset can be implemented.
  • offset
  • The file offset at which reading should start.
  • count
  • The maximum number of bytes to be read.
  • eof
  • An integer that must be set by the driver to signal that it has no more data to return.
  • data
  • A driver-specific data pointer you can use for internal bookkeeping.
This function should return the number of bytes of data actually placed in the page buffer.

Once you have a read_proc function defined, you need to connect it to an entry in the /proc hierarchy. This is done with a call to create_proc_read_entry:

     struct proc_dir_entry *create_proc_read_entry(const char *name, mode_t mode, struct proc_dir_entry *base, read_proc_t *read_proc, void *data);

Here is the call used by scull to make its /proc function available as /proc/scullmem:

     create_proc_read_entry("scullmem", 0 /* default mode */,
             NULL /* parent dir */, scull_read_procmem,
             NULL /* client data */);
Entries in /proc, of course, should be removed when the module is unloaded.

     remove_proc_entry("scullmem", NULL /* parent dir */);

The seq_file interface



/proc methods have become notorious for buggy implementations when the amount of output grows large (over one page). The seq_file interface provides a simple set of functions for implementing large kernel virtual files. It is based on a sequence, which is composed of three functions: start(), next(), and stop(). The seq_file API starts a sequence when a user reads the /proc file.

A sequence begins with a call to start(). If the return value is non-NULL, next() is called. This function is an iterator whose goal is to go through all the data. Each time next() is called, show() is also called; it writes data values into the buffer read by the user. next() is called until it returns NULL; the sequence then ends, and stop() is called.

BE CAREFUL: when a sequence is finished, another one starts. That means that at the end of stop(), start() is called again. This loop finishes when start() returns NULL.


The first step, inevitably, is the inclusion of <linux/seq_file.h>. Then you must create four iterator methods:
  • void *start(struct seq_file *sfile, loff_t *pos)
  • The start method is always called first. pos is an integer position indicating where the reading should start. The interpretation of the position is entirely up to the implementation. The position is often interpreted as a cursor pointing to the next item in the sequence. The scull driver interprets each device as one item in the sequence, so the incoming pos is simply an index into the scull_devices array.
    
    static void *scull_seq_start(struct seq_file *s, loff_t *pos)
         {
             if (*pos >= scull_nr_devs)
                 return NULL;   /* It's the end of the sequence, return NULL to stop reading */
             return scull_devices + *pos;
         }
    
  • void *next(struct seq_file *sfile, void *v, loff_t *pos)
  • Its job is to move the iterator forward to the next position in the sequence. The next() function returns a new iterator, or NULL if the sequence is complete.
  • void stop(struct seq_file *sfile, void *v);
  • This is called when the iteration is done. The scull implementation has no cleanup work to do, so its stop method is empty.
  • int show(struct seq_file *sfile, void *v)
  • In between these calls, the kernel calls the show method to actually output something interesting to user space. It should not use printk, however; instead, there is a special set of functions for seq_file output:
    • int seq_printf(struct seq_file *sfile, const char *fmt, ...)
    • int seq_putc(struct seq_file *sfile, char c)
    • int seq_puts(struct seq_file *sfile, const char *s)
    • int seq_escape(struct seq_file *m, const char *s, const char *esc)
    • int seq_path(struct seq_file *sfile, struct vfsmount *m, struct dentry *dentry, char *esc)
    the show method used in scull is:
    
    static int scull_seq_show(struct seq_file *s, void *v)
    {
        struct scull_dev *dev = (struct scull_dev *) v;
        struct scull_qset *d;
        int i;

        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
        seq_printf(s, "\nDevice %i: qset %i, q %i, sz %li\n",
                (int) (dev - scull_devices), dev->qset,
                dev->quantum, dev->size);
        for (d = dev->data; d; d = d->next) { /* scan the list */
            seq_printf(s, "  item at %p, qset at %p\n", d, d->data);
            if (d->data && !d->next) /* dump only the last item */
                for (i = 0; i < dev->qset; i++) {
                    if (d->data[i])
                        seq_printf(s, "    % 4i: %8p\n", i, d->data[i]);
                }
        }
        up(&dev->sem);
        return 0;
    }
    

    The ioctl Method


    As an alternative to using the /proc filesystem, you can implement a few ioctl commands tailored for debugging.
    You need another program to issue the ioctl and display the results. This program must be written, compiled, and kept in sync with the module you’re testing.

    Debugging by Watching


    Sometimes minor problems can be tracked down by watching the behavior of an application in user space.
    The strace command is a powerful tool that shows all the system calls issued by a user-space program. strace is most useful for pinpointing runtime errors from system calls.
    In the simplest case strace runs the specified command until it exits. It intercepts and records the system calls which are called by a process and the signals which are received by a process. The name of each system call, its arguments and its return value are printed on standard error or to the file specified with the -o option.

    Debugging System Faults


    A fault usually results in the destruction of the current process while the system goes on working.
    The kernel calls the close operation for any open device when a process dies, so your driver can release what was allocated by the open method.

    Oops Messages


    Almost any address used by the processor is a virtual address and is mapped to physical addresses through a complex structure of page tables.
    When an invalid pointer is dereferenced, the paging mechanism fails to map the pointer to a physical address, and the processor signals a page fault to the operating system. If the address is not valid, the kernel is not able to “page in” the missing address; it (usually) generates an oops if this happens while the processor is in supervisor mode.

    An oops displays the processor status at the time of the fault, including the contents of the CPU registers and other seemingly incomprehensible information. The message is generated by printk statements in the fault handler (arch/*/kernel/traps.c).
    In general, when you are confronted with an oops, the first thing to do is to look at the location where the problem happened, which is usually listed separately from the call stack.

    The stack itself is printed in hex form; you see a symbolic call stack only if your kernel is built with the CONFIG_KALLSYMS option turned on.

    Understanding a Kernel Oops!

    Setting up the machine to capture an Oops

    The running kernel should be compiled with CONFIG_DEBUG_INFO, and syslogd should be running.
    Let’s try to generate an Oops message with sample code, and try to understand the dump.
    
    #include <linux/kernel.h>
    #include <linux/module.h> 
    #include <linux/init.h> 
     
    static void create_oops(void) { 
            *(int *)0 = 0;  /* deliberate NULL pointer dereference */
    } 
     
    static int __init my_oops_init(void) { 
            printk(KERN_INFO "oops from the module\n"); 
            create_oops(); 
            return 0; 
    } 
    static void __exit my_oops_exit(void) { 
            printk("Goodbye world\n"); 
    } 
     
    module_init(my_oops_init); 
    module_exit(my_oops_exit);
    
    The associated Makefile for this module is as follows:
    
    obj-m   := oops.o 
    KDIR    := /lib/modules/$(shell uname -r)/build
    PWD     := $(shell pwd) 
    SYM=$(PWD) 
     
    all: 
            $(MAKE) -C $(KDIR) M=$(PWD) modules
    
    Once executed, the module generates the following Oops:
    
    
    

    System Hangs


    You can prevent an endless loop by inserting schedule() invocations at strategic points. The schedule() call invokes the scheduler and, therefore, allows other processes to steal CPU time from the current process. If a process is looping in kernel space due to a bug in your driver, the schedule() calls enable you to kill the process after tracing what is happening.
    Be sure, however, not to call schedule() any time that your driver is holding a spinlock.

    Magic SysRq (Magic System Request) is a kernel hack that enables the kernel to listen to specific key presses and respond by calling a specific kernel function. Magic SysRq is activated via input from the keyboard or a serial line.
    Enable Magic SysRq (CONFIG_MAGIC_SYSRQ and CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE respectively):
    
    Kernel hacking  --->
       [*] Magic SysRq key
       (0x1) Enable magic SysRq key functions by default
    
    On amd64 and x86 systems, the key combination "Alt + SysRq + command-key" results in a Magic SysRq invocation.
    command-key:
    • b
    • Immediately reboot the system without syncing or unmounting the disks.
    • e
    • Send a SIGTERM to all processes, except for init.
    • s
    • Attempts to sync all mounted filesystems.
    • u
    • Attempts to remount all mounted filesystems read-only.
    Other magic SysRq functions exist; see sysrq.txt in the Documentation directory of the kernel source for the full list.
    The file /proc/sysrq-trigger is a write-only entry point, where you can trigger a specific sysrq action by writing the associated command character.

    Debuggers and Related Tools

    Using gdb


    You should be aware that debugging with gdb this way has some definite limitations, because a user-space debugger is only peeking into the address space of a running kernel:
    • You can examine the contents of kernel space
    • But you are not able to:
      • Set breakpoints
      • Step through the kernel code
    To use gdb, you need to provide:
    • the uncompressed ELF kernel executable, vmlinux
    • The kernel sources now contain a script extract-vmlinux which can extract uncompressed vmlinux from a kernel image.
      
      # sudo -i
      # /usr/src/linux-headers-5.4.0-52-generic/scripts/extract-vmlinux /boot/vmlinuz-5.4.0-52-generic > vmlinux
        
      The uncompressed image must include the configuration options CONFIG_PROC_KCORE and CONFIG_DEBUG_INFO.
      To check it,
      
      /boot$ grep CONFIG_PROC_KCORE config-5.4.0-52-generic; grep CONFIG_DEBUG_INFO config-5.4.0-52-generic
      CONFIG_PROC_KCORE=y
      CONFIG_DEBUG_INFO=y  
        
      Your vmlinux file must match your running kernel, or symbols and addresses won’t match and you’ll get garbage during your gdb session.
    • the core file of the image
    • /proc/kcore.
      This file represents the physical memory of the system and is stored in the core file format.
      /proc/kcore displays a size equal to the size of the physical memory (RAM) in bytes, plus a 4 KB header.
      It is a huge file, because it represents the whole kernel address space, which corresponds to all physical memory.

    The values displayed will represent what they were at the time of invoking gdb.

    
    # gdb vmlinux /proc/kcore
    ...
    Reading symbols from vmlinux...(no debugging symbols found)...done.
    [New process 1]
    Core was generated by `BOOT_IMAGE=/boot/vmlinuz-5.4.0-52-generic root=UUID=b914dd46-1239-455a-b372-f54'.
    #0  0x0000000000000000 in ?? ()
    (gdb)   
    
    Note that, in order to have symbol information available for gdb, you must compile your kernel with the CONFIG_DEBUG_INFO option set.
    It's possible that the debug symbols were stripped when your vmlinuz image was packaged (e.g., when using make-kpkg to build a deb package for the Linux kernel).
    You have to use the built vmlinux file under your linux source tree to have those debug symbols.
    From within gdb, you can look at kernel variables by issuing the standard gdb commands. For example,
    
    (gdb) p jiffies 
    
    prints the number of clock ticks from system boot to the current time.
    Numerous capabilities normally provided by gdb are not available when you are working with the kernel. For example,
    • gdb is not able to modify kernel data; it expects to be running the program being debugged under its own control before playing with its memory image.
    • It is also not possible to set breakpoints or watchpoints, or to single-step through kernel functions.

    Linux loadable modules are ELF-format executable images, which typically have three sections relevant in a debugging session:
    • .text
    • This section contains the executable code for the module. The debugger must know where this section is to be able to give tracebacks or set breakpoints. (Neither of these operations is relevant when running the debugger on /proc/kcore, but they can be useful when working with kgdb.)
    • .bss
    • Any variable that is not initialized at compile time ends up in .bss.
    • .data
    • Variables that are initialized at compile time go into .data.

    Debugging the Linux kernel using the GDB

    Debugging Loadable Modules

    Debugging an external Linux kernel module requires some specific actions, especially because the symbols for this module are not part of the main vmlinux symbol file.
    Consider the following loadable module (in the source file hellop.c):
    
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/moduleparam.h>
    
    MODULE_LICENSE("Dual BSD/GPL");
                                                   
    
    static char *whom = "world";
    static int howmany = 1;
    module_param(howmany, int, S_IRUGO);
    module_param(whom, charp, S_IRUGO);
    
    static int hello_init(void)
    {
    	int i;
    	for (i = 0; i < howmany; i++)
    		printk(KERN_ALERT "(%d) Hello, %s\n", i, whom);
    	return 0;
    }
    
    static void hello_exit(void)
    {
    	printk(KERN_ALERT "Goodbye, cruel world\n");
    }
    
    module_init(hello_init);
    module_exit(hello_exit);
    
    • nm - list symbols from object file
    • 
      $ nm hellop.ko
      0000000000000050 T cleanup_module
                       U __fentry__
      0000000000000050 t hello_exit
      0000000000000000 t hello_init
      0000000000000000 d howmany
      0000000000000000 T init_module
      0000000000000000 r _note_6
      0000000000000028 r __param_howmany
                       U param_ops_charp
                       U param_ops_int
      0000000000000008 r __param_str_howmany
      0000000000000000 r __param_str_whom
      0000000000000000 r __param_whom
                       U printk
      0000000000000000 D __this_module
      0000000000000061 r __UNIQUE_ID_depends39
      0000000000000014 r __UNIQUE_ID_howmanytype37
      0000000000000029 r __UNIQUE_ID_license36
      0000000000000076 r __UNIQUE_ID_name37
      000000000000006a r __UNIQUE_ID_retpoline38
      000000000000003e r __UNIQUE_ID_srcversion40
      0000000000000082 r __UNIQUE_ID_vermagic36
      0000000000000000 r __UNIQUE_ID_whomtype38
      0000000000000008 d whom
      	
    • objdump - display information about one or more object files
      • -t, --syms
      • Display the contents of the symbol table(s).(all sections)
        
        $ objdump -t hellop.ko
        
        hellop.ko:     file format elf64-x86-64
        
        SYMBOL TABLE:
        0000000000000000 l    d  .note.gnu.build-id	0000000000000000 .note.gnu.build-id
        0000000000000000 l    d  .text	0000000000000000 .text
        0000000000000000 l    d  .rodata.str1.1	0000000000000000 .rodata.str1.1
        0000000000000000 l    d  __mcount_loc	0000000000000000 __mcount_loc
        0000000000000000 l    d  .modinfo	0000000000000000 .modinfo
        0000000000000000 l    d  __param	0000000000000000 __param
        0000000000000000 l    d  .rodata	0000000000000000 .rodata
        0000000000000000 l    d  .note.Linux	0000000000000000 .note.Linux
        0000000000000000 l    d  .data	0000000000000000 .data
        0000000000000000 l    d  .gnu.linkonce.this_module	0000000000000000 .gnu.linkonce.this_module
        0000000000000000 l    d  .bss	0000000000000000 .bss
        0000000000000000 l    d  .comment	0000000000000000 .comment
        0000000000000000 l    d  .note.GNU-stack	0000000000000000 .note.GNU-stack
        0000000000000000 l    df *ABS*	0000000000000000 hellop.c
        0000000000000000 l     F .text	0000000000000045 hello_init
        0000000000000000 l     O .data	0000000000000004 howmany
        0000000000000008 l     O .data	0000000000000008 whom
        0000000000000050 l     F .text	0000000000000017 hello_exit
        0000000000000000 l     O .modinfo	0000000000000014 __UNIQUE_ID_whomtype38
        0000000000000000 l     O __param	0000000000000028 __param_whom
        0000000000000000 l     O .rodata	0000000000000005 __param_str_whom
        0000000000000014 l     O .modinfo	0000000000000015 __UNIQUE_ID_howmanytype37
        0000000000000028 l     O __param	0000000000000028 __param_howmany
        0000000000000008 l     O .rodata	0000000000000008 __param_str_howmany
        0000000000000029 l     O .modinfo	0000000000000015 __UNIQUE_ID_license36
        0000000000000000 l    df *ABS*	0000000000000000 hellop.mod.c
        000000000000003e l     O .modinfo	0000000000000023 __UNIQUE_ID_srcversion40
        0000000000000061 l     O .modinfo	0000000000000009 __UNIQUE_ID_depends39
        000000000000006a l     O .modinfo	000000000000000c __UNIQUE_ID_retpoline38
        0000000000000076 l     O .modinfo	000000000000000c __UNIQUE_ID_name37
        0000000000000082 l     O .modinfo	000000000000002a __UNIQUE_ID_vermagic36
        0000000000000000 l     O .note.Linux	0000000000000018 _note_6
        0000000000000000 g     O .gnu.linkonce.this_module	0000000000000380 __this_module
        0000000000000050 g     F .text	0000000000000017 cleanup_module
        0000000000000000         *UND*	0000000000000000 __fentry__
        0000000000000000 g     F .text	0000000000000045 init_module
        0000000000000000         *UND*	0000000000000000 printk
        0000000000000000         *UND*	0000000000000000 param_ops_charp
        0000000000000000         *UND*	0000000000000000 param_ops_int
            
            	
    The current gdb session has no idea about those module symbols; you need to tell gdb how to obtain symbol information for the module.
    Information on where the module's various ELF sections were loaded into kernel space can be obtained by looking into the /sys/module/<module_name>/sections directory.
    
    /sys/module/hellop/sections$ ls -al
    total 0
    drwxr-xr-x 2 root root  0 Oct 23 16:16 .
    drwxr-xr-x 6 root root  0 Oct 23 16:16 ..
    -r-------- 1 root root 19 Oct 23 16:36 .data
    -r-------- 1 root root 19 Oct 23 16:37 .gnu.linkonce.this_module
    -r-------- 1 root root 19 Oct 23 16:37 __mcount_loc
    -r-------- 1 root root 19 Oct 23 16:37 .note.gnu.build-id
    -r-------- 1 root root 19 Oct 23 16:37 .note.Linux
    -r-------- 1 root root 19 Oct 23 16:37 __param
    -r-------- 1 root root 19 Oct 23 16:37 .rodata
    -r-------- 1 root root 19 Oct 23 16:37 .rodata.str1.1
    -r-------- 1 root root 19 Oct 23 16:37 .strtab
    -r-------- 1 root root 19 Oct 23 16:37 .symtab
    -r-------- 1 root root 19 Oct 23 16:16 .text
    
    /sys/module/hellop/sections$ sudo cat .text .rodata .data
    0xffffffffc09e4000
    0xffffffffc09e50b8
    0xffffffffc09e6000
    
    The module's symbol file is added to the gdb environment with the add-symbol-file command:
    
    (gdb) add-symbol-file <module.ko_path> <address of module's text section> \
      -s .data <address of module's data section>  -s .bss <address of module's bss section if available>
        
    We can now tell gdb all about those sections:
    
    (gdb) add-symbol-file /home/jerry/ldd4/misc-modules/hellop.ko 0xffffffffc09e4000 -s .rodata 0xffffffffc09e50b8 -s .data 0xffffffffc09e6000
    (y or n) y
    Reading symbols from /home/jerry/ldd4/misc-modules/hellop.ko...(no debugging symbols found)...done.
    
    Debug symbols were not enabled in the built module. To enable debug support, add the following to the Makefile and then build the module again:
    
    ccflags-y += -g
    
    Load the debug symbol again:
    
    (gdb) add-symbol-file /home/jerry/ldd4/misc-modules/hellop.ko 0xffffffffc09e4000 -s .rodata 0xffffffffc09e50b8 -s .data 0xffffffffc09e6000
    ...
    add symbol table from file "/home/jerry/ldd4/misc-modules/hellop.ko" at
    	.text_addr = 0xffffffffc09e4000
    	.rodata_addr = 0xffffffffc09e50b8
    	.data_addr = 0xffffffffc09e6000
    (y or n) y
    Reading symbols from /home/jerry/ldd4/misc-modules/hellop.ko...done.
    (gdb) p howmany
    $1 = 1
    (gdb) p whom
    $2 = 0xffffffffc09e504e "world"
    (gdb) 
    

    Chapter 5: Concurrency and Race Conditions


    Device driver programmers must now factor concurrency into their designs from the beginning, and they must have a strong understanding of the facilities provided by the kernel for concurrency management.

    Race conditions are a result of uncontrolled access to shared data. When the wrong access pattern happens, something unexpected results.

    Concurrency and Its Management


    Race conditions come about as a result of shared access to resources.
    So the first rule of thumb to keep in mind as you design your driver is to avoid shared resources whenever possible. The most obvious application of this idea is to avoid the use of global variables.

    Sharing is a fact of life. The usual technique for access management is called locking or mutual exclusion — making sure that only one thread of execution can manipulate a shared resource at any time.

    There are two main types of kernel locks.
    The fundamental type is the spinlock (include/asm/spinlock.h), which is a very simple single-holder lock: if you can't get the spinlock, you keep trying (spinning) until you can (busy wait). Spinlocks are very small and fast, and can be used anywhere.
    The second type is a mutex (include/linux/mutex.h), if you can't lock a mutex, your task will suspend itself, and be woken up when the mutex is released. This means the CPU can do something else while you are waiting.

    Cheat Sheet For Locking:
    1. If you are in a process context (any syscall) and want to lock other processes out, use a mutex. You can take a mutex and sleep (e.g., in copy_from_user() or kmalloc(x, GFP_KERNEL)). This is the common case for a kernel driver/module providing services to user processes.
    2. Otherwise (i.e., the data can be touched in an interrupt), use spin_lock_irqsave() and spin_unlock_irqrestore().
    3. Avoid holding a spinlock for more than 5 lines of code and across any function call (except accessors like readb()).

    Spinlock


    Unlike semaphores, spinlocks may be used in code that cannot sleep, such as interrupt handlers.
    The spinlock is a low-level synchronization mechanism which can be in two states:
    • acquired
    • released
    Code wishing to take out a particular lock tests the relevant bit. If the lock is available, the “locked” bit is set and the code continues into the critical section. If, instead, the lock has been taken by somebody else, the code goes into a tight loop where it repeatedly checks the lock until it becomes available.
    The “test and set” operation must be done in an atomic manner so that only one thread can obtain the lock, even if several are spinning at any given time.

    Locks and Uniprocessor Kernels


    Under non-preemptive scheduling, once the CPU has been allocated to a process, the process keeps the CPU until it releases the CPU either by terminating or by switching to the waiting state.

    For kernels compiled:

    • without CONFIG_SMP and without CONFIG_PREEMPT
    • spinlocks do not exist at all.
      This is an excellent design decision: when no-one else can run at the same time, there is no reason to have a lock.
    • compiled without CONFIG_SMP, but CONFIG_PREEMPT is set
    • spinlocks simply disable preemption, which is sufficient to prevent any races.
      For most purposes, we can think of preemption as equivalent to SMP, and not worry about it separately.
    You should always test your locking code with CONFIG_SMP and CONFIG_PREEMPT enabled, even if you don't have an SMP test box, because it will still catch some kinds of locking bugs.

    Introduction to the Spinlock API


    The spinlock is represented by the spinlock_t type in the Linux kernel:
    
    typedef struct raw_spinlock {
    	arch_spinlock_t raw_lock;
    #ifdef CONFIG_DEBUG_SPINLOCK
    	unsigned int magic, owner_cpu;
    	void *owner;
    #endif
    #ifdef CONFIG_DEBUG_LOCK_ALLOC
    	struct lockdep_map dep_map;
    #endif
    } raw_spinlock_t;
    
    typedef struct spinlock {
    	union {
    		struct raw_spinlock rlock;
    
    #ifdef CONFIG_DEBUG_LOCK_ALLOC
    # define LOCK_PADSIZE (offsetof(struct raw_spinlock, dep_map))
    		struct {
    			u8 __padding[LOCK_PADSIZE];
    			struct lockdep_map dep_map;
    		};
    #endif
    	};
    } spinlock_t;
    
    where the arch_spinlock_t represents architecture-specific spinlock implementation.
    • include
    • 
      #include <linux/spinlock.h>
      
    • init
    • 
      DEFINE_SPINLOCK(my_lock);               // at compile time (SPIN_LOCK_UNLOCKED was removed from modern kernels)
      void spin_lock_init(spinlock_t *lock);  // at run time
      
    • lock
    • 
      void spin_lock(spinlock_t *lock);
      
    • unlock
    • 
      void spin_unlock(spinlock_t *lock);
      

    Spinlocks and Atomic Context


    Any code holding a spinlock must be atomic.
    Whenever the kernel code holds a spinlock, preemption is disabled on the relevant processor.

    Many kernel functions can sleep (i.e., call schedule()), and this fact is not always well documented. For example:

    • Copying data to or from user space
    • The required user-space page may need to be swapped in from the disk before the copy can proceed. This operation clearly requires a sleep.
    • Memory allocation
    • kmalloc() can decide to give up the processor, and wait for more memory to become available unless it is explicitly told not to.

    The Spinlock Functions


    There are actually four functions that can lock a spinlock:
    • void spin_lock(spinlock_t *lock)
    • This is used for locking between kernel threads (user context) or bottom halves.
    • void spin_lock_irqsave(spinlock_t *lock, unsigned long flags)
    • This disables interrupts (on the relevant processor only) before acquiring the spinlock; the previous interrupt state is stored in flags.
      ( Interrupts can still occur but are routed to other available CPUs, on which the interrupt handler can spin until the lock becomes available. )
    • void spin_lock_irq(spinlock_t *lock)
    • This is used when you share data between a hardware ISR and bottom halves, because you have to disable the IRQ before acquiring the lock.
      This disables interrupts on the relevant CPU, so it is only safe when you know that interrupts were not already disabled before the acquisition of the lock. As the kernel grows in size and kernel code paths become increasingly hard to predict, it is suggested you not use this version unless you really know what you are doing.
    • void spin_lock_bh(spinlock_t *lock)
    • This is used when you share data between a bottom half and user context (like a kernel thread).
      It disables software interrupts before acquiring the lock, but leaves hardware interrupts enabled (allowing them to be serviced).
      This has the effect of preventing softirqs, tasklets, and bottom halves from running on the relevant CPU.
    The corresponding functions to release a spinlock:
    • void spin_unlock(spinlock_t *lock)
    • void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
    • void spin_unlock_irq(spinlock_t *lock)
    • void spin_unlock_bh(spinlock_t *lock)
    There is also a set of nonblocking spinlock operations:
    • int spin_trylock(spinlock_t *lock)
    • int spin_trylock_bh(spinlock_t *lock)

    Spinlock in Linux kernel example programming
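
    A minimal sketch of the usual pattern, assuming a hypothetical device (demo_dev, demo_isr, and the other names are made up for illustration; this is not a complete module):

```c
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/interrupt.h>

struct demo_dev {
    spinlock_t lock;
    int count;          /* shared between ISR and process context */
};

static struct demo_dev dev;

static int __init demo_setup(void)
{
    spin_lock_init(&dev.lock);
    dev.count = 0;
    return 0;
}

/* Called from the interrupt handler: plain spin_lock suffices here,
 * because interrupts are already disabled on this CPU. */
static irqreturn_t demo_isr(int irq, void *data)
{
    spin_lock(&dev.lock);
    dev.count++;
    spin_unlock(&dev.lock);
    return IRQ_HANDLED;
}

/* Called from process context (e.g. a read method): interrupts must be
 * disabled while the lock is held, or demo_isr could spin forever on
 * this CPU waiting for a lock we can never release. */
static int demo_get_count(void)
{
    unsigned long flags;
    int val;

    spin_lock_irqsave(&dev.lock, flags);
    val = dev.count;
    spin_unlock_irqrestore(&dev.lock, flags);
    return val;
}
```

    Note that the critical sections are only a few lines long and make no function calls, per the cheat sheet above.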

    Reader-Writer Spin Locks


    Sometimes, lock usage can be clearly divided into readers and writers.
    For example, consider a list that is both updated and searched. When the list is updated (written to), it is important that no other threads of execution concurrently write to or read from the list. Writing demands mutual exclusion. On the other hand, when the list is searched (read from), it is only important that nothing else write to the list. Multiple concurrent readers are safe so long as there are no writers.
    Linux provides reader-writer spin locks. Reader-writer spin locks provide separate reader and writer variants of the lock. One or more readers can concurrently hold the reader lock. The writer lock, conversely, can be held by at most one writer with no concurrent readers.

    Usage is similar to spin locks.
    • init
    • 
      #include <linux/spinlock.h> /* pulls in rwlock.h, which must not be included directly */
      
      DEFINE_RWLOCK(mr_rwlock);   /* RW_LOCK_UNLOCKED was removed from modern kernels */
      
    • in the reader code path
    • 
      read_lock(&mr_rwlock);
      /* critical section (read only) ... */
      read_unlock(&mr_rwlock);
      
    • in the writer code path
    • 
      write_lock(&mr_rwlock);
      /* critical section (read and write) ... */
      write_unlock(&mr_rwlock);
      
    A final important consideration in using the Linux reader-writer spin locks is that they favor readers over writers. If the read lock is held and a writer is waiting for exclusive access, readers that attempt to acquire the lock will continue to succeed. The spinning writer does not acquire the lock until all readers release the lock. Therefore, a sufficient number of readers can starve pending writers. This is important to keep in mind when designing your locking.

    Spin locks provide a very quick and simple lock. The spinning behavior is optimal for short hold times and code that cannot sleep (interrupt handlers, for example). In cases where the sleep time might be long or you potentially need to sleep while holding the lock, the semaphore is a solution.


    Semaphores and Mutexes


    We must set up critical sections: code that can be executed by only one thread at any given time.
    Not all critical sections are the same; some cannot always be entered immediately, so the kernel provides different primitives for different needs.
    When a Linux process reaches a point where it cannot make any further progress, it goes to sleep (or “blocks”), yielding the processor to somebody else until some future time when it can get work done again.

    A semaphore is a single integer value combined with a pair of functions that are typically called P and V. A process wishing to enter a critical section will call P on the relevant semaphore; if the semaphore’s value is greater than zero, that value is decremented by one and the process continues. If the semaphore’s value is 0 (or less), the process must wait until somebody else releases the semaphore. Unlocking a semaphore is accomplished by calling V; this function increments the value of the semaphore and, if necessary, wakes up processes that are waiting for the semaphore.
    When semaphores are used for mutual exclusion — keeping multiple processes from running within a critical section simultaneously — their value will be initially set to 1.

    Such a semaphore can be held only by a single process or thread at any given time. A semaphore used in this mode is sometimes called a mutex (short for “mutual exclusion”).

    The Linux Semaphore Implementation


    Semaphores in Linux are sleeping locks. When a task attempts to acquire a semaphore that is already held, the semaphore places the task onto a wait queue and puts the task to sleep. The processor is then free to execute other code. When the process holding the semaphore releases the lock, one of the tasks on the wait queue is awakened so that it can then acquire the semaphore.

    To use semaphores, kernel code must include <linux/semaphore.h>. The relevant type is struct semaphore:
    
    /* Please don't access any members of this structure directly */
    struct semaphore {
     raw_spinlock_t  lock;
     unsigned int  count;
     struct list_head wait_list;
    };
    
    #define __SEMAPHORE_INITIALIZER(name, n)    \
    {         \
     .lock  = __RAW_SPIN_LOCK_UNLOCKED((name).lock), \
     .count  = n,      \
     .wait_list = LIST_HEAD_INIT((name).wait_list),  \
    }
    
    #define DEFINE_SEMAPHORE(name) \
     struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
    
    static inline void sema_init(struct semaphore *sem, int val)
    {
     static struct lock_class_key __key;
     *sem = (struct semaphore) __SEMAPHORE_INITIALIZER(*sem, val);
     lockdep_init_map(&sem->lock.dep_map, "semaphore->lock", &__key, 0);
    }
    
    extern void down(struct semaphore *sem);
    extern int __must_check down_interruptible(struct semaphore *sem);
    extern int __must_check down_killable(struct semaphore *sem);
    extern int __must_check down_trylock(struct semaphore *sem);
    extern int __must_check down_timeout(struct semaphore *sem, long jiffies);
    extern void up(struct semaphore *sem);
    
    Usually, semaphores are used in a mutex mode.
    “down” refers to the fact that the function decrements the value of the semaphore, perhaps after putting the caller to sleep to wait for the semaphore to become available. There are three versions of down:
    • void down(struct semaphore *sem)
    • Acquires the semaphore.
      Use of this function is deprecated, please use down_interruptible() or down_killable() instead.
    • int down_interruptible(struct semaphore *sem)
    • Acquire the semaphore unless interrupted.
      If no more tasks are allowed to acquire the semaphore, calling this function will put the task to sleep. If the sleep is interrupted by a signal, this function will return -EINTR. If the semaphore is successfully acquired, this function returns 0.
    • int down_trylock(struct semaphore *sem)
    • Try to acquire the semaphore, without waiting.
      Returns 0 if the semaphore has been acquired successfully or 1 if it cannot be acquired. Unlike mutex_trylock, this function can be used from interrupt context, and the semaphore can be released by any task or interrupt.

    Once a thread has successfully called one of the versions of down, it is said to be “holding” the semaphore.
    Once up has been called, the caller no longer holds the semaphore.

    Using Semaphores


    Let’s define a structure:
    
         struct hello_dev {
            char   priv_data;
            struct semaphore sem;     /* mutual exclusion semaphore     */
            struct cdev cdev;     /* Char device structure      */
    };
    
    We have chosen to use a separate semaphore for each virtual device.

    Make sure that no accesses to the hello_dev data structure are made without holding the semaphore when the driver's method is called:
    
    if (down_interruptible(&dev->sem))
                return -ERESTARTSYS;
    
    The Driver's methods must release the semaphore before leaving:
    
    out:
           up(&dev->sem);
           return retval;
    

    Reader-Writer Semaphores


    Semaphores, like spin locks, also come in a reader-writer flavor. All reader-writer semaphores are mutexes (that is, their usage count is one).
    Reader-writer semaphores are represented by the struct rw_semaphore type, which is declared in <linux/rwsem.h>.
    • init
    • 
      static DECLARE_RWSEM(name); // or init_rwsem(struct rw_semaphore *sem)
      
      where name is the declared name of the new semaphore.
    • in reader's path
    • 
      /* attempt to acquire the semaphore for reading ... */
      down_read(&mr_rwsem);
      
      /* critical region (read only) ... */
      
      /* release the semaphore */
      up_read(&mr_rwsem);
      
    • in writer's path
    • 
      /* attempt to acquire the semaphore for writing ... */
      down_write(&mr_rwsem);
      
      /* critical region (read and write) ... */
      
      /* release the semaphore */
      up_write(&mr_rwsem);
      
      
    Reader-writer semaphores, like reader-writer spin locks, should not be used unless there is a clear separation between the write paths and the read paths in your code. Supporting the reader-writer mechanisms has a cost, and it is worthwhile only if your code naturally splits along a reader/writer boundary.

    Semaphores were the usual sleeping lock in kernels up to 2.6.16; since then, the dedicated mutex type has replaced most mutual-exclusion uses of semaphores, <linux/mutex.h>:
    
    /*
     * Simple, straightforward mutexes with strict semantics:
     *
     * - only one task can hold the mutex at a time
     * - only the owner can unlock the mutex
     * - multiple unlocks are not permitted
     * - recursive locking is not permitted
     * - a mutex object must be initialized via the API
     * - a mutex object must not be initialized via memset or copying
     * - task may not exit with mutex held
     * - memory areas where held locks reside must not be freed
     * - held mutexes must not be reinitialized
     * - mutexes may not be used in hardware or software interrupt contexts such as tasklets and timers
     */
    struct mutex {
     atomic_long_t  owner;
     spinlock_t  wait_lock;
    #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
     struct optimistic_spin_queue osq; /* Spinner MCS lock */
    #endif
     struct list_head wait_list;
    #ifdef CONFIG_DEBUG_MUTEXES
     void   *magic;
    #endif
    #ifdef CONFIG_DEBUG_LOCK_ALLOC
     struct lockdep_map dep_map;
    #endif
    };
    

    • mutex_init(struct mutex *lock)
    • Initialize the mutex to unlocked state.
      
      #define mutex_init(mutex)      \
      do {         \
       static struct lock_class_key __key;    \
               \
       __mutex_init((mutex), #mutex, &__key);    \
      } while (0)
      
      void
      __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
      {
       atomic_long_set(&lock->owner, 0);
       spin_lock_init(&lock->wait_lock);
       INIT_LIST_HEAD(&lock->wait_list);
      #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
       osq_lock_init(&lock->osq);
      #endif
      
       debug_mutex_init(lock, name, key);
      }
      EXPORT_SYMBOL(__mutex_init);
      
      The # operator ( known as the "Stringification Operator" which is defined in the C preprocessor ) converts a token into a C string literal, escaping any quotes or backslashes appropriately.
    • void __sched mutex_lock(struct mutex * lock)
    • Lock the mutex exclusively for this task.
      If the mutex is not available right now, it will sleep until it can get it. The mutex must later on be released by the same task that acquired it. Recursive locking is not allowed. The task may not exit without first unlocking the mutex. Also, kernel memory where the mutex resides must not be freed with the mutex still locked. The mutex must first be initialized (or statically defined) before it can be locked. memset-ing the mutex to 0 is not allowed. ( The CONFIG_DEBUG_MUTEXES .config option turns on debugging checks that will enforce the restrictions and will also do deadlock debugging. ) This function is similar to (but not equivalent to) down.
    • void __sched mutex_unlock(struct mutex * lock);
    • Unlock a mutex that has been locked by this task previously. This function must not be used in interrupt context. Unlocking of a not locked mutex is not allowed. This function is similar to (but not equivalent to) up.
    • int mutex_is_locked (struct mutex * lock)
    • Returns 1 if the mutex is locked, 0 if unlocked.
    • ww_mutex_unlock
    • Unlock a mutex that has been locked by this task previously with any of the ww_mutex_lock* functions (with or without an acquire context). It is forbidden to release the locks after releasing the acquire context.
    • int __sched mutex_lock_interruptible (struct mutex * lock)
    • Lock the mutex like mutex_lock, and return 0 if the mutex has been acquired or sleep until the mutex becomes available. If a signal arrives ( thread/process receives the signal ) while waiting for the lock then this function returns -EINTR. This function is similar to (but not equivalent to) down_interruptible.
    • mutex_trylock
    • Try to acquire the mutex atomically.
      Returns 1 if the mutex has been acquired successfully, and 0 on contention.
    • atomic_dec_and_mutex_lock
    • return true and hold lock if we dec to 0, return false otherwise
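
    A hedged sketch of the common pattern, assuming hypothetical names (demo_lock, demo_config, and demo_set_config are made up for illustration; this is not a complete driver):

```c
/* Process context only: mutex_lock_interruptible() may sleep. */
#include <linux/mutex.h>
#include <linux/errno.h>

static DEFINE_MUTEX(demo_lock);
static int demo_config;

static long demo_set_config(int val)
{
    if (mutex_lock_interruptible(&demo_lock))
        return -ERESTARTSYS;     /* a signal interrupted the wait */
    demo_config = val;           /* critical section */
    mutex_unlock(&demo_lock);
    return 0;
}
```

    Using the interruptible variant lets a user abort a process stuck waiting on the lock, at the cost of having to handle the error return.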

    Locking Traps


    In this section, we take a quick look at things that can go wrong.

    Ambiguous Rules


    When you create a resource that can be accessed concurrently, you should define which lock will control that access.
    In the case of scull, the design decision taken was to require all functions invoked directly from system calls to acquire the semaphore applying to the device structure that is accessed.

    Lock Ordering Rules


    Try to avoid situations where you need more than one lock.

    Fine- Versus Coarse-Grained Locking


    The first SMP-capable Linux kernels contained exactly one spinlock. The big kernel lock turned the entire kernel into one large critical section; only one CPU could be executing kernel code at any given time.
    A modern kernel can contain thousands of locks, each protecting one small resource. This sort of fine-grained locking can be good for scalability; it allows each processor to work on its specific task without contending for locks used by other processors.

    Locking in a device driver is usually relatively straightforward; you can have a single lock that covers everything you do, or you can create one lock for every device instance you manage.

    Alternatives to Locking


    Lock-Free Algorithms


    The circular buffer algorithm involves a producer placing data into one end of an array, while the consumer removes data from the other. So a circular buffer requires an array and two index values to track where the next new value goes and which value should be removed from the buffer next. The producer is the only thread that is allowed to modify the write index and the array location it points to.

    There is a generic circular buffer implementation available in the kernel; see <linux/kfifo.h> for information on how to use it.

    Atomic Variables


    In general, safe access to a global variable is ensured by using atomic operations.
    Atomic operations provide instructions that execute atomically without interruption.
    The kernel provides two sets of interfaces for atomic operations: one that operates on integers and another that operates on individual bits.

    Atomic Integer Operations

    The atomic integer methods operate on an atomic integer type called atomic_t, defined in <asm/atomic.h>.
    An atomic_t holds an int value on all supported architectures. Because of the way this type works on some processors, however, the full integer range may not be available; thus, you should not count on an atomic_t holding more than 24 bits.
    
    atomic_t v;                   /* define v */
    atomic_t u = ATOMIC_INIT(0);     /* define u and initialize it to zero */
    
    atomic_set(&v, 4);     /* v = 4 (atomically) */
    atomic_add(2, &v);     /* v = v + 2 = 6 (atomically) */
    atomic_inc(&v);        /* v = v + 1 = 7 (atomically) */
    
    printk("%d\n", atomic_read(&v)); /* will print "7" */
    
    atomic_t data items must be accessed only through these functions. If you pass an atomic item to a function that expects an integer argument, you’ll get a compiler error.

    Atomic Bitwise Operations

    In addition to atomic integer operations, the kernel also provides a family of functions that operate at the bit level. Not surprisingly, they are architecture specific and defined in <asm/bitops.h>.
    The operations work on generic memory addresses: the arguments are a bit index and a pointer to the word whose bit is to be operated on.
    
    unsigned long word = 0;
    
    set_bit(0, &word);       /* bit zero is now set (atomically) */
    set_bit(1, &word);       /* bit one is now set (atomically) */
    printk("%lu\n", word);   /* will print "3" */
    clear_bit(1, &word);     /* bit one is now unset (atomically) */
    change_bit(0, &word);    /* bit zero is flipped; now it is unset (atomically) */
    
    /* atomically sets bit zero and returns the previous value (zero) */
    if (test_and_set_bit(0, &word)) {
            /* never true     */
    }
    
    /* the following is legal; you can mix atomic bit instructions with normal C */
    word = 7;
    

    seqlocks


    Read-Copy-Update




    Chapter 6: Advanced Char Driver Operations


    ioctl


    In user space, the ioctl system call has the following prototype:
    
     int ioctl(int fd, unsigned long cmd, ...);
    
    the dots in the prototype represent a single optional argument, traditionally identified as char *argp.
    Using a pointer in the 3rd argument is the way to pass arbitrary data to the ioctl call; the device is then able to exchange any amount of data with user space.
    Each ioctl command is, essentially, a separate, usually undocumented system call.

    The ioctl driver method has this prototype (in modern kernels it has been replaced in struct file_operations by unlocked_ioctl(struct file *, unsigned int, unsigned long), which drops the inode argument):
    
    int (*ioctl) (struct inode *inode, struct file *filp,
                       unsigned int cmd, unsigned long arg)
    
    The inode and filp pointers are the values corresponding to the file descriptor fd passed on by the application.
    The cmd argument is passed from the user unchanged, and the optional arg argument is passed in the form of an unsigned long. If the invoking program doesn’t pass a third argument, the arg value received by the driver operation is undefined.

    Choosing the ioctl Commands


    The ioctl command numbers should be unique across the system in order to prevent errors caused by issuing the right command to the wrong device. If each ioctl number is unique, the application gets an EINVAL error rather than succeeding in doing something unintended.
    To help programmers create unique ioctl command codes, these codes have been split up into several bitfields:
    • type
    • (magic number). Just choose one number (after consulting ioctl-number.txt) and use it throughout the driver. This field is eight bits wide (_IOC_TYPEBITS).
    • number
    • The ordinal (sequential) number. It’s eight bits (_IOC_NRBITS) wide.
    • direction
    • The direction of data transfer, _IOC_NONE (no data transfer), _IOC_READ, _IOC_WRITE, and _IOC_READ|_IOC_WRITE (data is transferred both ways). The data transfer is seen from the application’s point of view; _IOC_READ means reading from the device, so the driver must write to user space.
    • size
    • The size of user data involved. The width of this field is architecture dependent (_IOC_SIZEBITS).
    You should first check include/asm/ioctl.h and Documentation/ioctl-number.txt for ioctl commands that are already defined.
    The header file <asm/ioctl.h>, which is included by <linux/ioctl.h>, defines macros that help set up the ioctl command numbers:

    
    #define _IO(type,nr)  _IOC(_IOC_NONE,(type),(nr),0)
    #define _IOR(type,nr,size) _IOC(_IOC_READ,(type),(nr),(_IOC_TYPECHECK(size)))
    #define _IOW(type,nr,size) _IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
    #define _IOWR(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
    
    • _IO(type,nr)
    • (for a command that has no argument)
    • _IOR(type,nr,datatype)
    • userland is reading and kernel is writing(for applications to read data from the driver)
    • _IOW(type,nr,datatype)
    • userland is writing and kernel is reading(for application to write data to the driver)
    • _IOWR(type,nr,datatype)
    • (for bidirectional transfers).
    The type and number fields are passed as arguments, and the size field is derived by applying sizeof to the datatype argument.
    Macros to decode the numbers: _IOC_DIR(nr), _IOC_TYPE(nr), _IOC_NR(nr), and _IOC_SIZE(nr).
    Here is how some ioctl commands are defined in scull:
    
    #define SCULL_IOC_MAGIC  'k'
    
    #define SCULL_IOCRESET    _IO(SCULL_IOC_MAGIC, 0)
    #define SCULL_IOCSQUANTUM _IOW(SCULL_IOC_MAGIC,  1, int)
    .
    #define SCULL_IOCGQSET    _IOR(SCULL_IOC_MAGIC,  6, int)
    .
    #define SCULL_IOCXQSET    _IOWR(SCULL_IOC_MAGIC,10, int)
    .
    #define SCULL_IOC_MAXNR 14
    

    The Return Value


    It’s pretty common to return -EINVAL in response to an invalid ioctl command.

    The Predefined Commands


    A few commands are recognized by the kernel. Note that these commands, when applied to your device, are decoded before your own file operations are called.
    Device driver writers are interested only in the group of commands issued on any file, whose magic number is “T.”
    The following ioctl commands are predefined for any file, including device-special files:
    • FIOCLEX
    • Set the close-on-exec flag (File IOctl CLose on EXec).
    • FIONCLEX
    • Clear the close-on-exec flag (File IOctl Not CLose on EXec).
    • FIOASYNC
    • Set or reset asynchronous notification for the file.
    • FIOQSIZE
    • This command returns the size of a file or directory; when applied to a device file, however, it yields an ENOTTY error return.
    • FIONBIO
    • “File IOctl Non-Blocking I/O”. This call modifies the O_NONBLOCK flag in filp->f_flags.

    Using the ioctl Argument


    If the extra argument is an integer, it’s easy: it can be used directly. If it is a pointer, however, some care must be taken.
    When a pointer is used to refer to user space, we must ensure that the user address is valid.
    To start, address verification (without transferring data) is implemented by the function access_ok, which is declared in <asm/uaccess.h>:
    
         int access_ok(int type, const void *addr, unsigned long size);
    
    where:
    • type
    • VERIFY_READ(reading the user-space memory area) or VERIFY_WRITE(writing the user-space memory area) which is a superset of VERIFY_READ.
    • addr
    • a user-space address
    • size
    • a byte count
    access_ok() checks the address interval [ addr , addr + size – 1 ] then returns a boolean value: 1 for success (access is OK) and 0 for failure (access is not OK). (In kernels since 5.0, the type argument has been dropped and the prototype is access_ok(addr, size).)
    The scull source exploits the bitfields in the ioctl number to check the arguments before the switch:
    
    int err = 0, tmp;
    int retval = 0;
    /*
    * extract the type and number bitfields, and don't decode
    * wrong cmds: return ENOTTY (inappropriate ioctl) before access_ok() */
    if (_IOC_TYPE(cmd) != SCULL_IOC_MAGIC) return -ENOTTY;
    if (_IOC_NR(cmd) > SCULL_IOC_MAXNR) return -ENOTTY;
    /*
     * the direction is a bitmask, and VERIFY_WRITE catches R/W
     * transfers. `Type' is user-oriented, while
     * access_ok is kernel-oriented, so the concept of "read" and
     * "write" is reversed
     */
    if (_IOC_DIR(cmd) & _IOC_READ)
         err = !access_ok(VERIFY_WRITE, (void __user *)arg, _IOC_SIZE(cmd));
    else if (_IOC_DIR(cmd) & _IOC_WRITE)
         err = !access_ok(VERIFY_READ, (void __user *)arg, _IOC_SIZE(cmd));
    if (err) return -EFAULT;
    
    The main single-value transfer routines are defined in <asm-generic/uaccess.h>. They automatically use the right size if we just have the right pointer type. This generic version falls back to copy_{from,to}_user, which should provide a fast path for small values. They are relatively fast and should be called instead of copy_to_user whenever single values are being transferred.
    
    #define __put_user(x, ptr) \
    ({        \
     __typeof__(*(ptr)) __x = (x);    \
     int __pu_err = -EFAULT;     \
            __chk_user_ptr(ptr);                                    \
     switch (sizeof (*(ptr))) {    \
     case 1:       \
     case 2:       \
     case 4:       \
     case 8:       \
      __pu_err = __put_user_fn(sizeof (*(ptr)), \
          ptr, &__x);  \
      break;      \
     default:      \
      __put_user_bad();    \
      break;      \
      }       \
     __pu_err;      \
    })
    
    #define put_user(x, ptr)     \
    ({        \
     void __user *__p = (ptr);    \
     might_fault();      \
     access_ok(VERIFY_WRITE, __p, sizeof(*ptr)) ?  \
      __put_user((x), ((__typeof__(*(ptr)) __user *)__p)) : \
      -EFAULT;     \
    })
    
    • put_user() checks to ensure that the process is able to write to the given memory address.
    • __put_user() should only be used if the memory region has already been verified with access_ok.
    • get_user(local, ptr) and __get_user(local, ptr) behave like put_user and __put_user, but transfer data in the opposite direction.

    Capabilities and Restricted Operations


    For the purpose of performing permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero).
    Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to full permission checking based on the process's credentials (usually: effective UID, effective GID, and supplementary group list).
    Starting with kernel 2.2, Linux divides the privileges into distinct units, known as capabilities, which can be independently enabled and disabled. Capabilities are a per-thread attribute. In this way, a particular user (or program) can be empowered to perform a specific privileged operation without giving away the ability to perform other unrelated operations. There are two system calls used to allow capabilities to be managed from user space: capget and capset.
    The full set of capabilities can be found in <linux/capability.h>.
    Capability checks are performed with the capable() function (defined in <linux/sched.h>):
    
          int capable(int capability);
    

    Many system administration operations are bound to the CAP_SYS_ADMIN capability. You can check whether the calling process holds it:
    
          if (! capable(CAP_SYS_ADMIN) )
                 return -EPERM;
    

    The Implementation of the ioctl Commands


    The scull implementation of ioctl only transfers the configurable parameters of the device:
    
    switch(cmd) {
      case SCULL_IOCRESET:
        scull_quantum = SCULL_QUANTUM;
        scull_qset = SCULL_QSET;
        break;
      case SCULL_IOCSQUANTUM: /* Set: arg points to the value */
        if (! capable (CAP_SYS_ADMIN))
           return -EPERM;
    retval = __get_user(scull_quantum, (int __user *)arg);
    break;
      case SCULL_IOCTQUANTUM: /* Tell: arg is the value */
        if (! capable (CAP_SYS_ADMIN))
            return -EPERM;
        scull_quantum = arg;
        break;
      case SCULL_IOCGQUANTUM: /* Get: arg is pointer to result */
        retval = __put_user(scull_quantum, (int __user *)arg);
        break;
      case SCULL_IOCQQUANTUM: /* Query: return it (it's positive) */
        return scull_quantum;
      case SCULL_IOCXQUANTUM: /* eXchange: use arg as pointer */
        if (! capable (CAP_SYS_ADMIN))
           return -EPERM;
        tmp = scull_quantum;
    retval = __get_user(scull_quantum, (int __user *)arg);
    if (retval == 0)
        retval = __put_user(tmp, (int __user *)arg);
    break;
      case SCULL_IOCHQUANTUM: /* sHift: like Tell + Query */
        if (! capable (CAP_SYS_ADMIN))
           return -EPERM;
        tmp = scull_quantum;
        scull_quantum = arg;
        return tmp;
      default:  /* redundant, as cmd was checked against MAXNR */
             return -ENOTTY;
     }
    return retval;
    
    The corresponding ways to pass and receive arguments from the caller’s point of view (i.e., from user space):
    
        ioctl(fd,SCULL_IOCSQUANTUM, &quantum);
        ioctl(fd,SCULL_IOCTQUANTUM, quantum);
        ioctl(fd,SCULL_IOCGQUANTUM, &quantum);
        quantum = ioctl(fd,SCULL_IOCQQUANTUM);
        ioctl(fd,SCULL_IOCXQUANTUM, &quantum);
        quantum = ioctl(fd,SCULL_IOCHQUANTUM, quantum);
    
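    Command numbers such as SCULL_IOCSQUANTUM are built with the kernel's _IO/_IOR/_IOW macros, which pack a magic number, an ordinal, a direction, and an argument size into one integer. The sketch below uses a hypothetical magic number 'k' (an assumption, not scull's actual magic) to show how a command is encoded and how the driver side can decode its fields:
    
    ```c
    #include <stdio.h>
    #include <assert.h>
    #include <linux/ioctl.h>   /* _IO/_IOR/_IOW and the _IOC_* decode macros */

    /* Hypothetical scull-style commands; the magic number 'k' and the
     * command ordinals are illustrative only. */
    #define DEMO_IOCSQUANTUM _IOW('k', 1, int)  /* Set: arg points to the value */
    #define DEMO_IOCGQUANTUM _IOR('k', 2, int)  /* Get: arg points to the result */
    #define DEMO_IOCTQUANTUM _IO('k', 3)        /* Tell: arg is the value itself */

    int main(void)
    {
        unsigned int cmd = DEMO_IOCSQUANTUM;

        /* A driver can decode the packed fields back out of cmd: */
        assert(_IOC_TYPE(cmd) == 'k');          /* magic number */
        assert(_IOC_NR(cmd) == 1);              /* ordinal */
        assert(_IOC_SIZE(cmd) == sizeof(int));  /* argument size */
        assert(_IOC_DIR(cmd) == _IOC_WRITE);    /* user writes to driver */

        printf("SQUANTUM=%#x GQUANTUM=%#x TQUANTUM=%#x\n",
               DEMO_IOCSQUANTUM, DEMO_IOCGQUANTUM, DEMO_IOCTQUANTUM);
        printf("decode ok\n");
        return 0;
    }
    ```
    
    Encoding the argument size into the command is what lets a driver validate user pointers with access_ok() before using __get_user()/__put_user() as scull does.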

    Device Control Without ioctl


    Controlling by write is definitely the way to go for those devices that don’t transfer data but just respond to commands, such as robotic devices.

    For instance, if the “device” is simply a pair of stepper motors, the driver can interpret what is being written as ASCII commands and convert the requests into sequences of impulses that drive the motors.

    Blocking I/O


    This section shows how to put a process to sleep and wake it up again later on.

    Introduction to Sleeping


    When a process is put to sleep, it is marked as being in a special state and removed from the scheduler’s run queue.
    Never sleep when you are running in an atomic context such as holding a spinlock.
    Making it possible for a sleeping process to be found later is accomplished through a data structure called a wait queue. A wait queue represents a set of sleeping processes, which are woken up by the kernel when some condition becomes true.
    Wait queues are implemented as doubly linked lists whose elements include pointers to process descriptors. The lists are protected from concurrent access by the spinlock_t lock in the wait queue head.
    Each wait queue is identified by a wait queue head, a data structure of type wait_queue_head_t defined in <linux/wait.h>:
    
    struct wait_queue_entry {
    	unsigned int		flags;
    	void			*private;
    	wait_queue_func_t	func;
    	struct list_head	entry;
    };
    
    struct wait_queue_head {
     spinlock_t  lock;
     struct list_head head;
    };
    typedef struct wait_queue_head wait_queue_head_t;
    
    /*
     * Macros for declaration and initialisation of the datatypes
     */
    
    #define __WAITQUEUE_INITIALIZER(name, tsk) {     \
     .private = tsk,       \
     .func  = default_wake_function,    \
     .entry  = { NULL, NULL } }
    
    #define DECLARE_WAITQUEUE(name, tsk)      \
     struct wait_queue_entry name = __WAITQUEUE_INITIALIZER(name, tsk)
    
    #define __WAIT_QUEUE_HEAD_INITIALIZER(name) {     \
     .lock  = __SPIN_LOCK_UNLOCKED(name.lock),   \
     .head  = { &(name).head, &(name).head } }
    
    #define DECLARE_WAIT_QUEUE_HEAD(name) \
     struct wait_queue_head name = __WAIT_QUEUE_HEAD_INITIALIZER(name)
    
    extern void __init_waitqueue_head(struct wait_queue_head *wq_head, const char *name, struct lock_class_key *);
    
    #define init_waitqueue_head(wq_head)      \
     do {         \
      static struct lock_class_key __key;    \
              \
      __init_waitqueue_head((wq_head), #wq_head, &__key);  \
     } while (0)
    
    

    A wait queue head can be initialized
    • statically
    • 
           DECLARE_WAIT_QUEUE_HEAD(my_queue);
      
    • dynamically
    • 
           wait_queue_head_t my_queue;
           init_waitqueue_head(&my_queue);
      
      

    Simple Sleeping


    The simplest way of sleeping in the Linux kernel is a macro called wait_event(). It combines handling the details of sleeping with a check on the condition a process is waiting for.
    
         wait_event(queue, condition)
         wait_event_interruptible(queue, condition)
         wait_event_timeout(queue, condition, timeout)
         wait_event_interruptible_timeout(queue, condition, timeout)
    
    The condition is an arbitrary boolean expression that is evaluated by the macro.
    With wait_event(), the process sleeps uninterruptibly (TASK_UNINTERRUPTIBLE) until the condition evaluates to true; the _interruptible variants sleep in TASK_INTERRUPTIBLE and can also be woken by a signal. The condition is re-checked each time the wait queue is woken up.

    The preferred alternative is wait_event_interruptible(), which can be interrupted by signals. This version returns an integer value that you should check: a nonzero value means the sleep was interrupted by some sort of signal, and your driver should probably return -ERESTARTSYS. A return value of zero means the process was woken up with the condition true.
    On the other side, the basic function that wakes up sleeping processes is called wake_up.

    
         void wake_up(wait_queue_head_t *queue);
         void wake_up_interruptible(wait_queue_head_t *queue);
    
    wake_up() wakes up all processes waiting on the given queue.

    A sample module called sleepy which implements a device with simple behavior:
    • Any process that attempts to read from the device is put to sleep.
    • Whenever a process writes to the device, all sleeping processes are awakened.
    
         static DECLARE_WAIT_QUEUE_HEAD(wq);
         static int flag = 0;
    
         ssize_t sleepy_read (struct file *filp, char __user *buf, size_t count, loff_t *pos) {
             printk(KERN_DEBUG "process %i (%s) going to sleep\n", current->pid, current->comm);
             wait_event_interruptible(wq, flag != 0);
             flag = 0;
             printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
             return 0; /* EOF */
         }
    
         ssize_t sleepy_write (struct file *filp, const char __user *buf, size_t count, loff_t *pos)
         {
             printk(KERN_DEBUG "process %i (%s) awakening the readers...\n", current->pid, current->comm);
             flag = 1;
             wake_up_interruptible(&wq);
             return count; /* succeed, to avoid retrial */
         }
    
    


    Blocking and Nonblocking Operations


    Explicitly nonblocking I/O is indicated by the O_NONBLOCK flag in filp->f_flags. O_NDELAY is an alternate name for O_NONBLOCK, accepted for compatibility with System V code.
    The blocking operation is the default standard semantics:
    • If a process calls read() but no data is (yet) available, the process must block. The process is awakened as soon as some data arrives, and that data is returned to the caller, even if there is less than the amount requested in the count argument to the method.
    • The input buffer is required to avoid losing data that arrives when nobody is reading.
    • If a process calls write() and there is no space in the buffer, the process must block, and it must be on a different wait queue from the one used for reading. When some data has been written to the hardware device, and space becomes free in the output buffer, the process is awakened and the write() call succeeds, although the data may be only partially written if there isn’t room in the buffer for the count bytes that were requested.
    • The output buffer is useful for getting better performance out of the hardware.
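    The effect of O_NONBLOCK can be observed from user space with an ordinary pipe, which follows the same semantics as a char device: reading an empty pipe blocks by default, but fails immediately with EAGAIN once the flag is set. A minimal sketch:
    
    ```c
    #include <stdio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        char c;

        if (pipe(fds) == -1) {
            perror("pipe");
            return 1;
        }

        /* Switch the read end to nonblocking mode via fcntl(),
         * just as a user program would for a char device. */
        int flags = fcntl(fds[0], F_GETFL);
        fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

        /* The pipe is empty: a blocking read would sleep here.
         * With O_NONBLOCK the call fails immediately instead. */
        ssize_t n = read(fds[0], &c, 1);
        if (n == -1 && errno == EAGAIN)
            printf("read would block: EAGAIN\n");

        close(fds[0]);
        close(fds[1]);
        return 0;
    }
    ```
    
    This is exactly the behavior a driver implements by checking filp->f_flags & O_NONBLOCK and returning -EAGAIN instead of sleeping.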

    A Blocking I/O Example


    This example is taken from the scullpipe driver; it is a special form of scull that implements a pipe-like device.
    This driver can be run without requiring any particular hardware or an interrupt handler:
    • writing process wakes the reading process
    • reading processes are used to wake writer processes that are waiting for buffer space to become available
    The device driver uses a device structure that contains two wait queues and a buffer.
    
    struct scull_pipe {
      wait_queue_head_t inq, outq;
      char *buffer, *end;
      int buffersize;
      char *rp, *wp;
      int nreaders, nwriters;
      struct fasync_struct *async_queue; /* asynchronous readers */
      struct semaphore sem;              /* mutual exclusion semaphore */
      struct cdev cdev;      
    };
     
    

    The read implementation manages both blocking and nonblocking input and looks like this:
    
    static ssize_t scull_p_read (struct file *filp, char __user *buf, size_t count, loff_t *f_pos)
    {
        struct scull_pipe *dev = filp->private_data;
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    
        while (dev->rp == dev->wp) { /* nothing to read */ 
            up(&dev->sem); /* release the lock */
            if (filp->f_flags & O_NONBLOCK)
                return -EAGAIN;
            PDEBUG("\"%s\" reading: going to sleep\n", current->comm);
            if (wait_event_interruptible(dev->inq, (dev->rp != dev->wp)))
                return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
            /* otherwise loop, but first reacquire the lock */
            if (down_interruptible(&dev->sem))
                return -ERESTARTSYS;
        }
    
        /* ok, data is there, return something */
        if (dev->wp > dev->rp)
            count = min(count, (size_t)(dev->wp - dev->rp));
        else /* the write pointer has wrapped, return data up to dev->end */
            count = min(count, (size_t)(dev->end - dev->rp));
        if (copy_to_user(buf, dev->rp, count)) {
            up(&dev->sem);
            return -EFAULT;
        }
        dev->rp += count;
        if (dev->rp == dev->end)
            dev->rp = dev->buffer; /* wrapped */
        up (&dev->sem);
        /* finally, awake any writers and return */
        wake_up_interruptible(&dev->outq);
        PDEBUG("\"%s\" did read %li bytes\n",current->comm, (long)count);
        return count;
    }
    
    When we exit from the while loop, we know that the semaphore is held and the buffer contains data that we can use.
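    The rp/wp arithmetic above can be exercised on its own in user space. The sketch below reproduces the same circular-buffer read logic (with memcpy standing in for copy_to_user; all names are illustrative, not part of the real driver):
    
    ```c
    #include <stdio.h>
    #include <string.h>

    #define BUFSIZE 8

    /* Minimal stand-ins for the scull_pipe buffer fields. */
    static char buffer[BUFSIZE];
    static char *end = buffer + BUFSIZE;
    static char *rp = buffer, *wp = buffer;

    /* Same logic as scull_p_read: read up to count bytes, stopping at
     * wp, or at the physical end of the buffer in the wrapped case. */
    static size_t demo_read(char *dst, size_t count)
    {
        if (rp == wp)
            return 0; /* empty: the real driver would sleep here */
        if (wp > rp)
            count = count < (size_t)(wp - rp) ? count : (size_t)(wp - rp);
        else /* write pointer has wrapped: return data up to end */
            count = count < (size_t)(end - rp) ? count : (size_t)(end - rp);
        memcpy(dst, rp, count);   /* copy_to_user() in the driver */
        rp += count;
        if (rp == end)
            rp = buffer;          /* wrapped */
        return count;
    }

    int main(void)
    {
        char out[BUFSIZE + 1];

        /* Simulate a writer that wrapped: wp is behind rp. */
        memcpy(buffer, "ABCDEFGH", BUFSIZE);
        rp = buffer + 6;  /* two bytes ("GH") before the physical end */
        wp = buffer + 3;  /* three more ("ABC") after the wrap */

        size_t n = demo_read(out, sizeof(out) - 1);
        out[n] = '\0';
        printf("first chunk: %s\n", out);   /* stops at the buffer end */

        n = demo_read(out, sizeof(out) - 1);
        out[n] = '\0';
        printf("second chunk: %s\n", out);  /* rest after the wrap */
        return 0;
    }
    ```
    
    Note that, like the driver, a single read stops at the physical end of the buffer; the caller gets the remaining data on the next call. This is why returning less than count is legitimate.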

    Task Structure and Process Table

    Every process under Linux is dynamically allocated a struct task_struct structure.
    
    /* Used in tsk->state: */
    #define TASK_RUNNING			0x0000
    #define TASK_INTERRUPTIBLE		0x0001
    #define TASK_UNINTERRUPTIBLE		0x0002
    #define __TASK_STOPPED			0x0004
    #define __TASK_TRACED			0x0008
    
    struct task_struct {
      ...
      volatile long			state;
      ...
    };
    
    Under Linux, there are three kinds of processes:
    • the idle thread(s)
    • The idle thread is created at compile time for the first CPU; it is then "manually" created for each CPU by means of arch-specific fork_by_hand()
    • kernel threads
    • Kernel threads are created using the kernel_thread() function, which invokes the clone(2) system call in kernel mode.
      Kernel threads usually have no user address space, i.e. p->mm == NULL; they can always access kernel address space directly. They are allocated pid numbers in the low range.
    • user tasks
    • User tasks are created by means of clone(2) or fork(2) system calls, both of which internally invoke kernel/fork.c: do_fork().

    Advanced Sleeping


    This section looks at what is really going on when a process sleeps.

    How a process sleeps


    Which process has access to a system’s CPU(s) at a given time is decided by the scheduler.
    The job of a scheduler is to arbitrate access to the current CPU between multiple processes.
    Linux triggers the scheduler by using a timer interrupt. The amount of CPU time a process is allowed to run before it is preempted is called its timeslice.
    Schedulers typically use some type of process queue to manage the execution of processes on the system. In Linux, this process queue is called the run queue.
    In Linux, the run queue is composed of two priority arrays:
    • Active
    • Store processes that have not yet used up their timeslice
    • Expired
    • Store processes that have used up their timeslice
    From a high level, the scheduler’s job in Linux is to take the highest priority active processes, let them use the CPU to execute, and place them in the expired array when they use up their timeslice.

    When a timer event occurs, the current process is put on hold and the Linux kernel itself takes control of the CPU.
    When the timer event finishes, the Linux kernel normally passes control back to the process that was put on hold. However, when the held process has been marked as needing rescheduling, the kernel calls schedule() to choose which process to activate instead of the process that was executing before the kernel took control. The process that was executing before the kernel took control is called the current process.
    Whenever you call schedule(), you are telling the kernel to consider which process should be running and to switch control to that process if necessary. So you never know how long it will be before schedule returns to your code.

    There are several task states defined in <linux/sched.h>, two states are associated with sleeping :
    • TASK_RUNNING
    • the process is able to run
    • TASK_INTERRUPTIBLE
    • the process is asleep and can be woken by a signal
    • TASK_UNINTERRUPTIBLE
    • the process is asleep and ignores signals
    Tasks that are sleeping (blocked) are in one of these 2 special non-runnable states.
    Both types of sleeping tasks sit on a wait queue, waiting for an event to occur, and are not runnable.
    This is important because without this special state, the scheduler would select tasks that did not want to run or, worse, sleeping would have to be implemented as busy looping.

    Sleeping is handled via wait queues. A wait queue is a simple list of processes waiting for an event to occur.
    Wait queues are represented in the kernel by wait_queue_head_t.
    
    struct list_head {
     struct list_head *next, *prev;
    };
    
    struct wait_queue_head {
     spinlock_t  lock;
     struct list_head head;
    };
    typedef struct wait_queue_head wait_queue_head_t;
    
    
    The wait_queue_entry structure contains information about the sleeping process and exactly how it would like to be woken up.
    Wait queue entries are created statically via DECLARE_WAITQUEUE(); wait queue heads are declared statically via DECLARE_WAIT_QUEUE_HEAD() or initialized dynamically via init_waitqueue_head().
    
    #define __WAITQUEUE_INITIALIZER(name, tsk) {     \
     .private = tsk,       \
     .func  = default_wake_function,    \
     .entry  = { NULL, NULL } }
    
    #define DECLARE_WAITQUEUE(name, tsk)      \
     struct wait_queue_entry name = __WAITQUEUE_INITIALIZER(name, tsk)
    
    
    For ex.,
    
    DECLARE_WAITQUEUE(my_wait, current)
    
    

    Manual sleeps


    The task performs the following steps to add itself to a wait queue:
    • the creation and initialization of a wait queue entry
    • 
      		DEFINE_WAIT(my_wait); // like DECLARE_WAITQUEUE(my_wait, current), but with autoremove_wake_function as the wake function
      	
      In include/linux/wait.h:
      
      #define DEFINE_WAIT_FUNC(name, function)     \
       struct wait_queue_entry name = {     \
        .private = current,     \
        .func  = function,     \
        .entry  = LIST_HEAD_INIT((name).entry),   \
       }
      
      #define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)
      	
    • Adds itself to a wait queue via add_wait_queue()
    • 
      void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
      	
      This wait queue awakens the process when the condition for which it is waiting occurs. Of course, there needs to be code elsewhere that calls wake_up() on the queue when the event actually does occur.
    • change the process state by prepare_to_wait()
    • 
      while (!condition) { /* condition is the event that we are waiting for */
          prepare_to_wait(&q, &my_wait, state);
          if (signal_pending(current)) /* handle signal */
              break;
          if (!condition) /* re-check after joining the queue, to avoid a missed wakeup */
              schedule();
      }
      	

      state should be TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.
      If the state is set to TASK_INTERRUPTIBLE, a signal wakes the process up. The call to signal_pending() tells us whether we were awakened by a signal; if so, we need to return to the user and let them try again later. Otherwise, we reacquire the semaphore, and test again. After calling prepare_to_wait(), the process can call schedule(). When the task awakens, it again checks whether the condition is true. If it is, it exits the loop. Otherwise, it again calls schedule() and repeats.

    • removes itself from the wait queue
    • The task sets itself to TASK_RUNNING and removes itself from the wait queue via finish_wait():
      
      finish_wait(&q, &my_wait);
      
      
    The example code that handles the actual sleep is:
    
    if (down_interruptible(&dev->sem))
         return -ERESTARTSYS;
    
    while (spacefree(dev) == 0) { /* full */ 
         DEFINE_WAIT(wait);
    
         up(&dev->sem);
         if (filp->f_flags & O_NONBLOCK)
             return -EAGAIN;
         PDEBUG("\"%s\" writing: going to sleep\n",current->comm); 
         prepare_to_wait(&dev->outq, &wait, TASK_INTERRUPTIBLE);
         if (spacefree(dev) == 0)
             schedule( ); 
         finish_wait(&dev->outq, &wait); 
         if ( signal_pending(current) )
             return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
         if (down_interruptible(&dev->sem))
             return -ERESTARTSYS;
    }
    up(&dev->sem);
    
    
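    The prepare_to_wait() / schedule() / finish_wait() discipline has a close user-space analogue in pthread condition variables: the mutex plays the role of dev->sem, pthread_cond_wait() atomically drops it while sleeping (like prepare_to_wait() plus schedule()), and the condition is always re-checked in a loop. This is only an analogy, not kernel code; all names below are illustrative:
    
    ```c
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER; /* plays the wait queue */
    static int space_free = 0;                             /* the condition */

    static void *writer(void *arg)
    {
        pthread_mutex_lock(&lock);
        while (!space_free) {          /* always re-check: wakeups can be spurious */
            /* Like prepare_to_wait() + schedule(): atomically releases the
             * lock and sleeps; reacquires the lock before returning. */
            pthread_cond_wait(&cond, &lock);
        }
        printf("writer: space available, writing\n");
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);

        sleep(1);                      /* let the writer block first */

        pthread_mutex_lock(&lock);
        space_free = 1;                /* make the condition true... */
        pthread_cond_broadcast(&cond); /* ...then wake, like wake_up_interruptible() */
        pthread_mutex_unlock(&lock);

        pthread_join(t, NULL);
        return 0;
    }
    ```
    
    The key shared property is that checking the condition and going to sleep must not leave a window in which a wakeup can be lost; the kernel achieves this with prepare_to_wait() ordering, pthreads with the atomic unlock-and-sleep of pthread_cond_wait().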

    Exclusive waits


    If the number of processes in the wait queue is large, the sudden awakening of many processes (the “thundering herd” problem) can seriously degrade the performance of the system.

    poll and select
