Linux Device Drivers -II

Content:
  • 4. Debugging Techniques
  • 5. Concurrency and Race Conditions
  • 6. Advanced Char Driver Operations




Chapter 4: Debugging Techniques


Debugging Support in the Kernel


The kernel provides several debugging features; all of these configuration options are found under the “Kernel hacking” menu.
  • CONFIG_DEBUG_KERNEL
  • This option just makes other debugging options available.
  • CONFIG_DEBUG_SLAB
  • This crucial option turns on several types of checks in the kernel memory allocation functions.
  • CONFIG_DEBUG_PAGEALLOC
  • This can quickly point out certain kinds of memory corruption errors.
  • CONFIG_DEBUG_SPINLOCK
  • The kernel catches operations on uninitialized spin-locks and various other errors (such as unlocking a lock twice).
  • CONFIG_DEBUG_SPINLOCK_SLEEP
  • This option enables a check for attempts to sleep while holding a spinlock.
  • CONFIG_INIT_DEBUG
  • This option enables checks for code that attempts to access initialization-time memory after initialization is complete.
  • CONFIG_DEBUG_INFO
  • To debug the kernel with gdb.
  • CONFIG_MAGIC_SYSRQ
  • Enable the “magic SysRq key” functions (see the “System Hangs” section).
  • CONFIG_DEBUG_STACKOVERFLOW
  • Add explicit overflow checks to the kernel.
  • CONFIG_DEBUG_STACK_USAGE
  • Monitor stack usage and make some statistics available via the magic SysRq key.
  • CONFIG_KALLSYMS(under “General setup/Standard features”)
  • Cause kernel symbol information to be built into the kernel. With this, an oops listing can give you a kernel traceback in context.
  • CONFIG_IKCONFIG, CONFIG_IKCONFIG_PROC(in the “General setup” menu)
  • The full kernel configuration state to be built into the kernel and to be made available via /proc.
  • CONFIG_ACPI_DEBUG(Under “Power management/ACPI.”)
  • This option turns on verbose ACPI (Advanced Configuration and Power Interface) debugging information
  • CONFIG_DEBUG_DRIVER(Under “Device drivers.”)
  • useful for tracking down problems in the low-level support code
  • CONFIG_SCSI_CONSTANTS
  • Build in verbose, symbolic reporting of SCSI error information.
  • CONFIG_INPUT_EVBUG(under “Device drivers/Input device support”)
  • Turn on verbose logging of input events
  • CONFIG_PROFILING(under “Profiling support.”)
  • It's useful for tracking down some kernel hangs and related problems.

Debugging by Printing


printk


printk() is similar to printf(); its format string, while largely compatible with C99, doesn’t follow the exact same specification.
All printk() messages are printed to the kernel log buffer, which is a ring buffer exported to userspace through /dev/kmsg. The usual way to read it is using dmesg.
printk() is typically used like this:

printk(KERN_INFO "Message: %s\n", arg);
where KERN_INFO is the log level.
The log level specifies the importance of a message. The kernel decides whether to show the message immediately (printing it to the current console) depending on the message's log level and the current console_loglevel (a kernel variable).
Name          String  Alias function
KERN_EMERG    "0"     pr_emerg()
KERN_ALERT    "1"     pr_alert()
KERN_CRIT     "2"     pr_crit()
KERN_ERR      "3"     pr_err()
KERN_WARNING  "4"     pr_warn()
KERN_NOTICE   "5"     pr_notice()
KERN_INFO     "6"     pr_info()
KERN_DEBUG    "7"     pr_debug() and pr_devel() if DEBUG is defined
KERN_DEFAULT  ""
KERN_CONT     "c"     pr_cont()

If the log level is omitted, the message is printed with KERN_DEFAULT level.
There are eight possible loglevel strings, defined in the header <linux/kern_levels.h> :


#define KERN_SOH "\001"  /* ASCII Start Of Header */
#define KERN_SOH_ASCII '\001'

#define KERN_EMERG KERN_SOH "0" /* system is unusable */
#define KERN_ALERT KERN_SOH "1" /* action must be taken immediately */
#define KERN_CRIT KERN_SOH "2" /* critical conditions */
#define KERN_ERR KERN_SOH "3" /* error conditions */
#define KERN_WARNING KERN_SOH "4" /* warning conditions */
#define KERN_NOTICE KERN_SOH "5" /* normal but significant condition */
#define KERN_INFO KERN_SOH "6" /* informational */
#define KERN_DEBUG KERN_SOH "7" /* debug-level messages */

#define KERN_DEFAULT KERN_SOH "d" /* the default kernel loglevel */


In the macro expansion, each string prefixes the message with the ASCII SOH character followed by the loglevel digit.
Loglevels range from 0 to 7, with smaller values representing higher priorities.
If the message’s priority value is less than the integer variable console_loglevel, the message is delivered to the console one line at a time (nothing is sent until a trailing newline is provided).

/*
 * Default used to be hard-coded at 7, quiet used to be hardcoded at 4,
 * we're now allowing both to be set from kernel config.
 */
#define CONSOLE_LOGLEVEL_DEFAULT CONFIG_CONSOLE_LOGLEVEL_DEFAULT

int console_printk[4] = {
 CONSOLE_LOGLEVEL_DEFAULT, /* console_loglevel */
 MESSAGE_LOGLEVEL_DEFAULT, /* default_message_loglevel */
 CONSOLE_LOGLEVEL_MIN,  /* minimum_console_loglevel */
 CONSOLE_LOGLEVEL_DEFAULT, /* default_console_loglevel */
};

#define console_loglevel (console_printk[0])
If both klogd and syslogd are running on the system, kernel messages are appended to files under /var/log (typically /var/log/messages, depending on the syslogd configuration). If klogd is not running, the messages can be read via /proc/kmsg.
The text file /proc/sys/kernel/printk hosts four integer values:
  • the current loglevel
  • the default level for messages that lack an explicit loglevel
  • the minimum allowed loglevel
  • the boot-time default loglevel
You can cause all kernel messages to appear at the console by simply entering:

# echo 8 > /proc/sys/kernel/printk

Redirecting Console Messages


By default, the “console” is the current virtual terminal. To select a different virtual terminal to receive messages, you can issue ioctl(TIOCLINUX) on any console device.

How Messages Get Logged


The printk() function writes messages into a circular buffer that is __LOG_BUF_LEN bytes long: a value from 4 KB to 1 MB chosen while configuring the kernel. The function then wakes any process that is sleeping in the syslog() system call or that is reading /proc/kmsg.
The dmesg command can be used to look at the content of the buffer without flushing it; actually, the command returns to stdout the whole content of the buffer, whether or not it has already been read.

The klogd process retrieves kernel messages and dispatches them to syslogd.
In user space, the syslog() library call generates a log message, which is then distributed by syslogd.

Rate Limiting


When using a slow console device (e.g., a serial port), an excessive message rate can also slow down the system or just make it unresponsive.

Printing Device Numbers


The kernel provides a couple of utility macros (defined in <linux/kdev_t.h>) for this purpose:

     int print_dev_t(char *buffer, dev_t dev);
     char *format_dev_t(char *buffer, dev_t dev);

Debugging by Querying


Using the /proc Filesystem


The /proc filesystem is a special, software-created filesystem that is used by the kernel to export information to the world. Each file under /proc is tied to a kernel function that generates the file’s “contents” on the fly when the file is read.
The advantage of a /proc file is that there is no overhead until you actually ask for the data.

Adding files under /proc is discouraged; the recommended way of making information available in new code is via sysfs.

Implementing files in /proc


All modules that work with /proc should include <linux/proc_fs.h> to define the proper functions.
When a process reads from your /proc file, the kernel allocates a page of memory (i.e., PAGE_SIZE bytes) where the driver can write data to be returned to user space. In current kernels, /proc entries are created and removed with this interface:

struct proc_dir_entry *proc_create(const char *name, umode_t mode,
       struct proc_dir_entry *parent,
       const struct proc_ops *proc_ops)

void remove_proc_entry(const char *name, struct proc_dir_entry *parent)



struct proc_dir_entry {
...
 union {
  const struct proc_ops *proc_ops;
  const struct file_operations *proc_dir_ops;
 };
...
};


==== the following interface is deprecated ====

     int (*read_proc)(char *page, char **start, off_t offset, int count, int *eof, void *data);
  • page
  • A pointer to the page of memory (PAGE_SIZE bytes) allocated by the kernel.
  • start
  • Used to indicate where in page the interesting data has been written, so that reads at an arbitrary offset can be implemented.
  • offset
  • The file offset at which reading should start.
  • count
  • The maximum number of bytes to be read.
  • eof
  • An integer that must be set by the driver to signal that it has no more data to return.
  • data
  • A driver-specific data pointer you can use for internal bookkeeping.
This function should return the number of bytes of data actually placed in the page buffer.

Once you have a read_proc function defined, you need to connect it to an entry in the /proc hierarchy. This is done with a call to create_proc_read_entry:

     struct proc_dir_entry *create_proc_read_entry(const char *name, mode_t mode, struct proc_dir_entry *base, read_proc_t *read_proc, void *data);

Here is the call used by scull to make its /proc function available as /proc/scullmem:

     create_proc_read_entry("scullmem", 0 /* default mode */,
             NULL /* parent dir */, scull_read_procmem,
             NULL /* client data */);
Entries in /proc, of course, should be removed when the module is unloaded.

     remove_proc_entry("scullmem", NULL /* parent dir */);

The seq_file interface



/proc methods have become notorious for buggy implementations when the amount of output grows large (over one page). The seq_file interface provides a simple set of functions for implementing large kernel virtual files. It is based on a sequence, which is composed of three functions: start(), next(), and stop(). The seq_file API starts a sequence when a user reads the /proc file.

A sequence begins with a call to start(). If the return value is non-NULL, next() is called. This function is an iterator whose goal is to go through all the data. Each time next() is called, show() is also called; it writes data values into the buffer read by the user. next() is called until it returns NULL; the sequence then ends, and stop() is called.

BE CAREFUL: when a sequence is finished, another one starts. That means that at the end of stop(), start() is called again. This loop finishes when start() returns NULL.


The first step, inevitably, is the inclusion of <linux/seq_file.h>. Then you must create four iterator methods:
  • void *start(struct seq_file *sfile, loff_t *pos)
  • The start method is always called first. pos is an integer position indicating where the reading should start. The interpretation of the position is entirely up to the implementation. The position is often interpreted as a cursor pointing to the next item in the sequence. The scull driver interprets each device as one item in the sequence, so the incoming pos is simply an index into the scull_devices array.
    
    static void *scull_seq_start(struct seq_file *s, loff_t *pos)
         {
             if (*pos >= scull_nr_devs)
                 return NULL;   /* It's the end of the sequence, return NULL to stop reading */
             return scull_devices + *pos;
         }
    
  • void *next(struct seq_file *sfile, void *v, loff_t *pos)
  • Its job is to move the iterator forward to the next position in the sequence. The next() function returns a new iterator, or NULL if the sequence is complete.
  • void stop(struct seq_file *sfile, void *v);
  • This is called when the iteration is done. The scull implementation has no cleanup work to do, so its stop method is empty.
  • int show(struct seq_file *sfile, void *v)
  • In between these calls, the kernel calls the show method to actually output something interesting to user space. It should not use printk, however; instead, there is a special set of functions for seq_file output:
    • int seq_printf(struct seq_file *sfile, const char *fmt, ...)
    • int seq_putc(struct seq_file *sfile, char c)
    • int seq_puts(struct seq_file *sfile, const char *s)
    • int seq_escape(struct seq_file *m, const char *s, const char *esc)
    • int seq_path(struct seq_file *sfile, struct vfsmount *m, struct dentry *dentry, char *esc)
    the show method used in scull is:
    
    static int scull_seq_show(struct seq_file *s, void *v)
    {
        struct scull_dev *dev = (struct scull_dev *) v;
        struct scull_qset *d;
        int i;

        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
        seq_printf(s, "\nDevice %i: qset %i, q %i, sz %li\n",
                (int) (dev - scull_devices), dev->qset,
                dev->quantum, dev->size);
        for (d = dev->data; d; d = d->next) { /* scan the list */
            seq_printf(s, "  item at %p, qset at %p\n", d, d->data);
            if (d->data && !d->next) /* dump only the last item */
                for (i = 0; i < dev->qset; i++) {
                    if (d->data[i])
                        seq_printf(s, "    % 4i: %8p\n", i, d->data[i]);
                }
        }
        up(&dev->sem);
        return 0;
    }
    

    The ioctl Method


    As an alternative to using the /proc filesystem, you can implement a few ioctl commands tailored for debugging.
    You need another program to issue the ioctl and display the results. This program must be written, compiled, and kept in sync with the module you’re testing.

    Debugging by Watching


    Sometimes minor problems can be tracked down by watching the behavior of an application in user space.
    The strace command is a powerful tool that shows all the system calls issued by a user-space program. strace is most useful for pinpointing runtime errors from system calls.
    In the simplest case strace runs the specified command until it exits. It intercepts and records the system calls which are called by a process and the signals which are received by a process. The name of each system call, its arguments and its return value are printed on standard error or to the file specified with the -o option.

    Debugging System Faults


    A fault usually results in the destruction of the current process while the system goes on working.
    The kernel calls the close operation for any open device when a process dies, so your driver can release what was allocated by the open method.

    Oops Messages


    Almost any address used by the processor is a virtual address and is mapped to physical addresses through a complex structure of page tables.
    When an invalid pointer is dereferenced, the paging mechanism fails to map the pointer to a physical address, and the processor signals a page fault to the operating system. If the address is not valid, the kernel is not able to “page in” the missing address; it (usually) generates an oops if this happens while the processor is in supervisor mode.

    An oops displays the processor status at the time of the fault, including the contents of the CPU registers and other seemingly incomprehensible information. The message is generated by printk statements in the fault handler (arch/*/kernel/traps.c).
    In general, when you are confronted with an oops, the first thing to do is to look at the location where the problem happened, which is usually listed separately from the call stack.

    The stack itself is printed in hex form; you see a symbolic call stack only if your kernel is built with the CONFIG_KALLSYMS option turned on.

    Understanding a Kernel Oops!

    Setting up the machine to capture an Oops

    The running kernel should be compiled with CONFIG_DEBUG_INFO, and syslogd should be running.
    Let’s try to generate an Oops message with sample code, and try to understand the dump.
    
    #include <linux/kernel.h>
    #include <linux/module.h> 
    #include <linux/init.h> 
     
    static void create_oops(void) { 
            *(int *)0 = 0;  /* deliberate NULL pointer dereference */
    } 
     
    static int __init my_oops_init(void) { 
            printk(KERN_INFO "oops from the module\n"); 
            create_oops(); 
            return 0; 
    } 
    static void __exit my_oops_exit(void) { 
            printk("Goodbye world\n"); 
    } 
     
    module_init(my_oops_init); 
    module_exit(my_oops_exit);
    
    The associated Makefile for this module is as follows:
    
    obj-m   := oops.o 
    KDIR    := /lib/modules/$(shell uname -r)/build
    PWD     := $(shell pwd) 
    SYM=$(PWD) 
     
    all: 
            $(MAKE) -C $(KDIR) M=$(PWD) modules
    
    Once executed, the module generates the following Oops:
    
    
    

    System Hangs


    You can prevent an endless loop by inserting schedule() invocations at strategic points. The schedule() call invokes the scheduler and, therefore, allows other processes to steal CPU time from the current process. If a process is looping in kernel space due to a bug in your driver, the schedule() calls enable you to kill the process after tracing what is happening.
    Be sure, however, not to call schedule() any time that your driver is holding a spinlock.

    Magic SysRq (Magic System Request) is a kernel hack that enables the kernel to listen to specific key presses and respond by calling a specific kernel function. Magic SysRq is activated via input from the keyboard or a serial line.
    Enable Magic SysRq (CONFIG_MAGIC_SYSRQ and CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE respectively):
    
    Kernel hacking  --->
       [*] Magic SysRq key
       (0x1) Enable magic SysRq key functions by default
    
    On amd64 and x86 systems, the key combination "Alt + SysRq + command-key" results in a Magic SysRq invocation.
    command-key:
    • b
    • Immediately reboot the system without syncing or unmounting the disks.
    • e
    • Send a SIGTERM to all processes, except for init.
    • s
    • Attempts to sync all mounted filesystems.
    • u
    • Attempts to remount all mounted filesystems read-only.
    Other magic SysRq functions exist; see sysrq.txt in the Documentation directory of the kernel source for the full list.
    The file /proc/sysrq-trigger is a write-only entry point, where you can trigger a specific sysrq action by writing the associated command character.

    Debuggers and Related Tools

    Using gdb


    You should be aware that debugging with gdb this way has some definite limitations, because a user-space debugger is only peeking into the address space of a running kernel:
    • You can examine the contents of kernel space
    • But you are not able to:
      • Set breakpoints
      • Step through the kernel code
    To use gdb, you need to provide:
    • the uncompressed ELF kernel executable, vmlinux
    • The kernel sources now contain a script extract-vmlinux which can extract uncompressed vmlinux from a kernel image.
      
      # sudo -i
      # /usr/src/linux-headers-5.4.0-52-generic/scripts/extract-vmlinux /boot/vmlinuz-5.4.0-52-generic > vmlinux
        
      The uncompressed image must include the configuration options CONFIG_PROC_KCORE and CONFIG_DEBUG_INFO.
      To check it,
      
      /boot$ grep CONFIG_PROC_KCORE config-5.4.0-52-generic; grep CONFIG_DEBUG_INFO config-5.4.0-52-generic
      CONFIG_PROC_KCORE=y
      CONFIG_DEBUG_INFO=y  
        
      Your vmlinux file must match your running kernel, or symbols and addresses won’t match and you’ll get garbage during your gdb session.
    • the core file of the image
    • /proc/kcore.
      This file represents the physical memory of the system and is stored in the core file format.
      /proc/kcore displays a size equal to the size of the physical memory (RAM) in bytes, plus a 4 KB header.
      It is a huge file, because it represents the whole kernel address space, which corresponds to all physical memory.

    The values displayed will represent what they were at the time of invoking gdb.

    
    # gdb vmlinux /proc/kcore
    ...
    Reading symbols from vmlinux...(no debugging symbols found)...done.
    [New process 1]
    Core was generated by `BOOT_IMAGE=/boot/vmlinuz-5.4.0-52-generic root=UUID=b914dd46-1239-455a-b372-f54'.
    #0  0x0000000000000000 in ?? ()
    (gdb)   
    
    Note that, in order to have symbol information available for gdb, you must compile your kernel with the CONFIG_DEBUG_INFO option set.
    It's possible that the debug symbols were stripped when your vmlinuz image was packaged (e.g., when using make-kpkg to build a deb package for the Linux kernel).
    You have to use the built vmlinux file under your linux source tree to have those debug symbols.
    From within gdb, you can look at kernel variables by issuing the standard gdb commands. For example,
    
    (gdb) p jiffies 
    
    prints the number of clock ticks from system boot to the current time.
    Numerous capabilities normally provided by gdb are not available when you are working with the kernel. For example,
    • gdb is not able to modify kernel data; it expects to be running the program being debugged under its own control before playing with its memory image.
    • It is also not possible to set breakpoints or watchpoints, or to single-step through kernel functions.

    Linux loadable modules are ELF-format executable images, which typically have three sections relevant in a debugging session:
    • .text
    • This section contains the executable code for the module. The debugger must know where this section is to be able to give tracebacks or set breakpoints. (Neither of these operations is relevant when running the debugger on /proc/kcore, but they can be useful when working with kgdb.)
    • .bss
    • Any variable that is not initialized at compile time ends up in .bss.
    • .data
    • Variables that are initialized at compile time go into .data.

    Debugging the Linux kernel using the GDB

    Debugging Loadable Modules

    Debugging an external Linux kernel module requires some specific actions, especially because the symbols for this module are not part of the main vmlinux symbol file.
    Consider the following loadable module (in the source file hellop.c):
    
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/moduleparam.h>
    
    MODULE_LICENSE("Dual BSD/GPL");
                                                   
    
    static char *whom = "world";
    static int howmany = 1;
    module_param(howmany, int, S_IRUGO);
    module_param(whom, charp, S_IRUGO);
    
    static int hello_init(void)
    {
    	int i;
    	for (i = 0; i < howmany; i++)
    		printk(KERN_ALERT "(%d) Hello, %s\n", i, whom);
    	return 0;
    }
    
    static void hello_exit(void)
    {
    	printk(KERN_ALERT "Goodbye, cruel world\n");
    }
    
    module_init(hello_init);
    module_exit(hello_exit);
    
    • nm - list symbols from object file
    • 
      $ nm hellop.ko
      0000000000000050 T cleanup_module
                       U __fentry__
      0000000000000050 t hello_exit
      0000000000000000 t hello_init
      0000000000000000 d howmany
      0000000000000000 T init_module
      0000000000000000 r _note_6
      0000000000000028 r __param_howmany
                       U param_ops_charp
                       U param_ops_int
      0000000000000008 r __param_str_howmany
      0000000000000000 r __param_str_whom
      0000000000000000 r __param_whom
                       U printk
      0000000000000000 D __this_module
      0000000000000061 r __UNIQUE_ID_depends39
      0000000000000014 r __UNIQUE_ID_howmanytype37
      0000000000000029 r __UNIQUE_ID_license36
      0000000000000076 r __UNIQUE_ID_name37
      000000000000006a r __UNIQUE_ID_retpoline38
      000000000000003e r __UNIQUE_ID_srcversion40
      0000000000000082 r __UNIQUE_ID_vermagic36
      0000000000000000 r __UNIQUE_ID_whomtype38
      0000000000000008 d whom
      	
    • objdump - display information about one or more object files
      • -t, --syms
      • Display the contents of the symbol table(s).(all sections)
        
        $ objdump -t hellop.ko
        
        hellop.ko:     file format elf64-x86-64
        
        SYMBOL TABLE:
        0000000000000000 l    d  .note.gnu.build-id	0000000000000000 .note.gnu.build-id
        0000000000000000 l    d  .text	0000000000000000 .text
        0000000000000000 l    d  .rodata.str1.1	0000000000000000 .rodata.str1.1
        0000000000000000 l    d  __mcount_loc	0000000000000000 __mcount_loc
        0000000000000000 l    d  .modinfo	0000000000000000 .modinfo
        0000000000000000 l    d  __param	0000000000000000 __param
        0000000000000000 l    d  .rodata	0000000000000000 .rodata
        0000000000000000 l    d  .note.Linux	0000000000000000 .note.Linux
        0000000000000000 l    d  .data	0000000000000000 .data
        0000000000000000 l    d  .gnu.linkonce.this_module	0000000000000000 .gnu.linkonce.this_module
        0000000000000000 l    d  .bss	0000000000000000 .bss
        0000000000000000 l    d  .comment	0000000000000000 .comment
        0000000000000000 l    d  .note.GNU-stack	0000000000000000 .note.GNU-stack
        0000000000000000 l    df *ABS*	0000000000000000 hellop.c
        0000000000000000 l     F .text	0000000000000045 hello_init
        0000000000000000 l     O .data	0000000000000004 howmany
        0000000000000008 l     O .data	0000000000000008 whom
        0000000000000050 l     F .text	0000000000000017 hello_exit
        0000000000000000 l     O .modinfo	0000000000000014 __UNIQUE_ID_whomtype38
        0000000000000000 l     O __param	0000000000000028 __param_whom
        0000000000000000 l     O .rodata	0000000000000005 __param_str_whom
        0000000000000014 l     O .modinfo	0000000000000015 __UNIQUE_ID_howmanytype37
        0000000000000028 l     O __param	0000000000000028 __param_howmany
        0000000000000008 l     O .rodata	0000000000000008 __param_str_howmany
        0000000000000029 l     O .modinfo	0000000000000015 __UNIQUE_ID_license36
        0000000000000000 l    df *ABS*	0000000000000000 hellop.mod.c
        000000000000003e l     O .modinfo	0000000000000023 __UNIQUE_ID_srcversion40
        0000000000000061 l     O .modinfo	0000000000000009 __UNIQUE_ID_depends39
        000000000000006a l     O .modinfo	000000000000000c __UNIQUE_ID_retpoline38
        0000000000000076 l     O .modinfo	000000000000000c __UNIQUE_ID_name37
        0000000000000082 l     O .modinfo	000000000000002a __UNIQUE_ID_vermagic36
        0000000000000000 l     O .note.Linux	0000000000000018 _note_6
        0000000000000000 g     O .gnu.linkonce.this_module	0000000000000380 __this_module
        0000000000000050 g     F .text	0000000000000017 cleanup_module
        0000000000000000         *UND*	0000000000000000 __fentry__
        0000000000000000 g     F .text	0000000000000045 init_module
        0000000000000000         *UND*	0000000000000000 printk
        0000000000000000         *UND*	0000000000000000 param_ops_charp
        0000000000000000         *UND*	0000000000000000 param_ops_int
            
            	
    The current gdb session has no idea about those module symbols; you need to tell gdb how to obtain symbol information for the module.
    Information on where the module's various ELF sections were loaded into kernel space can be obtained by looking into the /sys/module/<module_name>/sections directory.
    
    /sys/module/hellop/sections$ ls -al
    total 0
    drwxr-xr-x 2 root root  0 Oct 23 16:16 .
    drwxr-xr-x 6 root root  0 Oct 23 16:16 ..
    -r-------- 1 root root 19 Oct 23 16:36 .data
    -r-------- 1 root root 19 Oct 23 16:37 .gnu.linkonce.this_module
    -r-------- 1 root root 19 Oct 23 16:37 __mcount_loc
    -r-------- 1 root root 19 Oct 23 16:37 .note.gnu.build-id
    -r-------- 1 root root 19 Oct 23 16:37 .note.Linux
    -r-------- 1 root root 19 Oct 23 16:37 __param
    -r-------- 1 root root 19 Oct 23 16:37 .rodata
    -r-------- 1 root root 19 Oct 23 16:37 .rodata.str1.1
    -r-------- 1 root root 19 Oct 23 16:37 .strtab
    -r-------- 1 root root 19 Oct 23 16:37 .symtab
    -r-------- 1 root root 19 Oct 23 16:16 .text
    
    /sys/module/hellop/sections$ sudo cat .text .rodata .data
    0xffffffffc09e4000
    0xffffffffc09e50b8
    0xffffffffc09e6000
    
    The module's symbol file is added to the gdb environment with the add-symbol-file command:
    
    (gdb) add-symbol-file <module.ko_path> <address of module's text section> \
      -s .data <address of module's data section>  -s .bss <address of module's bss section if available>
        
    We can now tell gdb all about those sections:
    
    (gdb) add-symbol-file /home/jerry/ldd4/misc-modules/hellop.ko 0xffffffffc09e4000 -s .rodata 0xffffffffc09e50b8 -s .data 0xffffffffc09e6000
    (y or n) y
    Reading symbols from /home/jerry/ldd4/misc-modules/hellop.ko...(no debugging symbols found)...done.
    
    Debug symbols were not enabled in the built module. To enable debug support, add the following to the Makefile and then build the module again:
    
    ccflags-y += -g
    
    Load the debug symbol again:
    
    (gdb) add-symbol-file /home/jerry/ldd4/misc-modules/hellop.ko 0xffffffffc09e4000 -s .rodata 0xffffffffc09e50b8 -s .data 0xffffffffc09e6000
    ...
    add symbol table from file "/home/jerry/ldd4/misc-modules/hellop.ko" at
    	.text_addr = 0xffffffffc09e4000
    	.rodata_addr = 0xffffffffc09e50b8
    	.data_addr = 0xffffffffc09e6000
    (y or n) y
    Reading symbols from /home/jerry/ldd4/misc-modules/hellop.ko...done.
    (gdb) p howmany
    $1 = 1
    (gdb) p whom
    $2 = 0xffffffffc09e504e "world"
    (gdb) 
    

    Chapter 5: Concurrency and Race Conditions


    Device driver programmers must now factor concurrency into their designs from the beginning, and they must have a strong understanding of the facilities provided by the kernel for concurrency management.

    Race conditions are a result of uncontrolled access to shared data. When the wrong access pattern happens, something unexpected results.

    Concurrency and Its Management


    Race conditions come about as a result of shared access to resources.
    So the first rule of thumb to keep in mind as you design your driver is to avoid shared resources whenever possible. The most obvious application of this idea is to avoid the use of global variables.

    Sharing is a fact of life. The usual technique for access management is called locking or mutual exclusion — making sure that only one thread of execution can manipulate a shared resource at any time.

    There are two main types of kernel locks.
    The fundamental type is the spinlock (include/asm/spinlock.h), which is a very simple single-holder lock: if you can't get the spinlock, you keep trying (spinning) until you can (busy wait). Spinlocks are very small and fast, and can be used anywhere.
    The second type is a mutex (include/linux/mutex.h), if you can't lock a mutex, your task will suspend itself, and be woken up when the mutex is released. This means the CPU can do something else while you are waiting.

    Cheat Sheet For Locking:
    1. If you are in a process context (any syscall) and want to lock other processes out, use a mutex. You can take a mutex and sleep (e.g., in copy_from_user() or kmalloc(x, GFP_KERNEL)). This is the common case for a kernel driver/module providing services to user processes.
    2. Otherwise (i.e., the data can be touched in an interrupt), use spin_lock_irqsave() and spin_unlock_irqrestore().
    3. Avoid holding a spinlock for more than 5 lines of code and across any function call (except accessors like readb()).

    Spinlock


    Unlike semaphores, spinlocks may be used in code that cannot sleep, such as interrupt handlers.
    The spinlock is a low-level synchronization mechanism which can be in two states:
    • acquired
    • released
    Code wishing to take out a particular lock tests the relevant bit. If the lock is available, the “locked” bit is set and the code continues into the critical section. If, instead, the lock has been taken by somebody else, the code goes into a tight loop where it repeatedly checks the lock until it becomes available.
    The “test and set” operation must be done in an atomic manner so that only one thread can obtain the lock, even if several are spinning at any given time.

    Locks and Uniprocessor Kernels


    Under non-preemptive scheduling, once the CPU has been allocated to a process, the process keeps the CPU until it releases the CPU either by terminating or by switching to the waiting state.

    For kernels compiled:

    • without CONFIG_SMP and without CONFIG_PREEMPT
    • spinlocks do not exist at all.
      This is an excellent design decision: when no-one else can run at the same time, there is no reason to have a lock.
    • compiled without CONFIG_SMP, but CONFIG_PREEMPT is set
    • spinlocks simply disable preemption, which is sufficient to prevent any races.
      For most purposes, we can think of preemption as equivalent to SMP, and not worry about it separately.
    You should always test your locking code with CONFIG_SMP and CONFIG_PREEMPT enabled, even if you don't have an SMP test box, because it will still catch some kinds of locking bugs.

    Introduction to the Spinlock API


    The spinlock is represented by the spinlock_t type in the Linux kernel:
    
    typedef struct raw_spinlock {
    	arch_spinlock_t raw_lock;
    #ifdef CONFIG_DEBUG_SPINLOCK
    	unsigned int magic, owner_cpu;
    	void *owner;
    #endif
    #ifdef CONFIG_DEBUG_LOCK_ALLOC
    	struct lockdep_map dep_map;
    #endif
    } raw_spinlock_t;
    
    typedef struct spinlock {
    	union {
    		struct raw_spinlock rlock;
    
    #ifdef CONFIG_DEBUG_LOCK_ALLOC
    # define LOCK_PADSIZE (offsetof(struct raw_spinlock, dep_map))
    		struct {
    			u8 __padding[LOCK_PADSIZE];
    			struct lockdep_map dep_map;
    		};
    #endif
    	};
    } spinlock_t;
    
    where the arch_spinlock_t represents architecture-specific spinlock implementation.
    • include
    • 
      #include <linux/spinlock.h>
      
    • init
    • 
      DEFINE_SPINLOCK(my_lock);               // at compile time (SPIN_LOCK_UNLOCKED was removed from modern kernels)
      void spin_lock_init(spinlock_t *lock);  // at run time
      
    • lock
    • 
      void spin_lock(spinlock_t *lock);
      
    • unlock
    • 
      void spin_unlock(spinlock_t *lock);
      

    Spinlocks and Atomic Context


    Any code holding a spinlock must be atomic.
    Whenever the kernel code holds a spinlock, preemption is disabled on the relevant processor.

    Many kernel functions can sleep (i.e., call schedule()), and this fact is not always well documented. For example:

    • Copying data to or from user space
    • The required user-space page may need to be swapped in from the disk before the copy can proceed. This operation clearly requires a sleep.
    • Memory allocation
    • kmalloc() can decide to give up the processor, and wait for more memory to become available unless it is explicitly told not to.

    The Spinlock Functions


    There are actually four functions that can lock a spinlock:
    • void spin_lock(spinlock_t *lock)
    • This is used for locking between kernel threads (user context) or bottom halves.
    • void spin_lock_irqsave(spinlock_t *lock, unsigned long flags)
    • This disables interrupts (on the relevant processor only) before acquiring the spinlock; the previous interrupt state is stored in flags.
      ( Interrupts can still occur but are routed to other available CPUs, on which the interrupt handler can spin until the lock becomes available. )
    • void spin_lock_irq(spinlock_t *lock)
    • This is used when you share data between a hardware ISR and bottom halves, because you have to disable the IRQ before acquiring the lock.
      This disables interrupts on the relevant CPU, so it is only safe when you know that interrupts were not already disabled before the acquisition of the lock. As the kernel grows in size and kernel code paths become increasingly hard to predict, it is suggested you not use this version unless you really know what you are doing.
    • void spin_lock_bh(spinlock_t *lock)
    • This is used when you share data between a bottom half and user context (like a kernel thread).
      It disables software interrupts before acquiring the lock, but leaves hardware interrupts enabled (allowing them to be serviced).
      This has the effect of preventing softirqs, tasklets, and bottom halves from running on the relevant CPU.
    The corresponding functions to release a spinlock:
    • void spin_unlock(spinlock_t *lock)
    • void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
    • void spin_unlock_irq(spinlock_t *lock)
    • void spin_unlock_bh(spinlock_t *lock)
    There is also a set of nonblocking spinlock operations:
    • int spin_trylock(spinlock_t *lock)
    • int spin_trylock_bh(spinlock_t *lock)

    Spinlock in Linux kernel example programming
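
    A minimal sketch of the usual pattern, assuming a hypothetical device (demo_dev, demo_isr, and the other names are made up for illustration; this is not a complete module):

```c
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/interrupt.h>

struct demo_dev {
    spinlock_t lock;
    int count;          /* shared between ISR and process context */
};

static struct demo_dev dev;

static int __init demo_setup(void)
{
    spin_lock_init(&dev.lock);
    dev.count = 0;
    return 0;
}

/* Called from the interrupt handler: plain spin_lock suffices here,
 * because interrupts are already disabled on this CPU. */
static irqreturn_t demo_isr(int irq, void *data)
{
    spin_lock(&dev.lock);
    dev.count++;
    spin_unlock(&dev.lock);
    return IRQ_HANDLED;
}

/* Called from process context (e.g. a read method): interrupts must be
 * disabled while the lock is held, or demo_isr could spin forever on
 * this CPU waiting for a lock we can never release. */
static int demo_get_count(void)
{
    unsigned long flags;
    int val;

    spin_lock_irqsave(&dev.lock, flags);
    val = dev.count;
    spin_unlock_irqrestore(&dev.lock, flags);
    return val;
}
```

    Note that the critical sections are only a few lines long and make no function calls, per the cheat sheet above.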

    Reader-Writer Spin Locks


    Sometimes, lock usage can be clearly divided into readers and writers.
    For example, consider a list that is both updated and searched. When the list is updated (written to), it is important that no other threads of execution concurrently write to or read from the list. Writing demands mutual exclusion. On the other hand, when the list is searched (read from), it is only important that nothing else write to the list. Multiple concurrent readers are safe so long as there are no writers.
    Linux provides reader-writer spin locks. Reader-writer spin locks provide separate reader and writer variants of the lock. One or more readers can concurrently hold the reader lock. The writer lock, conversely, can be held by at most one writer with no concurrent readers.

    Usage is similar to spin locks.
    • init
    • 
      #include <linux/spinlock.h> /* pulls in rwlock.h, which must not be included directly */
      
      DEFINE_RWLOCK(mr_rwlock);   /* RW_LOCK_UNLOCKED was removed from modern kernels */
      
    • in the reader code path
    • 
      read_lock(&mr_rwlock);
      /* critical section (read only) ... */
      read_unlock(&mr_rwlock);
      
    • in the writer code path
    • 
      write_lock(&mr_rwlock);
      /* critical section (read and write) ... */
      write_unlock(&mr_rwlock);
      
    A final important consideration in using the Linux reader-writer spin locks is that they favor readers over writers. If the read lock is held and a writer is waiting for exclusive access, readers that attempt to acquire the lock will continue to succeed. The spinning writer does not acquire the lock until all readers release the lock. Therefore, a sufficient number of readers can starve pending writers. This is important to keep in mind when designing your locking.

    Spin locks provide a very quick and simple lock. The spinning behavior is optimal for short hold times and code that cannot sleep (interrupt handlers, for example). In cases where the sleep time might be long or you potentially need to sleep while holding the lock, the semaphore is a solution.


    Semaphores and Mutexes


    We must set up critical sections: code that can be executed by only one thread at any given time.
    Not all critical sections are the same; some cannot always be entered immediately, so the kernel provides different primitives for different needs.
    When a Linux process reaches a point where it cannot make any further progress, it goes to sleep (or “blocks”), yielding the processor to somebody else until some future time when it can get work done again.

    A semaphore is a single integer value combined with a pair of functions that are typically called P and V. A process wishing to enter a critical section will call P on the relevant semaphore; if the semaphore’s value is greater than zero, that value is decremented by one and the process continues. If the semaphore’s value is 0 (or less), the process must wait until somebody else releases the semaphore. Unlocking a semaphore is accomplished by calling V; this function increments the value of the semaphore and, if necessary, wakes up processes that are waiting for the semaphore.
    When semaphores are used for mutual exclusion — keeping multiple processes from running within a critical section simultaneously — their value will be initially set to 1.

    Such a semaphore can be held only by a single process or thread at any given time. A semaphore used in this mode is sometimes called a mutex (short for “mutual exclusion”).

    The Linux Semaphore Implementation


    Semaphores in Linux are sleeping locks. When a task attempts to acquire a semaphore that is already held, the semaphore places the task onto a wait queue and puts the task to sleep. The processor is then free to execute other code. When the process holding the semaphore releases the lock, one of the tasks on the wait queue is awakened so that it can then acquire the semaphore.

    To use semaphores, kernel code must include <linux/semaphore.h>. The relevant type is struct semaphore:
    
    /* Please don't access any members of this structure directly */
    struct semaphore {
     raw_spinlock_t  lock;
     unsigned int  count;
     struct list_head wait_list;
    };
    
    #define __SEMAPHORE_INITIALIZER(name, n)    \
    {         \
     .lock  = __RAW_SPIN_LOCK_UNLOCKED((name).lock), \
     .count  = n,      \
     .wait_list = LIST_HEAD_INIT((name).wait_list),  \
    }
    
    #define DEFINE_SEMAPHORE(name) \
     struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
    
    static inline void sema_init(struct semaphore *sem, int val)
    {
     static struct lock_class_key __key;
     *sem = (struct semaphore) __SEMAPHORE_INITIALIZER(*sem, val);
     lockdep_init_map(&sem->lock.dep_map, "semaphore->lock", &__key, 0);
    }
    
    extern void down(struct semaphore *sem);
    extern int __must_check down_interruptible(struct semaphore *sem);
    extern int __must_check down_killable(struct semaphore *sem);
    extern int __must_check down_trylock(struct semaphore *sem);
    extern int __must_check down_timeout(struct semaphore *sem, long jiffies);
    extern void up(struct semaphore *sem);
    
    Usually, semaphores are used in a mutex mode.
    “down” refers to the fact that the function decrements the value of the semaphore, perhaps after putting the caller to sleep to wait for the semaphore to become available. There are three versions of down:
    • void down(struct semaphore *sem)
    • Acquires the semaphore.
      Use of this function is deprecated, please use down_interruptible() or down_killable() instead.
    • int down_interruptible(struct semaphore *sem)
    • Acquire the semaphore unless interrupted.
      If no more tasks are allowed to acquire the semaphore, calling this function will put the task to sleep. If the sleep is interrupted by a signal, this function will return -EINTR. If the semaphore is successfully acquired, this function returns 0.
    • int down_trylock(struct semaphore *sem)
    • Try to acquire the semaphore, without waiting.
      Returns 0 if the semaphore has been acquired successfully or 1 if it cannot be acquired. Unlike mutex_trylock, this function can be used from interrupt context, and the semaphore can be released by any task or interrupt.

    Once a thread has successfully called one of the versions of down, it is said to be “holding” the semaphore.
    Once up has been called, the caller no longer holds the semaphore.

    Using Semaphores


    Let’s define a structure:
    
         struct hello_dev {
            char   priv_data;
            struct semaphore sem;     /* mutual exclusion semaphore     */
            struct cdev cdev;     /* Char device structure      */
    };
    
    We have chosen to use a separate semaphore for each virtual device.

    Make sure that no accesses to the hello_dev data structure are made without holding the semaphore when the driver's method is called:
    
    if (down_interruptible(&dev->sem))
                return -ERESTARTSYS;
    
    The Driver's methods must release the semaphore before leaving:
    
    out:
           up(&dev->sem);
           return retval;
    

    Reader-Writer Semaphores


    Semaphores, like spin locks, also come in a reader-writer flavor. All reader-writer semaphores are mutexes (that is, their usage count is one).
    Reader-writer semaphores are represented by the struct rw_semaphore type, which is declared in <linux/rwsem.h>.
    • init
    • 
      static DECLARE_RWSEM(name); // or init_rwsem(struct rw_semaphore *sem)
      
      where name is the declared name of the new semaphore.
    • in reader's path
    • 
      /* attempt to acquire the semaphore for reading ... */
      down_read(&mr_rwsem);
      
      /* critical region (read only) ... */
      
      /* release the semaphore */
      up_read(&mr_rwsem);
      
    • in writer's path
    • 
      /* attempt to acquire the semaphore for writing ... */
      down_write(&mr_rwsem);
      
      /* critical region (read and write) ... */
      
      /* release the semaphore */
      up_write(&mr_rwsem);
      
      
    Reader-writer semaphores, like reader-writer spin locks, should not be used unless there is a clear separation between the write paths and the read paths in your code. Supporting the reader-writer mechanisms has a cost, and it is worthwhile only if your code naturally splits along a reader/writer boundary.

    Semaphores were the usual sleeping lock in kernels up to 2.6.16; since then, the dedicated mutex type has replaced most mutual-exclusion uses of semaphores, <linux/mutex.h>:
    
    /*
     * Simple, straightforward mutexes with strict semantics:
     *
     * - only one task can hold the mutex at a time
     * - only the owner can unlock the mutex
     * - multiple unlocks are not permitted
     * - recursive locking is not permitted
     * - a mutex object must be initialized via the API
     * - a mutex object must not be initialized via memset or copying
     * - task may not exit with mutex held
     * - memory areas where held locks reside must not be freed
     * - held mutexes must not be reinitialized
     * - mutexes may not be used in hardware or software interrupt contexts such as tasklets and timers
     */
    struct mutex {
     atomic_long_t  owner;
     spinlock_t  wait_lock;
    #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
     struct optimistic_spin_queue osq; /* Spinner MCS lock */
    #endif
     struct list_head wait_list;
    #ifdef CONFIG_DEBUG_MUTEXES
     void   *magic;
    #endif
    #ifdef CONFIG_DEBUG_LOCK_ALLOC
     struct lockdep_map dep_map;
    #endif
    };
    

    • mutex_init(struct mutex *lock)
    • Initialize the mutex to unlocked state.
      
      #define mutex_init(mutex)      \
      do {         \
       static struct lock_class_key __key;    \
               \
       __mutex_init((mutex), #mutex, &__key);    \
      } while (0)
      
      void
      __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
      {
       atomic_long_set(&lock->owner, 0);
       spin_lock_init(&lock->wait_lock);
       INIT_LIST_HEAD(&lock->wait_list);
      #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
       osq_lock_init(&lock->osq);
      #endif
      
       debug_mutex_init(lock, name, key);
      }
      EXPORT_SYMBOL(__mutex_init);
      
      The # operator ( known as the "Stringification Operator" which is defined in the C preprocessor ) converts a token into a C string literal, escaping any quotes or backslashes appropriately.
    • void __sched mutex_lock(struct mutex * lock)
    • Lock the mutex exclusively for this task.
      If the mutex is not available right now, it will sleep until it can get it. The mutex must later on be released by the same task that acquired it. Recursive locking is not allowed. The task may not exit without first unlocking the mutex. Also, kernel memory where the mutex resides must not be freed with the mutex still locked. The mutex must first be initialized (or statically defined) before it can be locked. memset-ing the mutex to 0 is not allowed. ( The CONFIG_DEBUG_MUTEXES .config option turns on debugging checks that will enforce the restrictions and will also do deadlock debugging. ) This function is similar to (but not equivalent to) down.
    • void __sched mutex_unlock(struct mutex * lock);
    • Unlock a mutex that has been locked by this task previously. This function must not be used in interrupt context. Unlocking of a not locked mutex is not allowed. This function is similar to (but not equivalent to) up.
    • int mutex_is_locked (struct mutex * lock)
    • Returns 1 if the mutex is locked, 0 if unlocked.
    • ww_mutex_unlock
    • Unlock a mutex that has been locked by this task previously with any of the ww_mutex_lock* functions (with or without an acquire context). It is forbidden to release the locks after releasing the acquire context.
    • int __sched mutex_lock_interruptible (struct mutex * lock)
    • Lock the mutex like mutex_lock, and return 0 if the mutex has been acquired or sleep until the mutex becomes available. If a signal arrives ( thread/process receives the signal ) while waiting for the lock then this function returns -EINTR. This function is similar to (but not equivalent to) down_interruptible.
    • mutex_trylock
    • Try to acquire the mutex atomically.
      Returns 1 if the mutex has been acquired successfully, and 0 on contention.
    • atomic_dec_and_mutex_lock
    • return true and hold lock if we dec to 0, return false otherwise
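
    A hedged sketch of the common pattern, assuming hypothetical names (demo_lock, demo_config, and demo_set_config are made up for illustration; this is not a complete driver):

```c
/* Process context only: mutex_lock_interruptible() may sleep. */
#include <linux/mutex.h>
#include <linux/errno.h>

static DEFINE_MUTEX(demo_lock);
static int demo_config;

static long demo_set_config(int val)
{
    if (mutex_lock_interruptible(&demo_lock))
        return -ERESTARTSYS;     /* a signal interrupted the wait */
    demo_config = val;           /* critical section */
    mutex_unlock(&demo_lock);
    return 0;
}
```

    Using the interruptible variant lets a user abort a process stuck waiting on the lock, at the cost of having to handle the error return.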

    Locking Traps


    In this section, we take a quick look at things that can go wrong.

    Ambiguous Rules


    When you create a resource that can be accessed concurrently, you should define which lock will control that access.
    In the case of scull, the design decision taken was to require all functions invoked directly from system calls to acquire the semaphore applying to the device structure that is accessed.

    Lock Ordering Rules


    Try to avoid situations where you need more than one lock.

    Fine- Versus Coarse-Grained Locking


    The first SMP-capable Linux kernels contained exactly one spinlock. The big kernel lock turned the entire kernel into one large critical section; only one CPU could be executing kernel code at any given time.
    A modern kernel can contain thousands of locks, each protecting one small resource. This sort of fine-grained locking can be good for scalability; it allows each processor to work on its specific task without contending for locks used by other processors.

    Locking in a device driver is usually relatively straightforward; you can have a single lock that covers everything you do, or you can create one lock for every device instance you manage.

    Alternatives to Locking


    Lock-Free Algorithms


    The circular buffer algorithm involves a producer placing data into one end of an array, while the consumer removes data from the other. So a circular buffer requires an array and two index values to track where the next new value goes and which value should be removed from the buffer next. The producer is the only thread that is allowed to modify the write index and the array location it points to.

    There is a generic circular buffer implementation available in the kernel; see <linux/kfifo.h> for information on how to use it.

    Atomic Variables


    In general, safe access to a global variable is ensured by using atomic operations.
    Atomic operations provide instructions that execute atomically without interruption.
    The kernel provides two sets of interfaces for atomic operations: one that operates on integers and another that operates on individual bits.

    Atomic Integer Operations

    The atomic integer methods operate on an atomic integer type called atomic_t, defined in <asm/atomic.h>.
    An atomic_t holds an int value on all supported architectures. Because of the way this type works on some processors, however, the full integer range may not be available; thus, you should not count on an atomic_t holding more than 24 bits.
    
    atomic_t v;                   /* define v */
    atomic_t u = ATOMIC_INIT(0);     /* define u and initialize it to zero */
    
    atomic_set(&v, 4);     /* v = 4 (atomically) */
    atomic_add(2, &v);     /* v = v + 2 = 6 (atomically) */
    atomic_inc(&v);        /* v = v + 1 = 7 (atomically) */
    
    printk("%d\n", atomic_read(&v)); /* will print "7" */
    
    atomic_t data items must be accessed only through these functions. If you pass an atomic item to a function that expects an integer argument, you’ll get a compiler error.

    Atomic Bitwise Operations

    In addition to atomic integer operations, the kernel also provides a family of functions that operate at the bit level. Not surprisingly, they are architecture specific and defined in <asm/bitops.h>.
    The operations work on generic memory addresses: the arguments are a bit index and a pointer to the word whose bit is to be operated on.
    
    unsigned long word = 0;
    
    set_bit(0, &word);       /* bit zero is now set (atomically) */
    set_bit(1, &word);       /* bit one is now set (atomically) */
    printk("%lu\n", word);   /* will print "3" */
    clear_bit(1, &word);     /* bit one is now unset (atomically) */
    change_bit(0, &word);    /* bit zero is flipped; now it is unset (atomically) */
    
    /* atomically sets bit zero and returns the previous value (zero) */
    if (test_and_set_bit(0, &word)) {
            /* never true     */
    }
    
    /* the following is legal; you can mix atomic bit instructions with normal C */
    word = 7;
    

    seqlocks


    Read-Copy-Update




    Chapter 6: Advanced Char Driver Operations


    ioctl


    In user space, the ioctl system call has the following prototype:
    
     int ioctl(int fd, unsigned long cmd, ...);
    
    the dots in the prototype represent a single optional argument, traditionally identified as char *argp.
    Using a pointer in the 3rd argument is the way to pass arbitrary data to the ioctl call; the device is then able to exchange any amount of data with user space.
    Each ioctl command is, essentially, a separate, usually undocumented system call.

    The ioctl driver method has this prototype (in modern kernels it has been replaced in struct file_operations by unlocked_ioctl(struct file *, unsigned int, unsigned long), which drops the inode argument):
    
    int (*ioctl) (struct inode *inode, struct file *filp,
                       unsigned int cmd, unsigned long arg)
    
    The inode and filp pointers are the values corresponding to the file descriptor fd passed on by the application.
    The cmd argument is passed from the user unchanged, and the optional arg argument is passed in the form of an unsigned long. If the invoking program doesn’t pass a third argument, the arg value received by the driver operation is undefined.

    Choosing the ioctl Commands


    The ioctl command numbers should be unique across the system in order to prevent errors caused by issuing the right command to the wrong device. If each ioctl number is unique, the application gets an EINVAL error rather than succeeding in doing something unintended.
    To help programmers create unique ioctl command codes, these codes have been split up into several bitfields:
    • type
    • (magic number). Just choose one number (after consulting ioctl-number.txt) and use it throughout the driver. This field is eight bits wide (_IOC_TYPEBITS).
    • number
    • The ordinal (sequential) number. It’s eight bits (_IOC_NRBITS) wide.
    • direction
    • The direction of data transfer, _IOC_NONE (no data transfer), _IOC_READ, _IOC_WRITE, and _IOC_READ|_IOC_WRITE (data is transferred both ways). The data transfer is seen from the application’s point of view; _IOC_READ means reading from the device, so the driver must write to user space.
    • size
    • The size of user data involved. The width of this field is architecture dependent (_IOC_SIZEBITS).
    You should first check include/asm/ioctl.h and Documentation/ioctl-number.txt for ioctl commands that are already defined.
    The header file <asm/ioctl.h>, which is included by <linux/ioctl.h>, defines macros that help set up the ioctl command numbers:

    
    #define _IO(type,nr)  _IOC(_IOC_NONE,(type),(nr),0)
    #define _IOR(type,nr,size) _IOC(_IOC_READ,(type),(nr),(_IOC_TYPECHECK(size)))
    #define _IOW(type,nr,size) _IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
    #define _IOWR(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
    
    • _IO(type,nr)
    • (for a command that has no argument)
    • _IOR(type,nr,datatype)
    • userland is reading and kernel is writing(for applications to read data from the driver)
    • _IOW(type,nr,datatype)
    • userland is writing and kernel is reading(for application to write data to the driver)
    • _IOWR(type,nr,datatype)
    • (for bidirectional transfers).
    The type and number fields are passed as arguments, and the size field is derived by applying sizeof to the datatype argument.
    Macros to decode the numbers: _IOC_DIR(nr), _IOC_TYPE(nr), _IOC_NR(nr), and _IOC_SIZE(nr).
    Here is how some ioctl commands are defined in scull:
    
    #define SCULL_IOC_MAGIC  'k'
    
    #define SCULL_IOCRESET    _IO(SCULL_IOC_MAGIC, 0)
    #define SCULL_IOCSQUANTUM _IOW(SCULL_IOC_MAGIC,  1, int)
    .
    #define SCULL_IOCGQSET    _IOR(SCULL_IOC_MAGIC,  6, int)
    .
    #define SCULL_IOCXQSET    _IOWR(SCULL_IOC_MAGIC,10, int)
    .
    #define SCULL_IOC_MAXNR 14
    

    The Return Value


    It’s pretty common to return -EINVAL in response to an invalid ioctl command.

    The Predefined Commands


    A few commands are recognized by the kernel. Note that these commands, when applied to your device, are decoded before your own file operations are called.
    Device driver writers are interested only in the group of commands issued on any file, whose magic number is “T.”
    The following ioctl commands are predefined for any file, including device-special files:
    • FIOCLEX
    • Set the close-on-exec flag (File IOctl CLose on EXec).
    • FIONCLEX
    • Clear the close-on-exec flag (File IOctl Not CLose on EXec).
    • FIOASYNC
    • Set or reset asynchronous notification for the file.
    • FIOQSIZE
    • This command returns the size of a file or directory; when applied to a device file, however, it yields an ENOTTY error return.
    • FIONBIO
    • “File IOctl Non-Blocking I/O”. This call modifies the O_NONBLOCK flag in filp->f_flags.

    Using the ioctl Argument


    If the extra argument is an integer, it’s easy: it can be used directly. If it is a pointer, however, some care must be taken.
    When a pointer is used to refer to user space, we must ensure that the user address is valid.
    To start, address verification (without transferring data) is implemented by the function access_ok, which is declared in <asm/uaccess.h>:
    
         int access_ok(int type, const void *addr, unsigned long size);
    
    where:
    • type
    • VERIFY_READ(reading the user-space memory area) or VERIFY_WRITE(writing the user-space memory area) which is a superset of VERIFY_READ.
    • addr
    • a user-space address
    • size
    • a byte count
    access_ok() checks the address interval [ addr , addr + size – 1 ] then returns a boolean value: 1 for success (access is OK) and 0 for failure (access is not OK). (In kernels since 5.0, the type argument has been dropped and the prototype is access_ok(addr, size).)
    The scull source exploits the bitfields in the ioctl number to check the arguments before the switch:
    
    int err = 0, tmp;
    int retval = 0;
    /*
    * extract the type and number bitfields, and don't decode
    * wrong cmds: return ENOTTY (inappropriate ioctl) before access_ok() */
    if (_IOC_TYPE(cmd) != SCULL_IOC_MAGIC) return -ENOTTY;
    if (_IOC_NR(cmd) > SCULL_IOC_MAXNR) return -ENOTTY;
    /*
     * the direction is a bitmask, and VERIFY_WRITE catches R/W
     * transfers. `Type' is user-oriented, while
     * access_ok is kernel-oriented, so the concept of "read" and
     * "write" is reversed
     */
    if (_IOC_DIR(cmd) & _IOC_READ)
         err = !access_ok(VERIFY_WRITE, (void __user *)arg, _IOC_SIZE(cmd));
    else if (_IOC_DIR(cmd) & _IOC_WRITE)
         err = !access_ok(VERIFY_READ, (void __user *)arg, _IOC_SIZE(cmd));
    if (err) return -EFAULT;
    
    The main single-value transfer routines are defined in <asm-generic/uaccess.h>. They automatically use the right size if we just have the right pointer type. This generic version falls back to copy_{from,to}_user, which should provide a fast path for small values. They are relatively fast and should be called instead of copy_to_user whenever single values are being transferred.
    
    #define __put_user(x, ptr) \
    ({        \
     __typeof__(*(ptr)) __x = (x);    \
     int __pu_err = -EFAULT;     \
            __chk_user_ptr(ptr);                                    \
     switch (sizeof (*(ptr))) {    \
     case 1:       \
     case 2:       \
     case 4:       \
     case 8:       \
      __pu_err = __put_user_fn(sizeof (*(ptr)), \
          ptr, &__x);  \
      break;      \
     default:      \
      __put_user_bad();    \
      break;      \
      }       \
     __pu_err;      \
    })
    
    #define put_user(x, ptr)     \
    ({        \
     void __user *__p = (ptr);    \
     might_fault();      \
     access_ok(VERIFY_WRITE, __p, sizeof(*ptr)) ?  \
      __put_user((x), ((__typeof__(*(ptr)) __user *)__p)) : \
      -EFAULT;     \
    })
    
    • put_user() checks to ensure that the process is able to write to the given memory address.
    • __put_user() should only be used if the memory region has already been verified with access_ok.
    • get_user(local, ptr) and __get_user(local, ptr) behave like put_user and __put_user, but transfer data in the opposite direction.

    Capabilities and Restricted Operations


    For the purpose of performing permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero).
    Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to full permission checking based on the process's credentials (usually: effective UID, effective GID, and supplementary group list).
    Starting with kernel 2.2, Linux divides the privileges into distinct units, known as capabilities, which can be independently enabled and disabled. Capabilities are a per-thread attribute. In this way, a particular user (or program) can be empowered to perform a specific privileged operation without giving away the ability to perform other unrelated operations. There are two system calls used to allow capabilities to be managed from user space: capget and capset.
    The full set of capabilities can be found in <linux/capability.h>.
    Capability checks are performed with the capable() function (defined in <linux/sched.h>):
    
          int capable(int capability);
    

    Many system administration operations are bound to the CAP_SYS_ADMIN capability. You can check whether the calling process holds it:
    
          if (! capable(CAP_SYS_ADMIN) )
                 return -EPERM;
    

    The Implementation of the ioctl Commands


    The scull implementation of ioctl only transfers the configurable parameters of the device:
    
    switch(cmd) {
      case SCULL_IOCRESET:
        scull_quantum = SCULL_QUANTUM;
        scull_qset = SCULL_QSET;
        break;
      case SCULL_IOCSQUANTUM: /* Set: arg points to the value */
        if (! capable (CAP_SYS_ADMIN))
           return -EPERM;
    retval = __get_user(scull_quantum, (int __user *)arg);
    break;
      case SCULL_IOCTQUANTUM: /* Tell: arg is the value */
        if (! capable (CAP_SYS_ADMIN))
            return -EPERM;
        scull_quantum = arg;
        break;
      case SCULL_IOCGQUANTUM: /* Get: arg is pointer to result */
        retval = __put_user(scull_quantum, (int __user *)arg);
        break;
      case SCULL_IOCQQUANTUM: /* Query: return it (it's positive) */
        return scull_quantum;
      case SCULL_IOCXQUANTUM: /* eXchange: use arg as pointer */
        if (! capable (CAP_SYS_ADMIN))
           return -EPERM;
        tmp = scull_quantum;
    retval = __get_user(scull_quantum, (int __user *)arg);
    if (retval == 0)
        retval = __put_user(tmp, (int __user *)arg);
    break;
      case SCULL_IOCHQUANTUM: /* sHift: like Tell + Query */
        if (! capable (CAP_SYS_ADMIN))
           return -EPERM;
        tmp = scull_quantum;
        scull_quantum = arg;
        return tmp;
      default:  /* redundant, as cmd was checked against MAXNR */
             return -ENOTTY;
     }
    return retval;
    
    The corresponding ways to pass and receive arguments from the caller’s point of view (i.e., from user space):
    
        ioctl(fd,SCULL_IOCSQUANTUM, &quantum);
        ioctl(fd,SCULL_IOCTQUANTUM, quantum);
        ioctl(fd,SCULL_IOCGQUANTUM, &quantum);
        quantum = ioctl(fd,SCULL_IOCQQUANTUM);
        ioctl(fd,SCULL_IOCXQUANTUM, &quantum);
        quantum = ioctl(fd,SCULL_IOCHQUANTUM, quantum);
    
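    Command numbers such as SCULL_IOCSQUANTUM are built with the kernel's _IO/_IOR/_IOW macros, which pack a magic number, an ordinal, a direction, and an argument size into one integer. The sketch below uses a hypothetical magic number 'k' (an assumption, not scull's actual magic) to show how a command is encoded and how the driver side can decode its fields:
    
    ```c
    #include <stdio.h>
    #include <assert.h>
    #include <linux/ioctl.h>   /* _IO/_IOR/_IOW and the _IOC_* decode macros */

    /* Hypothetical scull-style commands; the magic number 'k' and the
     * command ordinals are illustrative only. */
    #define DEMO_IOCSQUANTUM _IOW('k', 1, int)  /* Set: arg points to the value */
    #define DEMO_IOCGQUANTUM _IOR('k', 2, int)  /* Get: arg points to the result */
    #define DEMO_IOCTQUANTUM _IO('k', 3)        /* Tell: arg is the value itself */

    int main(void)
    {
        unsigned int cmd = DEMO_IOCSQUANTUM;

        /* A driver can decode the packed fields back out of cmd: */
        assert(_IOC_TYPE(cmd) == 'k');          /* magic number */
        assert(_IOC_NR(cmd) == 1);              /* ordinal */
        assert(_IOC_SIZE(cmd) == sizeof(int));  /* argument size */
        assert(_IOC_DIR(cmd) == _IOC_WRITE);    /* user writes to driver */

        printf("SQUANTUM=%#x GQUANTUM=%#x TQUANTUM=%#x\n",
               DEMO_IOCSQUANTUM, DEMO_IOCGQUANTUM, DEMO_IOCTQUANTUM);
        printf("decode ok\n");
        return 0;
    }
    ```
    
    Encoding the argument size into the command is what lets a driver validate user pointers with access_ok() before using __get_user()/__put_user() as scull does.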

    Device Control Without ioctl


    Controlling by write is definitely the way to go for those devices that don’t transfer data but just respond to commands, such as robotic devices.

    For instance, if the “device” is simply a pair of stepper motors, the driver can interpret what is being written as ASCII commands and convert the requests into sequences of impulses that drive the motors.

    Blocking I/O


    This section shows how to put a process to sleep and wake it up again later on.

    Introduction to Sleeping


    When a process is put to sleep, it is marked as being in a special state and removed from the scheduler’s run queue.
    Never sleep when you are running in an atomic context such as holding a spinlock.
    Making it possible for a sleeping process to be found later is accomplished through a data structure called a wait queue. A wait queue represents a set of sleeping processes, which are woken up by the kernel when some condition becomes true.
    Wait queues are implemented as doubly linked lists whose elements include pointers to process descriptors. The lists are protected from concurrent access by the spinlock_t lock in the wait queue head.
    Each wait queue is identified by a wait queue head, a data structure of type wait_queue_head_t defined in <linux/wait.h>:
    
    struct wait_queue_entry {
    	unsigned int		flags;
    	void			*private;
    	wait_queue_func_t	func;
    	struct list_head	entry;
    };
    
    struct wait_queue_head {
     spinlock_t  lock;
     struct list_head head;
    };
    typedef struct wait_queue_head wait_queue_head_t;
    
    /*
     * Macros for declaration and initialisation of the datatypes
     */
    
    #define __WAITQUEUE_INITIALIZER(name, tsk) {     \
     .private = tsk,       \
     .func  = default_wake_function,    \
     .entry  = { NULL, NULL } }
    
    #define DECLARE_WAITQUEUE(name, tsk)      \
     struct wait_queue_entry name = __WAITQUEUE_INITIALIZER(name, tsk)
    
    #define __WAIT_QUEUE_HEAD_INITIALIZER(name) {     \
     .lock  = __SPIN_LOCK_UNLOCKED(name.lock),   \
     .head  = { &(name).head, &(name).head } }
    
    #define DECLARE_WAIT_QUEUE_HEAD(name) \
     struct wait_queue_head name = __WAIT_QUEUE_HEAD_INITIALIZER(name)
    
    extern void __init_waitqueue_head(struct wait_queue_head *wq_head, const char *name, struct lock_class_key *);
    
    #define init_waitqueue_head(wq_head)      \
     do {         \
      static struct lock_class_key __key;    \
              \
      __init_waitqueue_head((wq_head), #wq_head, &__key);  \
     } while (0)
    
    

    A wait queue head can be initialized
    • statically
    • 
           DECLARE_WAIT_QUEUE_HEAD(my_queue);
      
    • dynamically
    • 
           wait_queue_head_t my_queue;
           init_waitqueue_head(&my_queue);
      
      

    Simple Sleeping


    The simplest way of sleeping in the Linux kernel is a macro called wait_event(). It combines handling the details of sleeping with a check on the condition a process is waiting for.
    
         wait_event(queue, condition)
         wait_event_interruptible(queue, condition)
         wait_event_timeout(queue, condition, timeout)
         wait_event_interruptible_timeout(queue, condition, timeout)
    
    The condition is an arbitrary boolean expression that is evaluated by the macro.
    With wait_event(), the process sleeps uninterruptibly (TASK_UNINTERRUPTIBLE) until the condition evaluates to true; the _interruptible variants sleep in TASK_INTERRUPTIBLE and can also be woken by a signal. The condition is re-checked each time the wait queue is woken up.

    The preferred alternative is wait_event_interruptible(), which can be interrupted by signals. This version returns an integer value that you should check: a nonzero value means the sleep was interrupted by some sort of signal, and your driver should probably return -ERESTARTSYS. A return value of zero means the process was woken up with the condition true.
    On the other side, the basic function that wakes up sleeping processes is called wake_up.

    
         void wake_up(wait_queue_head_t *queue);
         void wake_up_interruptible(wait_queue_head_t *queue);
    
    wake_up() wakes up all processes waiting on the given queue.

    A sample module called sleepy which implements a device with simple behavior:
    • Any process that attempts to read from the device is put to sleep.
    • Whenever a process writes to the device, all sleeping processes are awakened.
    
         static DECLARE_WAIT_QUEUE_HEAD(wq);
         static int flag = 0;
    
         ssize_t sleepy_read (struct file *filp, char __user *buf, size_t count, loff_t *pos) {
             printk(KERN_DEBUG "process %i (%s) going to sleep\n", current->pid, current->comm);
             wait_event_interruptible(wq, flag != 0);
             flag = 0;
             printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
             return 0; /* EOF */
         }
    
         ssize_t sleepy_write (struct file *filp, const char __user *buf, size_t count, loff_t *pos)
         {
             printk(KERN_DEBUG "process %i (%s) awakening the readers...\n", current->pid, current->comm);
             flag = 1;
             wake_up_interruptible(&wq);
             return count; /* succeed, to avoid retrial */
         }
    
    


    Blocking and Nonblocking Operations


    Explicitly nonblocking I/O is indicated by the O_NONBLOCK flag in filp->f_flags. O_NDELAY is an alternate name for O_NONBLOCK, accepted for compatibility with System V code.
    The blocking operation is the default standard semantics:
    • If a process calls read() but no data is (yet) available, the process must block. The process is awakened as soon as some data arrives, and that data is returned to the caller, even if there is less than the amount requested in the count argument to the method.
    • The input buffer is required to avoid losing data that arrives when nobody is reading.
    • If a process calls write() and there is no space in the buffer, the process must block, and it must be on a different wait queue from the one used for reading. When some data has been written to the hardware device, and space becomes free in the output buffer, the process is awakened and the write() call succeeds, although the data may be only partially written if there isn’t room in the buffer for the count bytes that were requested.
    • The output buffer is useful for getting better performance out of the hardware.
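    The effect of O_NONBLOCK can be observed from user space with an ordinary pipe, which follows the same semantics as a char device: reading an empty pipe blocks by default, but fails immediately with EAGAIN once the flag is set. A minimal sketch:
    
    ```c
    #include <stdio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        char c;

        if (pipe(fds) == -1) {
            perror("pipe");
            return 1;
        }

        /* Switch the read end to nonblocking mode via fcntl(),
         * just as a user program would for a char device. */
        int flags = fcntl(fds[0], F_GETFL);
        fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

        /* The pipe is empty: a blocking read would sleep here.
         * With O_NONBLOCK the call fails immediately instead. */
        ssize_t n = read(fds[0], &c, 1);
        if (n == -1 && errno == EAGAIN)
            printf("read would block: EAGAIN\n");

        close(fds[0]);
        close(fds[1]);
        return 0;
    }
    ```
    
    This is exactly the behavior a driver implements by checking filp->f_flags & O_NONBLOCK and returning -EAGAIN instead of sleeping.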

    A Blocking I/O Example


    This example is taken from the scullpipe driver; it is a special form of scull that implements a pipe-like device.
    This driver can be run without requiring any particular hardware or an interrupt handler:
    • writing process wakes the reading process
    • reading processes are used to wake writer processes that are waiting for buffer space to become available
    The device driver uses a device structure that contains two wait queues and a buffer.
    
    struct scull_pipe {
      wait_queue_head_t inq, outq;
      char *buffer, *end;
      int buffersize;
      char *rp, *wp;
      int nreaders, nwriters;
      struct fasync_struct *async_queue; /* asynchronous readers */
      struct semaphore sem;              /* mutual exclusion semaphore */
      struct cdev cdev;      
    };
     
    

    The read implementation manages both blocking and nonblocking input and looks like this:
    
    static ssize_t scull_p_read (struct file *filp, char __user *buf, size_t count, loff_t *f_pos)
    {
        struct scull_pipe *dev = filp->private_data;
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    
        while (dev->rp == dev->wp) { /* nothing to read */ 
            up(&dev->sem); /* release the lock */
            if (filp->f_flags & O_NONBLOCK)
                return -EAGAIN;
            PDEBUG("\"%s\" reading: going to sleep\n", current->comm);
            if (wait_event_interruptible(dev->inq, (dev->rp != dev->wp)))
                return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
            /* otherwise loop, but first reacquire the lock */
            if (down_interruptible(&dev->sem))
                return -ERESTARTSYS;
        }
    
        /* ok, data is there, return something */
        if (dev->wp > dev->rp)
            count = min(count, (size_t)(dev->wp - dev->rp));
        else /* the write pointer has wrapped, return data up to dev->end */
            count = min(count, (size_t)(dev->end - dev->rp));
        if (copy_to_user(buf, dev->rp, count)) {
            up(&dev->sem);
            return -EFAULT;
        }
        dev->rp += count;
        if (dev->rp == dev->end)
            dev->rp = dev->buffer; /* wrapped */
        up (&dev->sem);
        /* finally, awake any writers and return */
        wake_up_interruptible(&dev->outq);
        PDEBUG("\"%s\" did read %li bytes\n",current->comm, (long)count);
        return count;
    }
    
    When we exit from the while loop, we know that the semaphore is held and the buffer contains data that we can use.
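    The rp/wp arithmetic above can be exercised on its own in user space. The sketch below reproduces the same circular-buffer read logic (with memcpy standing in for copy_to_user; all names are illustrative, not part of the real driver):
    
    ```c
    #include <stdio.h>
    #include <string.h>

    #define BUFSIZE 8

    /* Minimal stand-ins for the scull_pipe buffer fields. */
    static char buffer[BUFSIZE];
    static char *end = buffer + BUFSIZE;
    static char *rp = buffer, *wp = buffer;

    /* Same logic as scull_p_read: read up to count bytes, stopping at
     * wp, or at the physical end of the buffer in the wrapped case. */
    static size_t demo_read(char *dst, size_t count)
    {
        if (rp == wp)
            return 0; /* empty: the real driver would sleep here */
        if (wp > rp)
            count = count < (size_t)(wp - rp) ? count : (size_t)(wp - rp);
        else /* write pointer has wrapped: return data up to end */
            count = count < (size_t)(end - rp) ? count : (size_t)(end - rp);
        memcpy(dst, rp, count);   /* copy_to_user() in the driver */
        rp += count;
        if (rp == end)
            rp = buffer;          /* wrapped */
        return count;
    }

    int main(void)
    {
        char out[BUFSIZE + 1];

        /* Simulate a writer that wrapped: wp is behind rp. */
        memcpy(buffer, "ABCDEFGH", BUFSIZE);
        rp = buffer + 6;  /* two bytes ("GH") before the physical end */
        wp = buffer + 3;  /* three more ("ABC") after the wrap */

        size_t n = demo_read(out, sizeof(out) - 1);
        out[n] = '\0';
        printf("first chunk: %s\n", out);   /* stops at the buffer end */

        n = demo_read(out, sizeof(out) - 1);
        out[n] = '\0';
        printf("second chunk: %s\n", out);  /* rest after the wrap */
        return 0;
    }
    ```
    
    Note that, like the driver, a single read stops at the physical end of the buffer; the caller gets the remaining data on the next call. This is why returning less than count is legitimate.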

    Task Structure and Process Table

    Every process under Linux is dynamically allocated a struct task_struct structure.
    
    /* Used in tsk->state: */
    #define TASK_RUNNING			0x0000
    #define TASK_INTERRUPTIBLE		0x0001
    #define TASK_UNINTERRUPTIBLE		0x0002
    #define __TASK_STOPPED			0x0004
    #define __TASK_TRACED			0x0008
    
    struct task_struct {
      ...
      volatile long			state;
      ...
    };
    
    Under Linux, there are three kinds of processes:
    • the idle thread(s)
    • The idle thread is created at compile time for the first CPU; it is then "manually" created for each CPU by means of arch-specific fork_by_hand()
    • kernel threads
    • Kernel threads are created using the kernel_thread() function, which invokes the clone(2) system call in kernel mode.
      Kernel threads usually have no user address space, i.e. p->mm == NULL; they can always access kernel address space directly. They are allocated pid numbers in the low range.
    • user tasks
    • User tasks are created by means of clone(2) or fork(2) system calls, both of which internally invoke kernel/fork.c: do_fork().

    Advanced Sleeping


    This section looks at what is really going on when a process sleeps.

    How a process sleeps


    Which process has access to a system’s CPU(s) at a given time is decided by the scheduler.
    The job of a scheduler is to arbitrate access to the current CPU between multiple processes.
    Linux triggers the scheduler by using a timer interrupt. The amount of CPU time a process is allowed to run before it is preempted is called its timeslice.
    Schedulers typically use some type of process queue to manage the execution of processes on the system. In Linux, this process queue is called the run queue.
    In Linux, the run queue is composed of two priority arrays:
    • Active
    • Store processes that have not yet used up their timeslice
    • Expired
    • Store processes that have used up their timeslice
    From a high level, the scheduler’s job in Linux is to take the highest priority active processes, let them use the CPU to execute, and place them in the expired array when they use up their timeslice.

    When a timer event occurs, the current process is put on hold and the Linux kernel itself takes control of the CPU.
    When the timer event finishes, the Linux kernel normally passes control back to the process that was put on hold. However, when the held process has been marked as needing rescheduling, the kernel calls schedule() to choose which process to activate instead of the process that was executing before the kernel took control. The process that was executing before the kernel took control is called the current process.
    Whenever you call schedule(), you are telling the kernel to consider which process should be running and to switch control to that process if necessary. So you never know how long it will be before schedule returns to your code.

    There are several task states defined in <linux/sched.h>, two states are associated with sleeping :
    • TASK_RUNNING
    • the process is able to run
    • TASK_INTERRUPTIBLE
    • the process is asleep and can be woken by a signal
    • TASK_UNINTERRUPTIBLE
    • the process is asleep and ignores signals
    Tasks that are sleeping (blocked) are in one of these 2 special non-runnable states.
    Both types of sleeping tasks sit on a wait queue, waiting for an event to occur, and are not runnable.
    This is important because without this special state, the scheduler would select tasks that did not want to run or, worse, sleeping would have to be implemented as busy looping.

    Sleeping is handled via wait queues. A wait queue is a simple list of processes waiting for an event to occur.
    Wait queues are represented in the kernel by wait_queue_head_t.
    
    struct list_head {
     struct list_head *next, *prev;
    };
    
    struct wait_queue_head {
     spinlock_t  lock;
     struct list_head head;
    };
    typedef struct wait_queue_head wait_queue_head_t;
    
    
    The wait_queue_entry structure contains information about the sleeping process and exactly how it would like to be woken up.
    Wait queue entries are created statically via DECLARE_WAITQUEUE(); wait queue heads are declared statically via DECLARE_WAIT_QUEUE_HEAD() or initialized dynamically via init_waitqueue_head().
    
    #define __WAITQUEUE_INITIALIZER(name, tsk) {     \
     .private = tsk,       \
     .func  = default_wake_function,    \
     .entry  = { NULL, NULL } }
    
    #define DECLARE_WAITQUEUE(name, tsk)      \
     struct wait_queue_entry name = __WAITQUEUE_INITIALIZER(name, tsk)
    
    
    For ex.,
    
    DECLARE_WAITQUEUE(my_wait, current)
    
    

    Manual sleeps


    The task performs the following steps to add itself to a wait queue:
    • the creation and initialization of a wait queue entry
    • 
      		DEFINE_WAIT(my_wait); // like DECLARE_WAITQUEUE(my_wait, current), but with autoremove_wake_function as the wake function
      	
      In include/linux/wait.h:
      
      #define DEFINE_WAIT_FUNC(name, function)     \
       struct wait_queue_entry name = {     \
        .private = current,     \
        .func  = function,     \
        .entry  = LIST_HEAD_INIT((name).entry),   \
       }
      
      #define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)
      	
    • Adds itself to a wait queue via add_wait_queue()
    • 
      void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
      	
      This wait queue awakens the process when the condition for which it is waiting occurs. Of course, there needs to be code elsewhere that calls wake_up() on the queue when the event actually does occur.
    • change the process state by prepare_to_wait()
    • 
      while (!condition) { /* condition is the event that we are waiting for */
          prepare_to_wait(&q, &my_wait, state);
          if (signal_pending(current)) /* handle signal */
              break;
          if (!condition) /* re-check after joining the queue, to avoid a missed wakeup */
              schedule();
      }
      	

      state should be TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.
      If the state is set to TASK_INTERRUPTIBLE, a signal wakes the process up. The call to signal_pending() tells us whether we were awakened by a signal; if so, we need to return to the user and let them try again later. Otherwise, we reacquire the semaphore, and test again. After calling prepare_to_wait(), the process can call schedule(). When the task awakens, it again checks whether the condition is true. If it is, it exits the loop. Otherwise, it again calls schedule() and repeats.

    • removes itself from the wait queue
    • The task sets itself to TASK_RUNNING and removes itself from the wait queue via finish_wait():
      
      finish_wait(&q, &my_wait);
      
      
    The example code that handles the actual sleep is:
    
    if (down_interruptible(&dev->sem))
         return -ERESTARTSYS;
    
    while (spacefree(dev) == 0) { /* full */ 
         DEFINE_WAIT(wait);
    
         up(&dev->sem);
         if (filp->f_flags & O_NONBLOCK)
             return -EAGAIN;
         PDEBUG("\"%s\" writing: going to sleep\n",current->comm); 
         prepare_to_wait(&dev->outq, &wait, TASK_INTERRUPTIBLE);
         if (spacefree(dev) == 0)
             schedule( ); 
         finish_wait(&dev->outq, &wait); 
         if ( signal_pending(current) )
             return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
         if (down_interruptible(&dev->sem))
             return -ERESTARTSYS;
    }
    up(&dev->sem);
    
    
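    The prepare_to_wait() / schedule() / finish_wait() discipline has a close user-space analogue in pthread condition variables: the mutex plays the role of dev->sem, pthread_cond_wait() atomically drops it while sleeping (like prepare_to_wait() plus schedule()), and the condition is always re-checked in a loop. This is only an analogy, not kernel code; all names below are illustrative:
    
    ```c
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER; /* plays the wait queue */
    static int space_free = 0;                             /* the condition */

    static void *writer(void *arg)
    {
        pthread_mutex_lock(&lock);
        while (!space_free) {          /* always re-check: wakeups can be spurious */
            /* Like prepare_to_wait() + schedule(): atomically releases the
             * lock and sleeps; reacquires the lock before returning. */
            pthread_cond_wait(&cond, &lock);
        }
        printf("writer: space available, writing\n");
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);

        sleep(1);                      /* let the writer block first */

        pthread_mutex_lock(&lock);
        space_free = 1;                /* make the condition true... */
        pthread_cond_broadcast(&cond); /* ...then wake, like wake_up_interruptible() */
        pthread_mutex_unlock(&lock);

        pthread_join(t, NULL);
        return 0;
    }
    ```
    
    The key shared property is that checking the condition and going to sleep must not leave a window in which a wakeup can be lost; the kernel achieves this with prepare_to_wait() ordering, pthreads with the atomic unlock-and-sleep of pthread_cond_wait().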

    Exclusive waits


    If the number of processes in the wait queue is large, the sudden awakening of many processes (the “thundering herd” problem) can seriously degrade the performance of the system.

    poll and select
