Linux Device Drivers -III


Content
  • 7. Time, Delays, and Deferred Work
  • 8. Allocating Memory
  • 9. Communicating with Hardware
  • 10. Interrupt Handling

Chapter 7: Time, Delays, and Deferred Work


Measuring Time Lapses


Timer interrupts are generated by the system’s timing hardware at regular intervals; this interval is programmed at boot time by the kernel according to the value of HZ.
HZ is an architecture-dependent value; on the popular x86 PC it defaults to 1000 in the kernels this chapter describes, and it is a build-time configuration choice (CONFIG_HZ) on current kernels.
Every time a timer interrupt occurs, the value of an internal kernel counter, jiffies, is incremented. The counter is initialized to 0 at system boot, so it represents the number of clock ticks since the last boot.

Using the jiffies Counter


#include <linux/jiffies.h>

unsigned long j, stamp_1, stamp_half, stamp_n;
j = jiffies;                 /* read the current value */
stamp_1 = j + HZ;            /* 1 second in the future */
stamp_half = j + HZ/2;       /* half a second */
stamp_n = j + n * HZ / 1000; /* n milliseconds */
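
The integer division in n * HZ / 1000 truncates, so a short delay can round down to zero ticks when HZ is small. The sketch below is plain user-space C, not kernel code; sketch_msecs_to_ticks, sketch_msecs_to_ticks_trunc, and the explicit hz parameter are hypothetical names used only to show the rounding-up arithmetic. In a driver you would more likely reach for the kernel's own msecs_to_jiffies() helper.

```c
/* User-space illustration only: convert milliseconds to clock ticks,
 * rounding up so that a nonzero delay never becomes zero ticks.
 * 'hz' stands in for the kernel's HZ constant (an assumption of this sketch). */
static unsigned long sketch_msecs_to_ticks(unsigned long msecs, unsigned long hz)
{
    return (msecs * hz + 999UL) / 1000UL;
}

/* The truncating form used in the listing above, for comparison. */
static unsigned long sketch_msecs_to_ticks_trunc(unsigned long msecs, unsigned long hz)
{
    return msecs * hz / 1000UL;
}
```

With hz set to 100, a 5 ms request becomes one tick in the rounding version and zero ticks in the truncating one.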

To compare jiffies samples safely, even when the counter wraps around, use these helpers (implemented as macros):

int time_after(unsigned long a, unsigned long b);
int time_before(unsigned long a, unsigned long b);
int time_after_eq(unsigned long a, unsigned long b);
int time_before_eq(unsigned long a, unsigned long b);
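
These helpers exist because a plain a > b comparison gives the wrong answer once the counter wraps. The following user-space sketch shows the usual trick of subtracting as unsigned and interpreting the result as signed. The sketch_ names are hypothetical, and the real kernel macros add type checking that is omitted here.

```c
#include <limits.h>

/* User-space sketch of the idea behind time_after()/time_before().
 * The sketch_ names are hypothetical; the kernel versions are macros
 * in <linux/jiffies.h>. */
static int sketch_time_after(unsigned long a, unsigned long b)
{
    /* The unsigned subtraction wraps modulo 2^N; reading the result as
     * signed tells us which sample is ahead, provided the two samples
     * are less than half the counter range apart. */
    return (long)(b - a) < 0;
}

static int sketch_time_before(unsigned long a, unsigned long b)
{
    return sketch_time_after(b, a);
}
```

A sample taken just after wraparound (a small value) is correctly reported as later than one taken just before it (a value near ULONG_MAX).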

The kernel exports four helper functions to convert time values expressed as jiffies to and from the struct timeval and struct timespec types used by user-space programs:

#include <linux/time.h>

unsigned long timespec_to_jiffies(struct timespec *value);
void jiffies_to_timespec(unsigned long jiffies, struct timespec *value);
unsigned long timeval_to_jiffies(struct timeval *value);
void jiffies_to_timeval(unsigned long jiffies, struct timeval *value);

To read the full 64-bit jiffies counter:

u64 get_jiffies_64(void);

Processor-Specific Registers


If you need extremely high precision, you can resort to platform-dependent resources such as the TSC (timestamp counter), introduced in x86 processors with the Pentium.

Knowing the Current Time


Looking at jiffies is almost always sufficient when you need to measure time intervals.
Dealing with real-world time is usually best left to user space, where the C library offers better support.
There is, however, a kernel function that turns a wall-clock time into a count of seconds since the Epoch (00:00:00 UTC, January 1, 1970):

#include <linux/time.h>

unsigned long mktime (unsigned int year, unsigned int mon,
                          unsigned int day, unsigned int hour,
                          unsigned int min, unsigned int sec);

Delaying Execution


In this chapter, we use the phrase “long delay” to refer to a multiple-jiffy delay.

Long Delays


Occasionally a driver needs to delay execution for relatively long periods—more than one clock tick.

Busy waiting


The easiest implementation is a loop that monitors the jiffy counter:

while ( time_before(jiffies, target_j) )
    cpu_relax( );

The call to cpu_relax() invokes an architecture-specific way of saying that you’re not doing much with the processor at the moment.
This busy loop severely degrades system performance. If you didn’t configure your kernel for preemptive operation, the loop completely locks the processor for the duration of the delay; the scheduler never preempts a process that is running in kernel space, and the computer looks completely dead until time target_j is reached.
This delay implementation is available in the jit module.
To test the busy-wait code, you can read /proc/jit/jitbusy, which busy-loops for one second for each line it returns.
The suggested command to read /proc/jit/jitbusy is:

 dd bs=20 count=3 < /proc/jit/jitbusy
Here count=3 optionally limits the number of 20-byte blocks that are read.
Each 20-byte line returned by the file represents the value the jiffy counter had before and after the delay. Delays are exactly one second (HZ jiffies) between each read.

Under Linux, user-space programs have always been preemptible: the kernel interrupts user-space programs to switch to other threads, using the regular clock tick. This means that an infinite loop in a user-space program cannot block the system.
If the kernel is not preemptible, however, an infinite loop in kernel code can block the entire system.
Kernel preemption was therefore introduced in the 2.6 kernels, and it can be enabled or disabled with the CONFIG_PREEMPT option. If CONFIG_PREEMPT is enabled, kernel code can be preempted almost everywhere, except in atomic sections such as code that has disabled local interrupts or holds a spinlock.
If you repeat the command on a preemptible kernel while running a program that forks 50 processes, the individual delays are far longer than one second, because the busy-looping process is preempted during its delay so that other processes can run.

Yielding the processor


Busy waiting imposes a heavy load on the system because it locks up the CPU. A better way is to explicitly release the CPU when we’re not interested in it.
This is accomplished by calling the schedule( ) function, declared in <linux/sched.h>:

while ( time_before(jiffies, j1) ) {
    schedule( );
}

This loop can be tested by reading /proc/jit/jitsched.
The current process does nothing but release the CPU, yet it remains in the run queue. If it is the only runnable process, it is rescheduled immediately and keeps executing for the whole delay, so the system is never truly idle. Moreover, once a process releases the processor with schedule(), there are no guarantees that it will get the processor back anytime soon. Under load, the delay associated with each output line is extended by a few seconds, because other processes are using the CPU when the timeout expires.

Sleeping with Timeouts


If your driver must wait for a condition, but for no longer than a bounded period of time, the better way is to ask the kernel to do it for you:

#include <linux/wait.h>

long wait_event_timeout(wait_queue_head_t q, condition, long timeout);
long wait_event_interruptible_timeout(wait_queue_head_t q, condition, long timeout);

Note that the timeout value represents the number of jiffies to wait.
  • If the timeout expires, the functions return 0.
  • If the process is awakened by another event, they return the remaining delay expressed in jiffies.
  • The remaining delay is never negative, even if the delay is greater than expected because of system load; wait_event_interruptible_timeout() can additionally return -ERESTARTSYS when a signal interrupts the sleep.
To simply sleep, with no event to wait for, pass 0 as the condition.
Reading /proc/jit/jitqueue shows that the resulting delays are near optimal.

In that case the wait queue head is not really used, so the kernel offers the schedule_timeout() function to do the same job without declaring one:

#include <linux/sched.h>

signed long schedule_timeout(signed long timeout);
schedule_timeout() makes the current task sleep until timeout jiffies have elapsed.
The return value is 0 unless the function returns before the given timeout has elapsed (for example, in response to a signal), in which case it is the number of jiffies remaining. The caller must set the current process state to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE before calling schedule_timeout(), so that the scheduler won’t run the current process again until the timeout places it back in the TASK_RUNNING state:

set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout(delay);

Short Delays


Hardware drivers often need delays much shorter than one clock tick, which the jiffies-based techniques above cannot express.
The kernel provides functions for short delays:

#include <linux/delay.h>
void ndelay(unsigned long nsecs);
void udelay(unsigned long usecs);
void mdelay(unsigned long msecs);

The actual implementations of the functions are in <asm/delay.h>, being architecture-specific, and sometimes build on an external function. Every architecture implements udelay.
It’s important to remember that the three short delay functions are busy-waiting; other tasks can’t be run during the time lapse.

There is another way of achieving millisecond (and longer) delays that does not involve busy waiting:

void msleep(unsigned int millisecs);
void ssleep(unsigned int seconds);

unsigned long msleep_interruptible(unsigned int millisecs);

Interrupts

An interrupt is an event that alters the normal execution flow of a program and can be generated by hardware devices or even by the CPU itself.
Interrupts can be grouped into two categories based on the source of the interrupt:
  • synchronous interrupts
    • generated by executing an instruction.
      Synchronous interrupts, usually named exceptions, handle conditions detected by the processor itself in the course of executing an instruction.
      Divide by zero or a system call are examples of exceptions.
      There are two sources for exceptions:
      • processor detected
        • faults
        • traps
        • aborts
      • programmed
        • int n
  • asynchronous interrupts
    • generated by an external event.
      Asynchronous interrupts, usually named interrupts, are external events generated by I/O devices.
      For example, a network card generates an interrupt to signal that a packet has arrived.
Interrupts can also be grouped by whether they can be postponed or temporarily disabled:
  • maskable interrupts
    • can be ignored
    • signalled via INT pin
  • non-maskable interrupts
    • cannot be ignored
    • signalled via NMI pin
    While an interrupt is handled (from the time the CPU jumps to the interrupt handler until the interrupt handler returns - e.g. IRET is issued) it is said that code runs in “interrupt context”.
    Code that runs in interrupt context has the following properties:
    • it runs as a result of an IRQ (not of an exception)
    • there is no well defined process context associated
    • not allowed to trigger a context switch (no sleep, schedule, or user memory access)

    Deferrable actions are used to run callback functions at a later time. If a deferrable action is scheduled from an interrupt handler, the associated callback function will run after the interrupt handler has completed.
    There are two large categories of deferrable actions:

    • those that run in interrupt context
    • those that run in process context
    Running for too long with interrupts disabled can have undesired effects such as increased latency or poor system performance due to missing other interrupts.
    That is why interrupt-context deferrable actions are used: they avoid doing too much work in the interrupt handler function itself.

    In Linux there are 3 types of deferrable actions:

    • softIRQ
      • runs in interrupt context
      • statically allocated
      • same handler may run in parallel on multiple cores
    • tasklet
      • runs in interrupt context
      • can be dynamically allocated
      • same handler runs are serialized
    • workqueues
      • run in process context

    Softirqs


    With the advent of parallelism in the Linux kernel, all newer schemes for implementing bottom-half handlers are built on per-processor kernel threads called ksoftirqd.
    The kernel provides a set of ksoftirqd kernel threads, one per CPU, just to run “soft interrupt” handlers.
    Each processor has its own thread named ksoftirqd/n, where n is the number of the processor.
    We can see them in the output of the systemd-cgls utility:
    
    $ systemd-cgls -k | grep ksoft
    ├─   9 [ksoftirqd/0]
    ├─  18 [ksoftirqd/1]
    ├─  24 [ksoftirqd/2]
    ├─  30 [ksoftirqd/3]
    
    systemd-cgls recursively shows the contents of the selected Linux control group hierarchy in a tree; the -k option includes kernel threads in the output.

    The spawn_ksoftirqd function starts these threads in early initcall:
    
    early_initcall(spawn_ksoftirqd);
    
    
    Softirqs are determined statically at compile time of the Linux kernel using the softirq_vec array.
    The softirq_vec array holds NR_SOFTIRQS (10) entries of type softirq_action, one per softirq type.
    
    static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
    
    enum
    {
            HI_SOFTIRQ=0,
            TIMER_SOFTIRQ,
            // two for networking
            NET_TX_SOFTIRQ,
            NET_RX_SOFTIRQ,
            // two for the block layer
            BLOCK_SOFTIRQ,
            BLOCK_IOPOLL_SOFTIRQ,
            TASKLET_SOFTIRQ,
            SCHED_SOFTIRQ,
            HRTIMER_SOFTIRQ,
            RCU_SOFTIRQ,
            NR_SOFTIRQS
    };
    
    const char * const softirq_to_name[NR_SOFTIRQS] = {
            "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
            "TASKLET", "SCHED", "HRTIMER", "RCU"
    };
    
    struct softirq_action
    {
             void    (*action)(struct softirq_action *);
    };
    
    The softirq_action structure consists of a single field: action, a pointer to the softirq handler function.
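
To make the relationship between the handler array and pending softirqs concrete, here is a small user-space model. It is only a sketch: every sketch_ name is hypothetical, the pending state is a single global bitmask rather than per-CPU data, and the loop is a simplification of what the kernel does. The names loosely mirror the kernel's open_softirq() and raise_softirq() interfaces.

```c
#include <stddef.h>

/* User-space model of softirq dispatch; every sketch_ name is hypothetical. */
enum { SKETCH_HI, SKETCH_TIMER, SKETCH_TASKLET, SKETCH_NR };

struct sketch_softirq_action {
    void (*action)(struct sketch_softirq_action *);
};

static struct sketch_softirq_action sketch_vec[SKETCH_NR];
static unsigned int sketch_pending;      /* one bit per softirq number */
static int sketch_log[8];                /* order in which handlers ran */
static size_t sketch_log_len;

/* Example handler: records which softirq number was serviced. */
static void sketch_record(struct sketch_softirq_action *a)
{
    /* Recover the softirq number from the slot's position in the array. */
    if (sketch_log_len < 8)
        sketch_log[sketch_log_len++] = (int)(a - sketch_vec);
}

/* Install a handler in the statically sized table. */
static void sketch_open_softirq(int nr, void (*action)(struct sketch_softirq_action *))
{
    sketch_vec[nr].action = action;
}

/* Mark a softirq as pending; nothing runs yet. */
static void sketch_raise_softirq(int nr)
{
    sketch_pending |= 1u << nr;
}

/* Run every pending handler once, lowest number (highest priority) first. */
static void sketch_do_softirq(void)
{
    unsigned int pending = sketch_pending;
    int nr;

    sketch_pending = 0;
    for (nr = 0; pending; nr++, pending >>= 1)
        if ((pending & 1u) && sketch_vec[nr].action)
            sketch_vec[nr].action(&sketch_vec[nr]);
}
```

Raising SKETCH_TASKLET and then SKETCH_HI before calling sketch_do_softirq() still runs the SKETCH_HI handler first, because dispatch follows the position in the array, not the order in which the softirqs were raised.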

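    The per-CPU counts for each softirq type are visible in /proc/softirqs (the sample below comes from a newer kernel, in which BLOCK_IOPOLL appears as IRQ_POLL):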
    
    $ cat /proc/softirqs
                        CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
              HI:      65208          6         13      60996          0          0          0          0
           TIMER:     899907     837399     927077     865850          0          0          0          0
          NET_TX:          2          3         65          7          0          0          0          0
          NET_RX:       1102       1174        914       6913          0          0          0          0
           BLOCK:      29141        344       8994      58669          0          0          0          0
        IRQ_POLL:          0          0          0          0          0          0          0          0
         TASKLET:        242         26       4900        166          0          0          0          0
           SCHED:    1002338     923765     999987     918435          0          0          0          0
         HRTIMER:          0          0        539          0          0          0          0          0
             RCU:     470947     468360     483743     481231          0          0          0          0
    
    In summary:
    • Each processor has its own ksoftirqd thread.
    • Each ksoftirqd thread is set up to handle the 10 softirq actions.
    • Softirq usage is restricted; softirqs are used by a handful of subsystems that have low-latency requirements.
    • Checks for pending deferred interrupts are performed periodically.
    • Each ksoftirqd kernel thread runs the ksoftirqd() function, which checks for pending deferred interrupts and, depending on the result, calls __do_softirq(). Softirqs therefore run either right after an interrupt handler or from the ksoftirqd kernel thread.
      There are several points where these checks occur. The main one is the do_IRQ() function defined in arch/x86/kernel/irq.c, which provides the main means for actual interrupt processing in the Linux kernel. When do_IRQ() finishes handling an interrupt, it calls the exiting_irq() function from arch/x86/include/asm/apic.h, which expands to a call to irq_exit(). irq_exit() checks for deferred interrupts and the current context and calls the invoke_softirq() function:
      
      if (!in_interrupt() && local_softirq_pending())
          invoke_softirq();
      

    Tasklets


    Tasklets are a dynamic type (not limited to a fixed number) of deferred work running in interrupt context.
    Tasklets are implemented on top of 2 dedicated softirqs: TASKLET_SOFTIRQ and HI_SOFTIRQ.
    The tasklet mechanism is mostly used in interrupt management.
    Tasklets resemble kernel timers in some ways:
    • they always run at interrupt time
    • they execute (in atomic mode) in the context of a “soft interrupt,” a kernel mechanism that executes asynchronous tasks with hardware interrupts enabled
    • they always run on the same CPU that schedules them
    • they receive an unsigned long argument
    They differ from kernel timers in other ways:
    • you can’t ask to execute the function at a specific time.
    • you simply ask for it to be executed at a later time chosen by the kernel.

    Tasklets offer a number of interesting features:
    • A tasklet can be disabled and re-enabled later; it won’t be executed until it is enabled as many times as it has been disabled.
    • Just like timers, a tasklet can reregister itself.
    • A tasklet can be scheduled to execute at normal priority or high priority.
    To use the tasklet:
    • initialization
      #include <linux/interrupt.h>
      struct tasklet_struct {
       struct tasklet_struct *next;
       unsigned long state;
       atomic_t count;
       void (*func)(unsigned long);
       unsigned long data;
      };
      
      void tasklet_init(struct tasklet_struct *t,
          void (*func)(unsigned long), unsigned long data);
      
      #define DECLARE_TASKLET(name, func, data) \
      struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }
      
      #define DECLARE_TASKLET_DISABLED(name, func, data) \
      struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(1), func, data }
      
      
    • activation
      • void tasklet_enable(struct tasklet_struct *t)
        • Enables a tasklet that had been previously disabled.
      • void tasklet_disable(struct tasklet_struct *t)
        • Disables the given tasklet. The tasklet may still be scheduled with tasklet_schedule(), but its execution is deferred until the tasklet has been enabled again. If the tasklet is currently running, this function busy-waits until the tasklet exits.
      • void tasklet_disable_nosync(struct tasklet_struct *t)
        • Disables the tasklet, but does not wait for a currently running function to exit.
      • void tasklet_schedule(struct tasklet_struct *t)
        • Schedules the tasklet for execution. If a tasklet is scheduled again before it has a chance to run, it runs only once. However, if it is scheduled while it runs, it runs again after it completes. This behavior also allows a tasklet to reschedule itself.
      • void tasklet_hi_schedule(struct tasklet_struct *t)
        • Schedules the tasklet for execution with higher priority. When the soft interrupt handler runs, it deals with high-priority tasklets before other soft interrupt tasks, including “normal” tasklets.
    • stop
      • void tasklet_kill(struct tasklet_struct *t)
        • Ensures that the tasklet is not scheduled to run again; it is usually called when a device is being closed or the module removed.
      • void tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu)
    To test tasklets, read /proc/jit/tasklet and /proc/jit/tasklethi.
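
Two of the behaviours listed above are easy to misread: scheduling twice before the tasklet runs yields a single execution, and a tasklet disabled N times must be enabled N times before it runs. The user-space model below illustrates both. It is a sketch under simplifying assumptions (single-threaded, no softirq machinery), and all sketch_ names are hypothetical, not kernel API.

```c
/* User-space model of tasklet scheduling and disable counting.
 * All sketch_ names are hypothetical; this is not the kernel implementation. */
struct sketch_tasklet {
    int scheduled;                 /* set by schedule, cleared when run */
    int disable_count;             /* 0 means enabled */
    void (*func)(unsigned long);
    unsigned long data;
};

static unsigned long sketch_runs;  /* accumulates 'data' on every execution */

static void sketch_handler(unsigned long data)
{
    sketch_runs += data;
}

static void sketch_tasklet_schedule(struct sketch_tasklet *t) { t->scheduled = 1; }
static void sketch_tasklet_disable(struct sketch_tasklet *t)  { t->disable_count++; }
static void sketch_tasklet_enable(struct sketch_tasklet *t)   { t->disable_count--; }

/* Stand-in for the softirq pass that runs pending tasklets. */
static void sketch_tasklet_action(struct sketch_tasklet *t)
{
    if (!t->scheduled || t->disable_count > 0)
        return;                    /* stays pending until fully re-enabled */
    t->scheduled = 0;
    t->func(t->data);
}
```

In this model, a tasklet that is scheduled while disabled remains pending and runs on the first pass after the last matching enable.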
    
    

    Work Queues


    Workqueues are a type of deferred work that runs in process context.
    They are implemented on top of kernel threads.
    When an asynchronous execution context is needed, a work item describing which function to execute is put on a queue. An independent thread serves as the asynchronous execution context. The queue is called work queue and the thread is called worker.
    The worker executes the functions associated with the work items one after the other:
    • When there is no work item left on the workqueue the worker becomes idle.
    • When a new work item gets queued, the worker begins executing again.
    A work item is a simple struct that holds a pointer to the function that is to be executed asynchronously.
    Whenever a driver or subsystem wants a function to be executed asynchronously it has to set up a work item pointing to that function and queue that work item on a workqueue.
    Special purpose threads, called worker threads, execute the functions off of the queue, one after the other.
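
The queue-and-worker relationship described above can be modelled in a few lines of user-space C. This is a sketch only: there is no real thread, the worker is a function that drains the queue and returns where a real worker would go idle, and all sketch_ names are hypothetical (the kernel's real types live in <linux/workqueue.h>).

```c
#include <stddef.h>

/* User-space model of a work queue drained by a single worker.
 * All sketch_ names are hypothetical. */
struct sketch_work {
    void (*func)(struct sketch_work *work);
    struct sketch_work *next;
};

struct sketch_workqueue {
    struct sketch_work *head, *tail;
};

/* Append a work item at the tail of the queue. */
static void sketch_queue_work(struct sketch_workqueue *wq, struct sketch_work *w)
{
    w->next = NULL;
    if (wq->tail)
        wq->tail->next = w;
    else
        wq->head = w;
    wq->tail = w;
}

/* One pass of the worker: run items in FIFO order until the queue is
 * empty, then return (the point where a real worker would go idle). */
static int sketch_worker_run(struct sketch_workqueue *wq)
{
    int executed = 0;

    while (wq->head) {
        struct sketch_work *w = wq->head;

        wq->head = w->next;
        if (!wq->head)
            wq->tail = NULL;
        w->func(w);                /* may itself queue more work */
        executed++;
    }
    return executed;
}

/* Example work item: embeds sketch_work so the handler can find its context. */
struct sketch_job {
    struct sketch_work work;       /* must stay first for the cast below */
    int id;
};

static int sketch_order[4];
static int sketch_order_len;

static void sketch_job_fn(struct sketch_work *work)
{
    struct sketch_job *job = (struct sketch_job *)work;

    if (sketch_order_len < 4)
        sketch_order[sketch_order_len++] = job->id;
}
```

Queuing two jobs and running the worker executes them in the order they were queued; a second run finds nothing to do.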

    Work queues allow kernel functions to be activated and later executed (much like deferrable functions) by special kernel threads called worker threads.

    The main difference between deferrable functions and work queues is that:
    • deferrable functions run in interrupt context
      • All code must be atomic.
    • functions in work queues run in process context
      • Running in process context is the only way to execute functions that can block (sleep).

    The main data structure associated with a work queue is the workqueue_struct descriptor, defined in kernel/workqueue.c:
    
    #include <linux/workqueue.h>
    
    struct workqueue_struct {
    ...
     char   name[WQ_NAME_LEN]; /* I: workqueue name */
    ...
    };
    
    EXPORT_SYMBOL(system_wq);
    EXPORT_SYMBOL_GPL(system_highpri_wq);
    EXPORT_SYMBOL_GPL(system_long_wq);
    EXPORT_SYMBOL_GPL(system_unbound_wq);
    EXPORT_SYMBOL_GPL(system_freezable_wq);
    EXPORT_SYMBOL_GPL(system_power_efficient_wq);
    EXPORT_SYMBOL_GPL(system_freezable_power_efficient_wq);
    
    

    A workqueue must be explicitly created before use.
    Each workqueue has one or more dedicated processes (“kernel threads”), which run functions submitted to the queue.

    Workqueue APIs


    • Allocate a workqueue
      
      struct workqueue_struct *alloc_workqueue(const char *fmt,
            unsigned int flags,
            int max_active, ...);
      
      • fmt
        • printf format for the name of the workqueue
      • flags
        • WQ_* flags
      • max_active
        • max in-flight work items, 0 for default
      • args...
        • args for fmt
      Allocate a workqueue with the specified parameters.


    Kernel Timers


    Kernel timers let you schedule an action, that is, the execution of a function, at a particular time in the future. A typical use is polling a device by checking its state at regular intervals.

    A kernel timer is a data structure that instructs the kernel to execute a user-defined function with a user-defined argument at a user-defined time.

    The functions scheduled to run almost certainly do not run while the process that registered them is executing. In fact, kernel timers are implemented on top of a softirq (TIMER_SOFTIRQ), so timer functions must be atomic.
    The rules for atomic contexts must be followed because you are outside of process context (i.e., in interrupt context):
    • Access to user space is NOT allowed.
    • The current pointer is not meaningful in atomic mode and cannot be used since the relevant code has no connection with the process that has been interrupted.
    • No sleeping or scheduling may be performed.
    • Atomic code may not call schedule() or a form of wait_event(), nor may it call any other function that could sleep. For example, calling kmalloc(..., GFP_KERNEL) is against the rules. Semaphores also must not be used since they can sleep.
    Kernel code can tell if it is running in interrupt context by calling the function in_interrupt( ), which takes no parameters and returns nonzero if the processor is currently running in interrupt context, either hardware interrupt or software interrupt.
    
    /*
     * Are we doing bottom half or hardware interrupt processing?
     *
     * in_irq()       - We're in (hard) IRQ context
     * in_softirq()   - We have BH disabled, or are processing softirqs
     * in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
     * in_serving_softirq() - We're in softirq context
     * in_nmi()       - We're in NMI context
     * in_task()	  - We're in task context
     *
     * Note: due to the BH disabled confusion: in_softirq(),in_interrupt() really
     *       should not be used in new code.
     */
    

    A kernel timer function runs asynchronously with respect to other code, so any data structures it accesses should be protected from concurrent access, either by being atomic types or by using spinlocks.

    The Timer API


    The kernel provides drivers with a number of functions to declare, register, and remove kernel timers.
    include/linux/timer.h:
    • struct timer_list
      struct timer_list {
       struct hlist_node	entry;
       unsigned long  expires;
       void   (*function)(struct timer_list *);
       u32  flags;
      #ifdef CONFIG_LOCKDEP
       struct lockdep_map	lockdep_map;
      #endif
      };
      
    • from_timer(var, callback_timer, timer_fieldname)
      #define from_timer(var, callback_timer, timer_fieldname) \
       container_of(callback_timer, typeof(*var), timer_fieldname)
      
    • timer initialization
    • Regular timer initialization should use either DEFINE_TIMER() or timer_setup().
      
      #define DEFINE_TIMER(_name, _function)				\
      	struct timer_list _name =				\
      		__TIMER_INITIALIZER(_function, 0)
      
      #define timer_setup(timer, callback, flags)			\
      	__init_timer((timer), (callback), (flags))
          
      
    • start a timer
    • The kernel will do a ->function() callback from the timer interrupt at the ->expires point in the future.
      
      void add_timer(struct timer_list * timer);
      
    • deactivates a timer
    • this works on both active and inactive timers.
      
      int del_timer(struct timer_list * timer);
      int del_timer_sync(struct timer_list *timer);
      
      Works like del_timer(), but also guarantees that when it returns, the timer function is not running on any CPU. This function should be preferred over del_timer() to avoid race conditions on SMP systems.
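
Because the callback receives only a pointer to the timer, it needs a way back to the structure that embeds it; from_timer() does this through container_of(). The user-space sketch below reproduces the pointer arithmetic with offsetof(). sketch_timer stands in for struct timer_list, and every sketch_ name is hypothetical.

```c
#include <stddef.h>

/* User-space sketch of the container_of() idea that from_timer() relies on.
 * All names here are hypothetical stand-ins, not kernel definitions. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_timer {
    unsigned long expires;
};

struct sketch_device {
    int irq_count;
    struct sketch_timer timer;     /* embedded, not a pointer */
};

/* A callback that only receives the embedded timer, as timer callbacks do. */
static int sketch_timer_fn(struct sketch_timer *t)
{
    struct sketch_device *dev =
        sketch_container_of(t, struct sketch_device, timer);

    return ++dev->irq_count;
}
```

The design choice is that the timer carries no separate data pointer: embedding the timer in the driver's own structure is enough to find the context again.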
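    The expires field holds the jiffies value at which the timer's function is expected to run; at that time, the kernel calls function.
    With the struct timer_list shown above, the callback receives a pointer to the timer itself, and from_timer() recovers the enclosing data structure. Older kernels (whose API the example below uses) instead passed a separate unsigned long data field to the callback, so you could bundle multiple values into a single data structure and pass a pointer to it cast to unsigned long.

    The example timer used to generate /proc/jitimer data is run every 10 jiffies by default: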

    • define the timer function and data used in the timer
      /* This data structure is used as "data" for the timer and tasklet functions. */
      struct jit_data {
       struct timer_list timer;
      ...
       wait_queue_head_t wait;
       unsigned long prevjiffies;
       unsigned char *buf;
       int loops;
      };
      
      void jit_timer_fn(unsigned long arg)
      {
       struct jit_data *data = (struct jit_data *) arg;
       unsigned long j = jiffies;
       data->buf += sprintf(data->buf, "%9li  %3li     %i    %6i   %i   %s\n",
              j, j - data->prevjiffies, in_interrupt() ? 1 : 0,
              current->pid, smp_processor_id(), current->comm);
      
       if (--data->loops) {
        data->timer.expires += tdelay;
        data->prevjiffies = j;
        add_timer(&data->timer);
       } else {
        wake_up_interruptible(&data->wait);
       }
      }
      
    • register the timer
      int jit_timer(char *buf, char **start, off_t offset,
             int len, int *eof, void *unused_data)
      {
       struct jit_data *data;
       char *buf2 = buf;
       unsigned long j = jiffies;
      
       data = kmalloc(sizeof(*data), GFP_KERNEL);
       if (!data)
        return -ENOMEM;
      
       init_timer(&data->timer);
       init_waitqueue_head(&data->wait);
      
       /* Write the first lines in the buffer. */
       buf2 += sprintf(buf2, "   time   delta  inirq    pid   cpu command\n");
       buf2 += sprintf(buf2, "%9li  %3li     %i    %6i   %i   %s\n",
         j, 0L, in_interrupt() ? 1 : 0,
         current->pid, smp_processor_id(), current->comm);
      
       /* fill the data for our timer function */
       data->prevjiffies = j;
       data->buf = buf2;
       data->loops = JIT_ASYNC_LOOPS;
      
       /* register the timer */
       data->timer.data = (unsigned long) data;
       data->timer.function = jit_timer_fn;
       data->timer.expires = j + tdelay; /* parameter */
       add_timer(&data->timer);
       /* wait for the buffer to fill */
       wait_event_interruptible(data->wait, !data->loops);
      
       if (signal_pending(current))
        return -ERESTARTSYS;
       buf2 = data->buf;
       kfree(data);
       *eof = 1;
       return buf2 - buf;
      }
      
    • remove timers
    • 
      
      

    The Implementation of Kernel Timers


    Whenever kernel code registers a timer, the operation is eventually performed by internal_add_timer() which, in turn, adds the new timer to a doubly linked list of timers within a “cascading table” associated with the current CPU.
    When __run_timers() is fired, it executes all pending timers for the current timer tick.
    Keep in mind, however, that a kernel timer is far from perfect, as it suffers from interruption and other artifacts induced by:
    • hardware interrupts
    • other timers
    • asynchronous tasks



    Chapter 8: Allocating Memory


    Memory mapping

    Overview

    In the Linux kernel it is possible to map a kernel address space to a user address space.
    This eliminates the overhead of copying user space information into the kernel space and vice versa.
    This can be done through a device driver and the user space device interface (/dev).
    This feature can be used by implementing the mmap() operation in the device driver's struct file_operations and using the mmap() system call in user space.

    The basic unit for virtual memory management is a page, whose size is usually 4K, but it can be up to 64K on some platforms.
    Whenever we work with virtual memory we work with two types of addresses:

    • virtual address
    • physical address
    All CPU access (including from kernel space) uses virtual addresses that are translated by the MMU into physical addresses with the help of page tables.

    A physical page of memory is identified by the Page Frame Number (PFN).
    The PFN can be easily computed from the physical address by dividing it by the size of the page (or by shifting the physical address right by PAGE_SHIFT bits).
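
As a worked example of that arithmetic, the user-space sketch below assumes 4 KiB pages (a shift of 12). The SKETCH_ constants and sketch_ functions are hypothetical stand-ins for the kernel's PAGE_SHIFT and PAGE_SIZE, whose actual values depend on the platform.

```c
/* User-space sketch; SKETCH_PAGE_SHIFT = 12 assumes 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Page frame number: the physical address with the in-page offset dropped. */
static unsigned long long sketch_phys_to_pfn(unsigned long long phys)
{
    return phys >> SKETCH_PAGE_SHIFT;
}

/* Physical address of the first byte of a page frame. */
static unsigned long long sketch_pfn_to_phys(unsigned long long pfn)
{
    return pfn << SKETCH_PAGE_SHIFT;
}

/* Offset of an address within its page. */
static unsigned long long sketch_page_offset(unsigned long long phys)
{
    return phys & (SKETCH_PAGE_SIZE - 1);
}
```

For the physical address 0x12345678, this gives PFN 0x12345 and an in-page offset of 0x678.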

    For efficiency reasons, the virtual address space is divided into user space and kernel space:
    • user space
    • kernel space
      • The kernel space contains a memory mapped zone, called lowmem, which is contiguously mapped in physical memory, starting from the lowest possible physical address (usually 0).
        The virtual address where lowmem is mapped is defined by PAGE_OFFSET.
        There is a separate zone in kernel space called highmem which can be used to arbitrarily map physical memory.
        • Memory allocated by kmalloc() resides in lowmem and it is physically contiguous.
        • Memory allocated by vmalloc() is not contiguous and does not reside in lowmem (it has a dedicated zone in highmem).
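
Because lowmem is mapped contiguously starting at PAGE_OFFSET, translating between a lowmem virtual address and its physical address is a constant offset. The user-space sketch below illustrates this, assuming the classic 32-bit x86 value 0xC0000000 for PAGE_OFFSET and physical memory starting at 0. Both are assumptions for illustration, and the sketch_ names are hypothetical; the translation does not apply to vmalloc() or highmem addresses.

```c
/* User-space sketch; SKETCH_PAGE_OFFSET is an assumed value (the classic
 * 3G/1G split on 32-bit x86), not something read from a real kernel. */
#define SKETCH_PAGE_OFFSET 0xC0000000UL

/* Valid only for addresses inside the contiguously mapped lowmem zone. */
static unsigned long sketch_lowmem_virt_to_phys(unsigned long vaddr)
{
    return vaddr - SKETCH_PAGE_OFFSET;
}

static unsigned long sketch_lowmem_phys_to_virt(unsigned long paddr)
{
    return paddr + SKETCH_PAGE_OFFSET;
}
```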

    Chapter 9: Communicating with Hardware

    I/O Ports and I/O Memory

    At the hardware level, there is no conceptual difference between memory regions and I/O regions. While some CPU manufacturers implement a single address space in their chips, others decided that peripheral devices are different from memory and, therefore, deserve a separate address space. Linux implements the concept of I/O ports on all computer platforms it runs on, even on platforms where the CPU implements a single address space.

    I/O Registers and Conventional Memory

    Optimization and Memory Barriers

    When using optimizing compilers, you should never take for granted that instructions will be performed in the exact order in which they appear in the source code. For example, a compiler might reorder the assembly language instructions in such a way to optimize how registers are used. The compiler can cache data values into CPU registers without writing them to memory, and even if it stores them, both write and read operations can operate on cache memory without ever reaching physical RAM. Moreover, modern CPUs usually execute several instructions in parallel and might reorder memory accesses.

    The solution to compiler optimization and hardware reordering is to place a memory barrier. A memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction. Operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier, so a memory barrier prevents reordering across it.

    An optimization barrier primitive ensures that the assembly language instructions corresponding to C statements placed before the primitive are not mixed by the compiler with assembly language instructions corresponding to C statements placed after the primitive. In Linux the barrier() macro, which expands into
    
      asm volatile(""::: "memory"),
    
    acts as an optimization barrier:
    • The asm instruction tells the compiler to insert an assembly language fragment
    • The volatile keyword forbids the compiler to reshuffle the asm instruction with the other instructions of the program.
    • The memory keyword forces the compiler to assume that all memory locations in RAM have been changed by the assembly language instruction
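
The same construct can be tried in user space with a GCC-compatible compiler. In the sketch below, sketch_barrier is a local name for this example (not the kernel macro itself), and the two variables are hypothetical. Note that this only constrains the compiler; it emits no instruction and does nothing about reordering performed by the CPU.

```c
/* Optimization barrier written the way the text describes; requires a
 * GCC-compatible compiler. All sketch_ names are local to this example. */
#define sketch_barrier() __asm__ __volatile__("" ::: "memory")

static int sketch_ready;
static int sketch_payload;

static void sketch_publish(int value)
{
    sketch_payload = value;
    sketch_barrier();   /* compiler may not move the store below above this point */
    sketch_ready = 1;
}

static int sketch_poll(void)
{
    sketch_barrier();   /* forces the compiler to reload both variables from memory */
    return sketch_ready ? sketch_payload : -1;
}
```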
    A memory barrier primitive ensures that the operations placed before the primitive are finished before starting the operations placed after the primitive. Linux uses a few memory barrier primitives, these primitives act also as optimization barriers:
    • mb()
    • Memory barrier for MP and UP.
    • rmb()
    • Read memory barrier for MP and UP.
    • wmb()
    • Write memory barrier for MP and UP.
    • smp_mb()
    • Memory barrier for MP only
    • smp_rmb()
    • Read memory barrier for MP only
    • smp_wmb()
    • Write memory barrier for MP only.
    The smp_xxx() primitives are used whenever the memory barrier should prevent race conditions that might occur only in multiprocessor systems; in uniprocessor systems, they do nothing. The other memory barriers are used to prevent race conditions occurring both in uniprocessor and multiprocessor systems.
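
    As a rough illustration of where a barrier sits in driver-style code, the sketch below models a device with a data register and a "go" register in ordinary memory. The struct fake_dev layout and start_transfer() are hypothetical names invented for this example, and the barrier() definition simply repeats the expansion shown above so that the code builds outside the kernel with GCC or Clang. It only stops the compiler from reordering the two stores; a real driver would use wmb() together with the I/O accessors to order the writes as the hardware sees them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same expansion as the kernel's barrier(): a compiler-only barrier. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical device model: a data register and a "go" register. */
struct fake_dev {
    uint32_t data;
    uint32_t go;
};

/* Store the payload first, then set the "go" register. */
void start_transfer(struct fake_dev *dev, uint32_t value)
{
    assert(dev != NULL);
    dev->data = value;
    barrier();          /* the compiler may not move the stores across this */
    dev->go = 1;
}
```

    Without the barrier, the compiler would be free to emit the store to go before the store to data, since the two fields are independent as far as C is concerned.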

    Using I/O Ports

    Using I/O Memory

    The main mechanism used to communicate with devices is through memory-mapped registers and device memory. Both are called I/O memory. The way to access I/O memory depends on the computer architecture, bus, and device being used, and I/O memory may or may not be accessed through page tables. When access passes through page tables, the kernel must first arrange for the physical address to be visible to your driver, which usually means that you must call ioremap before doing any I/O. If no page tables are needed, I/O memory locations look pretty much like I/O ports, and you can just read and write to them using proper wrapper functions.

    I/O Memory Allocation and Mapping

    I/O memory regions must be allocated prior to use:
    
    #include <linux/ioport.h>
    
    struct resource *request_mem_region(unsigned long start, unsigned long len, char *name);
    
    All I/O memory allocations are listed in /proc/iomem. Memory regions should be freed when no longer needed:
    
    void release_mem_region(unsigned long start, unsigned long len);
    
    I/O memory is not directly accessible; a mapping must be set up first. ioremap() is designed specifically to assign virtual addresses to I/O memory regions. Once equipped with ioremap() (and iounmap()), a device driver can access any I/O memory address, whether or not it is directly mapped to virtual address space.
    
    #include <asm/io.h>
    
    • void *ioremap(unsigned long phys_addr, unsigned long size)
    • void *ioremap_nocache(unsigned long phys_addr, unsigned long size)
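    • ioremap_nocache() is useful if some control registers are in such an area and write combining or read caching is not desirable. (Recent kernels have dropped ioremap_nocache(); plain ioremap() already returns an uncached mapping there.)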
    • void iounmap(void * addr)

    Accessing I/O Memory

    The proper way of reading and writing I/O memory is via a set of functions declared in <asm/io.h>:
    
    unsigned int ioread8(void *addr);
    unsigned int ioread16(void *addr);
    unsigned int ioread32(void *addr);
    void iowrite8(u8 value, void *addr);
    void iowrite16(u16 value, void *addr);
    void iowrite32(u32 value, void *addr);
    
    addr should be an address obtained from ioremap(). To read or write a series of values to a given I/O memory address:
    
    void ioread8_rep(void *addr, void *buf, unsigned long count);
    void ioread16_rep(void *addr, void *buf, unsigned long count);
    void ioread32_rep(void *addr, void *buf, unsigned long count);
    void iowrite8_rep(void *addr, const void *buf, unsigned long count);
    void iowrite16_rep(void *addr, const void *buf, unsigned long count);
    void iowrite32_rep(void *addr, const void *buf, unsigned long count);
    
    count is expressed in units of the data size being read or written, not in bytes: ioread32_rep(), for example, reads count 32-bit values into buf. To operate on a block of I/O memory:
    
    void memset_io(void *addr, u8 value, unsigned int count);
    void memcpy_fromio(void *dest, void *source, unsigned int count);
    void memcpy_toio(void *dest, void *source, unsigned int count);
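
    On platforms where I/O memory can be dereferenced directly, these accessors essentially come down to volatile loads and stores. The following is a user-space model of that idea, not the kernel implementation: the toy_ioread32(), toy_iowrite32(), and toy_ioread32_rep() names are hypothetical, an ordinary array stands in for a region returned by ioremap(), and byte swapping and barriers are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of ioread32(): one volatile 32-bit load from the register. */
uint32_t toy_ioread32(const volatile void *addr)
{
    return *(const volatile uint32_t *)addr;
}

/* Model of iowrite32(): one volatile 32-bit store to the register. */
void toy_iowrite32(uint32_t value, volatile void *addr)
{
    *(volatile uint32_t *)addr = value;
}

/*
 * Model of ioread32_rep(): read `count` 32-bit values from the SAME
 * register address into consecutive slots of buf (FIFO-style access).
 */
void toy_ioread32_rep(const volatile void *addr, uint32_t *buf, size_t count)
{
    assert(buf != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        buf[i] = toy_ioread32(addr);
}
```

    Note how the _rep variant never advances addr: it drains a single register repeatedly, which is what distinguishes it from memcpy_fromio(), which walks a block of I/O memory.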
    

    Ports as I/O Memory




    Chapter 10: Interrupt Handling


    The irq_domain interrupt number mapping library


    The current design of the Linux kernel uses a single large number space in which each separate IRQ source is assigned a different number. These are called Linux IRQ numbers.

    In the past, IRQ numbers could be chosen so they matched the hardware IRQ line into the root interrupt controller.
    Nowadays, this number is just a number because of cascading interrupt controllers.

    For this reason we need a mechanism to separate controller-local interrupt numbers, called hardware IRQ numbers, from Linux IRQ numbers.

    The irq_domain library adds a mapping between hwirq and IRQ numbers on top of the irq_alloc_desc*() API.
    The irq_alloc_desc*() and irq_free_desc*() APIs provide allocation of IRQ numbers.



    An interrupt controller driver creates and registers an irq_domain by calling one of the irq_domain_add_*() functions.

    Mappings are added to the irq_domain by calling irq_create_mapping() which accepts the irq_domain and a hwirq number as arguments.
    If a mapping for the hwirq doesn't already exist then it will allocate a new Linux irq_desc, associate it with the hwirq, and call the .map() callback so the driver can perform any
    required hardware setup.

    When an interrupt is received, the irq_find_mapping() function should be used to find the Linux IRQ number from the hwirq number.

    If the driver has the Linux IRQ number or the irq_data pointer, and needs to know the associated hwirq number (such as in the irq_chip callbacks) then it can be directly obtained from irq_data->hwirq.
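
    The relationship between the two lookups can be pictured with a small user-space toy. This is only a sketch of the idea, not the irq_domain implementation: struct toy_domain, the toy_* functions, and the constants are all hypothetical, and the real library additionally allocates an irq_desc and calls the driver's .map() callback. What the sketch keeps is the contract described above: creating a mapping is idempotent, and a lookup on an unmapped hwirq yields 0.

```c
#include <assert.h>

#define TOY_NR_HWIRQS   8    /* hwirq lines on the toy controller */
#define TOY_FIRST_VIRQ  32   /* arbitrary first Linux IRQ number handed out */

struct toy_domain {
    unsigned int virq_of[TOY_NR_HWIRQS]; /* 0 means "no mapping" */
    unsigned int next_virq;
};

void toy_domain_init(struct toy_domain *d)
{
    for (unsigned int i = 0; i < TOY_NR_HWIRQS; i++)
        d->virq_of[i] = 0;
    d->next_virq = TOY_FIRST_VIRQ;
}

/* In the spirit of irq_find_mapping(): look up only, 0 if unmapped. */
unsigned int toy_find_mapping(const struct toy_domain *d, unsigned int hwirq)
{
    assert(hwirq < TOY_NR_HWIRQS);
    return d->virq_of[hwirq];
}

/* In the spirit of irq_create_mapping(): reuse a mapping or allocate one. */
unsigned int toy_create_mapping(struct toy_domain *d, unsigned int hwirq)
{
    assert(hwirq < TOY_NR_HWIRQS);
    if (d->virq_of[hwirq] == 0)
        d->virq_of[hwirq] = d->next_virq++;
    return d->virq_of[hwirq];
}
```

    A controller driver would call the create function once per wired hwirq at setup time and the find function from its interrupt entry path.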

    There may be multiple interrupt controllers involved in delivering an interrupt from the device to the target CPU.
    On x86 platforms:
    Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU
    
    To support such a hardware topology and make software architecture match hardware architecture, an irq_domain data structure is built for each interrupt controller and those irq_domains are organized into hierarchy.
    +--------------------------------------------------------------------------+
    | Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU |
    |            domain#2             domain#2                domain#3         |
    +------------------------------------------------------------------+-------+
                                  stacked irq_chip                    /|\
                                                                       |
                                                                      \|/
                                                               Linux IRQ numbers
        
    




    Linux generic IRQ handling



    Linux Inside


    Most real devices are a bit more complicated than direct I/O.
    Much has to be done in a time frame that is different from, and far slower than, that of the processor.
    There must be a way for a device to let the processor know when something has happened.
    An interrupt is simply an event raised by software or hardware when it wants the processor’s attention.
    A device can use one to signal the CPU. However, interrupts are not signaled directly to the CPU.
    Instead, they pass through a PIC (Programmable Interrupt Controller), a chip responsible for sequentially processing multiple interrupt requests from multiple devices.
    Newer machines have an Advanced PIC (APIC). An APIC consists of two separate devices:
    • Local APIC
    • Local APIC is located on each CPU core. It is responsible for handling the CPU-specific interrupt configuration. It is usually used to manage interrupts from the APIC-timer, thermal sensor and any other such locally connected I/O devices.
    • I/O APIC
    • It provides multi-processor interrupt management. It is used to distribute external interrupts among the CPU cores.

    A driver only needs to register a handler for its device’s interrupts, and handle them properly when they arrive.
    Interrupt handlers, by their nature, run concurrently with other code. Thus, they inevitably raise issues of concurrency and contention for data structures and hardware.
    A solid understanding of concurrency control techniques is vital when working with interrupts.

    Addresses of interrupt handlers are maintained in a special table called the Interrupt Descriptor Table (IDT). The CPU reacts to an interrupt according to the IDT.
    In 32-bit protected mode, the IDT is an array of 8-byte descriptors; each entry is called a gate, and the CPU multiplies the vector number by 8 to find the IDT entry. In 64-bit mode, the IDT is an array of 16-byte descriptors and the CPU multiplies the vector number by 16 to find the entry.
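
    The lookup arithmetic is small enough to write down. The helper below is a hypothetical user-space illustration (idt_gate_offset() is not a kernel function); it returns the byte offset of a gate from the start of the IDT for either descriptor size.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of the gate for `vector` from the start of the IDT. */
size_t idt_gate_offset(unsigned int vector, int long_mode)
{
    assert(vector <= 0xFF);                 /* only 256 vectors exist */
    return (size_t)vector * (long_mode ? 16u : 8u);
}
```

    For vector 0x3B (59), for example, this gives 472 bytes with 8-byte gates and 944 bytes with 16-byte gates.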
    arch/x86/include/asm/segment.h:
    
    #define IDT_ENTRIES   256
    #define NUM_EXCEPTION_VECTORS  32
    
    arch/x86/kernel/idt.c:
    
    static void
    idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sys)
    {
     gate_desc desc;
    
     for (; size > 0; t++, size--) {
      idt_init_desc(&desc, t);
      write_idt_entry(idt, t->vector, &desc);
      if (sys)
       set_bit(t->vector, system_vectors);
     }
    }
    
    static void set_intr_gate(unsigned int n, const void *addr)
    {
     struct idt_data data;
    
     BUG_ON(n > 0xFF);
    
     memset(&data, 0, sizeof(data));
     data.vector = n;
     data.addr = addr;
     data.segment = __KERNEL_CS;
     data.bits.type = GATE_INTERRUPT;
     data.bits.p = 1;
    
     idt_setup_from_table(idt_table, &data, 1, false);
    }
    
    void __init idt_setup_early_handler(void)
    {
     int i;
    
     for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
      set_intr_gate(i, early_idt_handler_array[i]);
    #ifdef CONFIG_X86_32
     for ( ; i < NR_VECTORS; i++)
      set_intr_gate(i, early_ignore_irq);
    #endif
     load_idt(&idt_descr);
    }
    
    
    load_idt() just executes the lidt instruction:
    
        asm volatile("lidt %0"::"m" (idt_descr));
    


    The processor uses a unique number to recognize the type of interrupt or exception. This number is called the vector number.
    A vector number is an index into the IDT. The number of vectors is limited: a vector number ranges from 0 to 255. The Linux kernel source code range-checks the vector number:
    
      BUG_ON( (unsigned)n > 0xFF );
    
    

    • The first 32 vector numbers from 0 to 31 are reserved by the processor
    • These are commonly used for the processing of architecture-defined exceptions and interrupts.
    • Vector numbers from 32 to 255 are used for user-defined interrupts.
    • These interrupts are generally assigned to external I/O devices to enable those devices to send interrupts to the processor.

    Interrupts can be classified as maskable and non-maskable:
    • maskable
    • Maskable interrupts can be blocked and unblocked with the following two instructions on x86_64:
      
      asm volatile("cli": : :"memory");
      
      asm volatile("sti": : :"memory");
      
      The sti instruction sets the IF flag and the cli instruction clears this flag.
    • non-maskable
    • Non-maskable interrupts are always reported. Usually any failure in the hardware is mapped to such non-maskable interrupts.



    Linux Device Driver Tutorial


    Interrupts in Linux Kernel


    What happens when an interrupt arrives?
    • Upon receiving an interrupt, the interrupt controller sends a signal to the processor.
    • The processor detects this signal and interrupts its current execution to handle the interrupt.
    • The processor can then notify the operating system that an interrupt has occurred, and the operating system can handle the interrupt appropriately.
    Interrupts are asynchronous interrupts generated by hardware.
    Exceptions are synchronous interrupts generated by the processor.
    System calls (one type of exception) on the x86 architecture are implemented by the issuance of a software interrupt, which traps into the kernel and causes execution of a special system call handler. Interrupts work in a similar way, except hardware (not software) issues interrupts.

    An interrupt handler or interrupt service routine (ISR) is the function that the kernel runs in response to a specific interrupt:
    • Each device that generates interrupts has an associated interrupt handler.
    • The interrupt handler for a device is part of the device’s driver (the kernel code that manages the device).
    The kernel invokes interrupt handlers in response to interrupts, and they run in a special context called interrupt context.

    Kernel code that services system calls issued by user applications runs on behalf of the corresponding application processes and is said to execute in process context. Interrupt handlers, on the other hand, run asynchronously in interrupt context.

    Kernel code running in process context is preemptible. An interrupt context, however, always runs to completion and is not preemptible. Code executing from interrupt context cannot do the following:
    • Go to sleep or relinquish the processor
    • Acquire a mutex
    • Perform time-consuming tasks
    • Access user space virtual memory
    The processing of interrupts is split into two parts, or halves:
    • Top halves
    • The top half will run immediately upon receipt of the interrupt and performs only the work that is time-critical, such as acknowledging receipt of the interrupt.
    • Bottom halves
    • A bottom half is used to process data, letting the top half deal with new incoming interrupts. The bottom half runs in the future, at a more convenient time, with all interrupts enabled.
    Four bottom-half mechanisms are available in Linux:
    • Softirqs
    • Tasklets
    • Workqueues
    • Threaded IRQs

    Interrupts Example Program in Linux Kernel


    Keep the following points in mind before writing any interrupt code:
    • Interrupt handlers cannot sleep, so avoid calling functions that may sleep.
    • When the interrupt handler has to enter a critical section, use spinlocks rather than mutexes, because a handler that fails to take a mutex would go to sleep until the mutex becomes available.
    • Interrupt handlers cannot exchange data with user space.
    • Interrupt handlers must finish as soon as possible. To ensure this, it is best to split the implementation into two parts, a top half and a bottom half. The top half does the urgent work as quickly as possible and defers the rest to the bottom half, which can be implemented with softirqs, tasklets, or workqueues.
    • Interrupt handlers are not re-entered. While a handler is executing, its corresponding IRQ must stay disabled until the handler is done.
    • Interrupt handlers can be interrupted by higher-priority handlers. To avoid being interrupted by a higher-priority handler, you can mark the interrupt handler as a fast handler. However, if too many handlers are marked as fast, system performance degrades because interrupt latency grows.
    Functions Related to Interrupt:
    • request_irq( unsigned int irq, irq_handler_t handler, unsigned long flags, const char *name, void *dev_id)
    • Register an IRQ
    • free_irq(unsigned int irq, void *dev_id)
    • Release an IRQ registered by request_irq()
    • enable_irq(unsigned int irq)
    • Re-enable interrupt disabled by disable_irq or disable_irq_nosync.
    • disable_irq(unsigned int irq)
    • Disable an IRQ and wait for any currently executing handler for it to complete before returning.
    • disable_irq_nosync(unsigned int irq)
    • Disable an IRQ without waiting for a currently executing handler to complete.
    • in_irq()
    • returns true when in interrupt handler
    • in_interrupt()
    • returns true when in interrupt handler or bottom half
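    The flags parameter takes values defined in linux/interrupt.h. The most important of these flags are:
    • IRQF_DISABLED
    • When set, this flag instructs the kernel to disable all interrupts when executing this interrupt handler. Most interrupt handlers do not set this flag. Its use is reserved for performance-sensitive interrupts that need to be executed quickly. Current kernels no longer define this flag.
    • IRQF_SAMPLE_RANDOM
    • This flag specified that interrupts generated by the device should contribute to the kernel entropy pool. Current kernels no longer define it either.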
    • IRQF_SHARED
    • This flag specifies that the interrupt line can be shared among multiple interrupt handlers
    • IRQF_TIMER
    • This flag specifies that this handler processes interrupts for the system timer.
    Intel processors handle interrupts using the IDT (Interrupt Descriptor Table).
    The IDT consists of 256 entries, one per vector; each entry is 8 bytes in 32-bit mode.


    An interrupt can be raised by software using the ‘int’ instruction.
    In Linux, the IRQ to vector mapping is done in arch/x86/include/asm/irq_vectors.h:

    
    /*
     * Linux IRQ vector layout.
     *
     * There are 256 IDT entries (per CPU - each entry is 8 bytes) which can
     * be defined by Linux. They are used as a jump table by the CPU when a
     * given vector is triggered - by a CPU-external, CPU-internal or
     * software-triggered event.
     *
     * Linux sets the kernel code address each entry jumps to early during
     * bootup, and never changes them. This is the general layout of the
     * IDT entries:
     *
     *  Vectors   0 ...  31 : system traps and exceptions - hardcoded events
     *  Vectors  32 ... 127 : device interrupts
     *  Vector  128         : legacy int80 syscall interface
     *  Vectors 129 ... LOCAL_TIMER_VECTOR-1
     *  Vectors LOCAL_TIMER_VECTOR ... 255 : special interrupts
     *
     * 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table.
     *
     * This file enumerates the exact layout of them:
     */
    
    /*
     * IDT vectors usable for external interrupt sources start at 0x20.
     * (0x80 is the syscall vector, 0x30-0x3f are for ISA)
     */
    #define FIRST_EXTERNAL_VECTOR  0x20
    
    
    Vectors 0x20 ~ 0x2f are skipped and the ISA IRQs occupy vectors 0x30-0x3f (see the comment above); therefore, IRQ#0 is mapped to a vector using the macro,
    
    #define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 0x10)
    
    IRQ#11 is :
    
     IRQ0_VECTOR + 11
    
    To raise interrupt IRQ#11,
    
      asm("int $0x3B")
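
    The arithmetic behind 0x3B can be checked with a few lines of user-space C. The two macros are copied from the excerpt above; legacy_irq_to_vector() is a hypothetical helper written for this note, and it only covers the 16 legacy IRQs under that fixed layout (vectors for other interrupts are assigned dynamically by the kernel).

```c
#include <assert.h>

#define FIRST_EXTERNAL_VECTOR  0x20
#define IRQ0_VECTOR            (FIRST_EXTERNAL_VECTOR + 0x10)

/* Vector used for a legacy IRQ (0..15) under the layout quoted above. */
unsigned int legacy_irq_to_vector(unsigned int irq)
{
    assert(irq < 16);
    return IRQ0_VECTOR + irq;
}
```

    With irq = 11 this yields 0x20 + 0x10 + 11 = 0x3B, the operand of the int instruction above.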
    
    • sim_irq/Makefile
    • 
      ifneq ($(KERNELRELEASE),)
      obj-m += sim_irq.o
      else
      KERNELDIR ?= /lib/modules/$(shell uname -r)/build
      PWD  := $(shell pwd)
      default:
              $(MAKE) -C $(KERNELDIR) M=$(PWD) modules
      endif
      
      
    • sim_irq/sim_irq.c
    • 
      #include <linux/kernel.h>
      #include <linux/init.h>
      #include <linux/module.h>
      #include <linux/kdev_t.h>
      #include <linux/fs.h>
      #include <linux/cdev.h>
      #include <linux/device.h>
      #include <linux/slab.h>                 //kmalloc()
      #include <linux/uaccess.h>              //copy_to/from_user()
      #include <linux/sysfs.h>
      #include <linux/kobject.h>
      #include <linux/interrupt.h>
      #include <asm/io.h>
       
      #define IRQ_NO 11
       
      //Interrupt handler for IRQ 11. 
       
      static irqreturn_t irq_handler(int irq,void *dev_id) {
        printk(KERN_INFO "Shared IRQ: Interrupt Occurred\n");
        return IRQ_HANDLED;
      }
       
       
      volatile int etx_value = 0;
       
       
      dev_t dev = 0;
      static struct class *dev_class;
      static struct cdev etx_cdev;
      struct kobject *kobj_ref;
       
      static int __init etx_driver_init(void);
      static void __exit etx_driver_exit(void);
       
      /*************** Driver Fuctions **********************/
      static int etx_open(struct inode *inode, struct file *file);
      static int etx_release(struct inode *inode, struct file *file);
      static ssize_t etx_read(struct file *filp, 
                      char __user *buf, size_t len,loff_t * off);
      static ssize_t etx_write(struct file *filp, 
                      const char *buf, size_t len, loff_t * off);
       
      /*************** Sysfs Fuctions **********************/
      static ssize_t sysfs_show(struct kobject *kobj, 
                      struct kobj_attribute *attr, char *buf);
      static ssize_t sysfs_store(struct kobject *kobj, 
                      struct kobj_attribute *attr,const char *buf, size_t count);
       
      struct kobj_attribute etx_attr = __ATTR(etx_value, 0660, sysfs_show, sysfs_store);
       
      static struct file_operations fops =
      {
              .owner          = THIS_MODULE,
              .read           = etx_read,
              .write          = etx_write,
              .open           = etx_open,
              .release        = etx_release,
      };
       
      static ssize_t sysfs_show(struct kobject *kobj, 
                      struct kobj_attribute *attr, char *buf)
      {
              printk(KERN_INFO "Sysfs - Read!!!\n");
              return sprintf(buf, "%d", etx_value);
      }
       
      static ssize_t sysfs_store(struct kobject *kobj, 
                      struct kobj_attribute *attr,const char *buf, size_t count)
      {
              printk(KERN_INFO "Sysfs - Write!!!\n");
              sscanf(buf,"%d",&etx_value);
              return count;
      }
       
      static int etx_open(struct inode *inode, struct file *file)
      {
              printk(KERN_INFO "Device File Opened...!!!\n");
              return 0;
      }
       
      static int etx_release(struct inode *inode, struct file *file)
      {
              printk(KERN_INFO "Device File Closed...!!!\n");
              return 0;
      }
       
      static ssize_t etx_read(struct file *filp, 
                      char __user *buf, size_t len, loff_t *off)
      {
              printk(KERN_INFO "Read function\n");
              asm("int $0x3B");  // Corresponding to irq 11
              return 0;
      }
      static ssize_t etx_write(struct file *filp, 
                      const char __user *buf, size_t len, loff_t *off)
      {
              printk(KERN_INFO "Write Function\n");
              return 0;
      }
       
       
      static int __init etx_driver_init(void)
      {
              /*Allocating Major number*/
              if((alloc_chrdev_region(&dev, 0, 1, "etx_Dev")) <0){
                      printk(KERN_INFO "Cannot allocate major number\n");
                      return -1;
              }
              printk(KERN_INFO "Major = %d Minor = %d \n",MAJOR(dev), MINOR(dev));
       
              /*Creating cdev structure*/
              cdev_init(&etx_cdev,&fops);
       
              /*Adding character device to the system*/
              if((cdev_add(&etx_cdev,dev,1)) < 0){
                  printk(KERN_INFO "Cannot add the device to the system\n");
                  goto r_class;
              }
       
              /*Creating struct class*/
              if((dev_class = class_create(THIS_MODULE,"etx_class")) == NULL){
                  printk(KERN_INFO "Cannot create the struct class\n");
                  goto r_class;
              }
       
              /*Creating device*/
              if((device_create(dev_class,NULL,dev,NULL,"etx_device")) == NULL){
                  printk(KERN_INFO "Cannot create the Device 1\n");
                  goto r_device;
              }
       
              /*Creating a directory in /sys/kernel/ */
              kobj_ref = kobject_create_and_add("etx_sysfs",kernel_kobj);
       
              /*Creating sysfs file for etx_value*/
              if(sysfs_create_file(kobj_ref,&etx_attr.attr)){
                      printk(KERN_INFO"Cannot create sysfs file......\n");
                      goto r_sysfs;
              }
              if (request_irq(IRQ_NO, irq_handler, IRQF_SHARED, "etx_device", (void *)(irq_handler))) {
                  printk(KERN_INFO "my_device: cannot register IRQ\n");
                  goto irq;
              }
              printk(KERN_INFO "Device Driver Insert...Done!!!\n");
              return 0;
       
      irq:
              /* request_irq() failed, so there is no IRQ to free; undo the sysfs file */
              sysfs_remove_file(kobj_ref, &etx_attr.attr);
      r_sysfs:
              /* the attribute lives on kobj_ref, so drop that object */
              kobject_put(kobj_ref);
              device_destroy(dev_class, dev);
      r_device:
              class_destroy(dev_class);
      r_class:
              cdev_del(&etx_cdev);
              unregister_chrdev_region(dev, 1);
              return -1;
      }
      }
       
      void __exit etx_driver_exit(void)
      {
              free_irq(IRQ_NO,(void *)(irq_handler));
              /* remove the attribute from the kobject it was created on, then drop it */
              sysfs_remove_file(kobj_ref, &etx_attr.attr);
              kobject_put(kobj_ref);
              device_destroy(dev_class,dev);
              class_destroy(dev_class);
              cdev_del(&etx_cdev);
              unregister_chrdev_region(dev, 1);
              printk(KERN_INFO "Device Driver Remove...Done!!!\n");
      }
       
      module_init(etx_driver_init);
      module_exit(etx_driver_exit);
       
      MODULE_LICENSE("GPL");
      MODULE_AUTHOR("EmbeTronicX <embetronicx@gmail.com or admin@embetronicx.com>");
      MODULE_DESCRIPTION("A simple device driver - Interrupts");
      MODULE_VERSION("1.9");
      
    • Build the driver
    • 
      $ sudo make
      
    • Load the driver
    • 
      $ sudo insmod sim_irq.ko
      
      [ 2781.101790] Major = 239 Minor = 0 
      [ 2781.101959] Device Driver Insert...Done!!!
      
      
    • To trigger interrupt
    • 
      $ sudo cat /dev/etx_device
      
    • Now see the Dmesg
    • 
      $ dmesg
      ...
      [ 2872.619249] Device File Opened...!!!
      [ 2872.619260] Read function
      [ 2872.619262] do_IRQ: 1.59 No irq handler for vector
      [ 2872.619273] Device File Closed...!!!
      
      
    arch/x86/kernel/irq.c:
    
    
    __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
    {
    ...
     unsigned vector = ~regs->orig_ax;
    ...
     desc = __this_cpu_read(vector_irq[vector]);
    ...
      if (desc == VECTOR_UNUSED) {
                        pr_emerg_ratelimited("%s: %d.%d No irq handler for vector\n",
                        __func__, smp_processor_id(),
                          vector);
                    } else {
                         __this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
                    }
    ...
    }
    
    Clues:
    • "request_irq(11,...)" and the interrupt vector received is 59(0x 3B)
    • arch/x86/include/asm/irq_vectors.h:
      
      #define NR_VECTORS 256
      
      arch/x86/include/asm/hw_irq.h:
      
      #define VECTOR_UNUSED  NULL
      
      typedef struct irq_desc* vector_irq_t[NR_VECTORS];
      DECLARE_PER_CPU(vector_irq_t, vector_irq);
      
      Is the IRQ# to interrupt vector mapping correct?
      
       0x3B = 0x20 + 0x10 + 11
      
    • $ cat /proc/interrupts
    • 
       11:          0          0          0          0   IO-APIC  11-edge      sim_irq
      
      


    Preparing the Parallel Port


    Installing an Interrupt Handler


    If the Linux kernel hasn’t been told to expect your interrupt, it simply acknowledges and ignores it.
    The kernel keeps a registry of interrupt lines, similar to the registry of I/O ports.
    The interrupt registration interface is declared in <linux/interrupt.h>:
    
    int request_irq(unsigned int irq,
                    irqreturn_t (*handler)(int, void *),
                    unsigned long flags,
                    const char *dev_name, 
                    void *dev_id);
    
    void free_irq(unsigned int irq, void *dev_id);
    
    
    request_irq() allocates interrupt resources and enables the interrupt line and IRQ handling.
    Where:
    • irq
    • The interrupt line number being requested.
    • handler
    • The pointer to the handling function being installed.
    • flags
    • a bit mask of options related to interrupt management.
      • IRQF_SHARED
      • Allow sharing the IRQ among several devices. If this flag is not set and a handler is already associated with the requested interrupt, the request fails. A shared interrupt is handled in a special way by the kernel: the associated interrupt handlers are executed until the device that generated the interrupt is identified. How can a device driver know whether its handling routine was activated by an interrupt from the device it manages? Devices that offer interrupt support have a status register that the handling routine can check to see whether the interrupt was generated by that device.
      • IRQF_PROBE_SHARED
      • set by callers when they expect sharing mismatches to occur
      • IRQF_TIMER
      • Flag to mark this interrupt as timer interrupt
      • IRQF_PERCPU
      • Interrupt is per cpu
      • IRQF_NOBALANCING
      • Flag to exclude this interrupt from irq balancing
      • IRQF_IRQPOLL
      • Interrupt is used for polling (only the interrupt that is registered first in a shared interrupt is considered for performance reasons)
      • IRQF_ONESHOT
      • Interrupt is not reenabled after the hardirq handler finished. Used by threaded interrupts which need to keep the irq line disabled until the threaded handler has been run.
      • IRQF_NO_SUSPEND
      • Do not disable this IRQ during suspend. Does not guarantee that this interrupt will wake the system from a suspended state. See Documentation/power/suspend-and-interrupts.rst
      • IRQF_FORCE_RESUME
      • Force enable it on resume even if IRQF_NO_SUSPEND is set
      • IRQF_NO_THREAD
      • Interrupt cannot be threaded
      • IRQF_EARLY_RESUME
      • Resume IRQ early during syscore instead of at device resume time.
      • IRQF_COND_SUSPEND
      • If the IRQ is shared with a NO_SUSPEND user, execute this interrupt handler after suspending interrupts. For system wakeup devices users need to implement wakeup detection in their interrupt handlers.
    • dev_name
    • The string passed to request_irq is used in /proc/interrupts to show the owner of the interrupt
    • dev_id
    • When requesting a shared interrupt, the dev_id argument must be unique and must not be NULL. Usually it is set to the module’s private data. It is used when the interrupt line is freed, and the driver may also use it to point to its own private data area (to identify which device is interrupting). If the interrupt is not shared, dev_id can be set to NULL, but it is a good idea anyway to use this item to point to the device structure.
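
    The interplay between IRQF_SHARED, the status register check, and dev_id can be modeled in user space. Everything in this sketch is hypothetical (the toy_* names, the status bit, the dispatch loop); it is not kernel code and it simplifies the kernel's behavior by calling every handler on the line. It does show the pattern a shared handler follows: use dev_id to find your own device, check its status register, and report "not mine" if nothing is pending.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel's IRQ_NONE / IRQ_HANDLED return values. */
enum toy_irqreturn { TOY_IRQ_NONE = 0, TOY_IRQ_HANDLED = 1 };

#define TOY_STATUS_PENDING 0x1u  /* hypothetical "interrupt pending" bit */

struct toy_device {
    uint32_t status;    /* stands in for the device's status register */
    unsigned int hits;  /* interrupts this device has serviced */
};

/* Per-device handler: claim the interrupt only if our device raised it. */
enum toy_irqreturn toy_handler(int irq, void *dev_id)
{
    struct toy_device *dev = dev_id;

    (void)irq;
    if (!(dev->status & TOY_STATUS_PENDING))
        return TOY_IRQ_NONE;                /* some other device on the line */
    dev->status &= ~TOY_STATUS_PENDING;     /* acknowledge the interrupt */
    dev->hits++;
    return TOY_IRQ_HANDLED;
}

/* Model of a shared line: each registered dev_id gets its handler called. */
unsigned int toy_dispatch(int irq, struct toy_device *devs[], size_t n)
{
    unsigned int handled = 0;

    for (size_t i = 0; i < n; i++) {
        assert(devs[i] != NULL);  /* shared lines need a non-NULL dev_id */
        if (toy_handler(irq, devs[i]) == TOY_IRQ_HANDLED)
            handled++;
    }
    return handled;
}
```

    Because each registration carries its own dev_id, the same handler function can serve several devices on one line and still tell them apart.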

    Because the number of interrupt lines is limited, installing the interrupt handler from within the module’s initialization function might not be a good idea.
    Requesting the interrupt at device open, on the other hand, allows some sharing of resources.
    • The correct place to call request_irq is when the device is first opened, before the hardware is instructed to generate interrupts.
    • The place to call free_irq is the last time the device is closed, after the hardware is told not to interrupt the processor any more.

    The /proc Interface


    Reported interrupts are shown in /proc/interrupts.
    
               CPU0       CPU1       CPU2       CPU3       
      0:         32          0          0          0   IO-APIC   2-edge      timer
      1:          0          0      18154          0   IO-APIC   1-edge      i8042
      8:          0          0          0          1   IO-APIC   8-edge      rtc0
    ...
    
    
    The first column is the IRQ number.
    The /proc/interrupts display shows how many interrupts have been delivered to each CPU on the system.
    /proc/stat includes (but not limited to) the number of interrupts received since system boot. Each line of stat begins with a text string that is the key to the line; the intr mark is what we are looking for.
    
    ...
    intr 641582490 32 18356
    ...
    
    The first number is the total of all interrupts, while each of the others represents a single IRQ line, starting with interrupt 0 .
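
    Pulling these counters out of the intr line takes only a few lines of user-space C. The function below is a minimal sketch (parse_intr_line() is a name chosen for this note, not a library call); it assumes the caller has already read the line from /proc/stat into a string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse an "intr ..." line from /proc/stat.
 * Stores the grand total in *total and up to `max` per-IRQ counters in
 * per_irq[]; returns the number of per-IRQ counters stored, or -1 if the
 * line does not start with the "intr" key or has no total.
 */
int parse_intr_line(const char *line, unsigned long long *total,
                    unsigned long long *per_irq, int max)
{
    char *end;
    int n = 0;

    assert(line != NULL && total != NULL && per_irq != NULL);
    if (strncmp(line, "intr ", 5) != 0)
        return -1;
    line += 5;

    *total = strtoull(line, &end, 10);
    if (end == line)
        return -1;              /* no total after the key */
    line = end;

    while (n < max) {
        unsigned long long v = strtoull(line, &end, 10);
        if (end == line)
            break;              /* no more numbers on the line */
        per_irq[n++] = v;       /* per_irq[0] is IRQ 0, and so on */
        line = end;
    }
    return n;
}
```

    Applied to the sample line above, the total is 641582490 and the first two per-IRQ counters are 32 (IRQ 0) and 18356 (IRQ 1).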

    Autodetecting the IRQ Number


    Autodetection of the interrupt number is a basic requirement for driver usability.
    How to determine which IRQ line is going to be used by the device:
    • The user specifies the interrupt number at load time. This is a bad idea.
    • The driver retrieves the interrupt number from the device.
    • This is done by reading a status byte from one of the device’s I/O ports or from PCI configuration space. Probing for the interrupt then amounts to probing for the device.
    • the driver tells the device to generate interrupts and watches what happens.
    • If everything goes well, only one interrupt line is activated.

    Kernel-assisted probing


    The kernel offers a low-level facility for probing the interrupt number. It works only for non-shared interrupts.
    • unsigned long probe_irq_on(void)
    • This function returns a bit mask of unassigned interrupts
    • int probe_irq_off(unsigned long)
    • This function returns the number of the interrupt that was issued after probe_irq_on(). If no interrupts occurred, 0 is returned (so IRQ 0 can’t be probed for). It returns a negative value if more than one interrupt occurred (ambiguous detection).
    The driver's probing process:
    • call probe_irq_on()
    • enable interrupts on the probed device
    • disable interrupts on the probed device
    • call probe_irq_off()
    
    unsigned long mask;

    mask = probe_irq_on();
    /* enable interrupts, then disable interrupts, on the probed device */
    udelay(5); /* give it some time */
    short_irq = probe_irq_off(mask);
    if (short_irq == 0) { /* none of them? */
        printk(KERN_INFO "short: no irq reported by probe\n");
    }
    

    Do-it-yourself probing


    Often a device can be configured to use one IRQ number from a set of three or four; probing just those IRQs enables us to detect the right one, without having to test for all possible IRQs.

    To probe for all interrupts, you have to probe from IRQ 0 to IRQ NR_IRQS-1, where NR_IRQS is defined in arch/arm/include/asm/irq.h and is platform dependent.

    The handler’s role is to update short_irq according to which interrupts are actually received.
    
    irqreturn_t short_probing(int irq, void *dev_id)
    {
    ...
    }
    
    The argument irq is the interrupt number being handled by the registered handler.
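One possible way to do that bookkeeping is sketched below as a user-space model. It is an assumption, not the handler elided above: it borrows the convention described for probe_irq_off() (0 for nothing seen, a positive number for exactly one line, a negative number for an ambiguous result), and model_probing_handler is a hypothetical name that takes only the irq argument.

```c
#include <assert.h>

/* 0: no interrupt seen yet, >0: the one line seen, <0: ambiguous result. */
static volatile int short_irq;

/* Models only the update of short_irq, not a real kernel handler. */
static void model_probing_handler(int irq)
{
    assert(irq >= 0);
    if (short_irq == 0)
        short_irq = irq;    /* first interrupt seen: remember its line */
    if (short_irq != irq)
        short_irq = -irq;   /* a different line also fired: ambiguous */
}
```

Under this convention IRQ 0 cannot be detected, because recording it leaves short_irq at 0.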

    Fast and Slow Handlers


    The internals of interrupt handling on the x86


    The lowest level of interrupt handling can be found in arch/i386/kernel/entry.S:
    • an assembly-language file that handles much of the machine-level work
    • a bit of code is assigned to every possible interrupt
    • For each interrupt, the code pushes the interrupt number on the stack and jumps to a common segment, which calls do_IRQ(), defined in arch/x86/kernel/irq.c:
      
      __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
      
      This looks up the handler(s) for this particular IRQ. If there is no handler, there’s nothing to do.


    Introduction to deferred interrupts (Softirq, Tasklets and Workqueues)


    Interrupt handling has to satisfy two conflicting requirements:
    • An interrupt handler must execute quickly.
    • Sometimes an interrupt handler must do a large amount of work.

    Because of this, the handling of an interrupt has traditionally been split into two parts:
    • Top half;
    • Bottom half;
    The term “bottom half” has remained as a common noun referring to all the different ways of organizing deferred processing of an interrupt.
    There are three types of deferred interrupts in the Linux kernel:
    • softirqs;
    • tasklets;
    • workqueues;



    Implementing a Handler


    A handler can’t transfer data to or from user space, because it doesn’t execute in the context of a process.
    The role of an interrupt handler is to give feedback to its device about interrupt reception and to read or write data according to the meaning of the interrupt being serviced.
    Most hardware devices won’t generate other interrupts until their “interrupt-pending” bit has been cleared. The interrupt handler therefore usually has to clear the “interrupt-pending” bit in the device’s status register.
    A typical task for an interrupt handler is awakening processes sleeping on the device if the interrupt signals the event they’re waiting for, such as the arrival of new data.

    Top and Bottom Halves


    One of the main problems with interrupt handling is how to perform lengthy tasks within a handler, because interrupt handlers need to finish up quickly and not keep interrupts blocked for long.
    Linux (along with many other systems) resolves this problem by splitting the interrupt handler into two halves:
    • top half
    • the routine that actually responds to the interrupt — the one you register with request_irq().
    • bottom half
    • a routine that is scheduled by the top half to be executed later, at a safer time. Interrupts are enabled during execution of the bottom half.

    The Linux kernel has two different mechanisms that may be used to implement bottom-half processing:
    • Tasklets are often the preferred mechanism for bottom-half processing; they are very fast, but all tasklet code must be atomic.
    • The alternative to tasklets is workqueues, which may have a higher latency but are allowed to sleep.
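The division of labour between the two halves can be modelled in user space as follows. This is a sketch of the idea only: struct event_queue, model_top_half(), and model_bottom_half() are hypothetical names, and in a real driver the top half would hand work over with a kernel mechanism such as tasklet_schedule() or schedule_work() rather than by setting a flag.

```c
#include <assert.h>
#include <stddef.h>

#define EVENT_QUEUE_SIZE 8   /* ring buffer; holds EVENT_QUEUE_SIZE - 1 events */

struct event_queue {
    unsigned int events[EVENT_QUEUE_SIZE];
    size_t head;             /* next slot the top half writes */
    size_t tail;             /* next slot the bottom half reads */
    int bh_pending;          /* set by the top half to request a later run */
};

/* Top half: record the event and return at once. -1 means queue full. */
static int model_top_half(struct event_queue *q, unsigned int status)
{
    size_t next;

    assert(q != NULL);
    next = (q->head + 1) % EVENT_QUEUE_SIZE;
    if (next == q->tail)
        return -1;           /* no room: the event is dropped */
    q->events[q->head] = status;
    q->head = next;
    q->bh_pending = 1;       /* ask for the bottom half to run later */
    return 0;
}

/* Bottom half: runs later and does the lengthy part for every queued event. */
static size_t model_bottom_half(struct event_queue *q, unsigned int *sum)
{
    size_t handled = 0;

    assert(q != NULL && sum != NULL);
    q->bh_pending = 0;
    while (q->tail != q->head) {
        *sum += q->events[q->tail];   /* stand-in for the real processing */
        q->tail = (q->tail + 1) % EVENT_QUEUE_SIZE;
        handled++;
    }
    return handled;
}
```

The top half touches only the queue and a flag, so it stays short; everything that could take time is left to the bottom half, which drains whatever has accumulated since its last run.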

