Linux Watchdog

Linux Kernel Watchdog Explained

The Linux kernel watchdog is used to monitor whether a system is still running.
The watchdog module is specific to the hardware or chip being used.
It is useful for systems that are mission critical and need the ability to reboot themselves without human intervention.

A misconfigured watchdog can cause problems such as:

  • Endless reboot loop
  • File corruption due to hard reset
  • Unpredictable random reboots

To use the watchdog peripheral in Linux, you will need:
  • Watchdog driver
  • Watchdog device node. You can recreate it if it doesn't exist:
    
    $ sudo mknod /dev/watchdog c 10 130

Watchdog Module

Watchdog functionality on the hardware side sets up a timer that times out after a predetermined period. The watchdog software then periodically refreshes the hardware timer.
If the software stops refreshing, then after the predetermined period, the timer performs a hardware reset of the device.

In order for a watchdog timer to be functional, the motherboard manufacturer has to use the chip’s watchdog functionality.
Also, you need the right watchdog kernel module to be loaded in your Linux system. Different chips use different modules.
For example:

  • Intel chipsets might use the “iTCO_wdt” module
  • HP hardware might use “hpwdt”
  • IBM mainframes might use “vmwatchdog”
  • Xen VM might use “xen_wdt”

After the module is loaded, check for /dev/watchdog on the Linux system. If this file is present, the watchdog kernel device driver or module was loaded.

  • If no application opens the /dev/watchdog file, the kernel takes care of resetting the watchdog.
  • The watchdog module is a timer. It does not appear as a dedicated kernel thread; it is handled by the soft IRQ thread.
  • If an application opens this device file, it becomes responsible for the watchdog and can reset it by writing to the file.
  • The system (watchdog daemon) keeps writing to /dev/watchdog periodically. This is called “kicking” or “feeding” the watchdog.
    If the system fails to kick or feed the watchdog, then after a while the system is hard reset by the hardware watchdog.

Watchdog Daemon

The watchdog daemon opens /dev/watchdog and keeps writing to it often enough to keep the kernel from resetting, at least once per minute.
You can do it in one of these two ways:
  • Write any character into ‘/dev/watchdog’. Avoid the ‘V’ character, which is reserved for signalling that the watchdog is about to be stopped on purpose.
  • Use ioctl to issue the ‘WDIOC_KEEPALIVE’ command.
Each write delays the reboot time by another minute.
After a minute of inactivity the watchdog hardware will cause the reset.
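
The two feeding methods above can be sketched in C as follows. This is only a minimal illustration, not a complete daemon: the helper names feed_by_write and feed_by_ioctl are hypothetical, and the caller is assumed to have opened the watchdog device already.

```c
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/* Hypothetical helper: feed the watchdog by writing one byte.
   Any byte works; '\0' is used here so that it can never be mistaken
   for the 'V' character. Returns 0 on success, -1 on error. */
int feed_by_write(int fd)
{
   return (write(fd, "\0", 1) == 1) ? 0 : -1;
}

/* Hypothetical helper: feed the watchdog with the WDIOC_KEEPALIVE ioctl.
   Returns 0 on success, -1 on error (for example, when fd does not
   refer to a watchdog device). */
int feed_by_ioctl(int fd)
{
   return (ioctl(fd, WDIOC_KEEPALIVE, 0) == 0) ? 0 : -1;
}
```

A daemon would typically call one of these helpers in a loop, sleeping between calls for a period comfortably shorter than the watchdog timeout.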

The daemon can also test system resources such as memory usage. If the tests fail, the watchdog is no longer refreshed, and the watchdog then brings the system down.

watchdog is such a daemon:

  • The watchdog program writes to /dev/watchdog every ten seconds.
  • If the device is opened but not written to within a minute, the machine will reboot.
  • This feature is available when the kernel is built with "software watchdog" support (standard in Debian kernels) or if the machine is equipped with a hardware watchdog.
  • The kernel software watchdog's ability to reboot will depend on the state of the machine and interrupts.
  • The watchdog tool itself runs several health checks and acts appropriately if the system is not in good shape.
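
The idea of gating the feed on a health check can be sketched as below. This is an assumption-laden illustration, not the watchdog daemon's actual logic: the function names and the choice of the 1-minute load average as the only check are hypothetical.

```c
#define _DEFAULT_SOURCE   /* exposes getloadavg() in glibc's <stdlib.h> */
#include <stdlib.h>

/* Hypothetical policy: the system counts as healthy while the
   1-minute load average stays at or below max_load. */
int load_is_ok(double load1, double max_load)
{
   return load1 <= max_load;
}

/* Returns 1 if the watchdog should be fed, 0 if feeding should be
   withheld so that the timer expires and the machine is reset. */
int should_feed(double max_load)
{
   double load[1];

   if (getloadavg(load, 1) != 1)
      return 0;   /* cannot measure the load: treat as unhealthy */
   return load_is_ok(load[0], max_load);
}
```

A feeding loop would call should_feed() before each write and simply skip the write when it returns 0.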

Starting and Stopping Watchdog

The watchdog is automatically started once you open /dev/watchdog.
After watchdog starts,
  • It puts itself into the background and then tries all checks specified in its configuration file in turn.
  • Between each two tests it will write to the kernel device to prevent a reset.
  • After finishing all tests watchdog goes to sleep for some time.
  • watchdog sleeps for a configurable interval that defaults to 1 second, to make sure it triggers the device early enough.
  • The kernel driver expects a write to the watchdog device every minute.
The watchdog daemon can be stopped without causing a reboot if the device /dev/watchdog is closed correctly (unless your kernel is compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled, in which case the watchdog cannot be stopped at all):
  1. Write the character V into /dev/watchdog. This tells the driver that the close is intentional, which prevents the watchdog from being stopped accidentally.
  2. Close the /dev/watchdog file.
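
Whether a driver honours the 'V' character can be checked from user space. The sketch below assumes the driver reports its capabilities through the WDIOC_GETSUPPORT ioctl and the WDIOF_MAGICCLOSE flag from <linux/watchdog.h>; the two function names are hypothetical, and fd is assumed to be an already opened watchdog device.

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Returns 1 if the option flags advertise support for the 'V'
   ("magic close") character, 0 otherwise. */
int supports_magic_close(unsigned int options)
{
   return (options & WDIOF_MAGICCLOSE) != 0;
}

/* Queries the driver behind fd. Returns 1 or 0 as above, or -1 if the
   WDIOC_GETSUPPORT ioctl fails (for example, fd is not a watchdog). */
int query_magic_close(int fd)
{
   struct watchdog_info info;

   if (ioctl(fd, WDIOC_GETSUPPORT, &info) != 0)
      return -1;
   return supports_magic_close(info.options);
}
```

Keep in mind that opening /dev/watchdog just to run this query already starts the watchdog, so the query belongs inside the program that also feeds it.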

Test the Linux watchdog

Testing the hardware watchdog

To test whether the hardware watchdog is working, run the command below. The redirection has to be performed as root, which is why it is wrapped in sh -c.

$ sudo sh -c 'cat >> /dev/watchdog'

Press “enter” twice and wait (the prompt will not come back).
After a while (depending on your kernel’s settings), the system should perform a hard reboot because nothing keeps writing to the device file.

Example code to control the watchdog

  • Set the watchdog timeout: use ioctl with WDIOC_SETTIMEOUT.
  • Get the current watchdog timeout: use ioctl with WDIOC_GETTIMEOUT.
  • Check whether the last boot was caused by the watchdog or by a power-on reset: use ioctl with WDIOC_GETBOOTSTATUS.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>   /* ioctl() */
#include <fcntl.h>
#include <getopt.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>      /* write(), close() */
 
#include <linux/watchdog.h>
 
#define WATCHDOGDEV "/dev/watchdog"
static const char *const short_options = "hd:i:";
static const struct option long_options[] = {
   {"help", 0, NULL, 'h'},
   {"dev", 1, NULL, 'd'},
   {"interval", 1, NULL, 'i'},
   {NULL, 0, NULL, 0},
};
 
static void print_usage(FILE * stream, char *app_name, int exit_code)
{
   fprintf(stream, "Usage: %s [options]\n", app_name);
   fprintf(stream,
      " -h  --help                Display this usage information.\n"
      " -d  --dev <device_file>   Use <device_file> as watchdog device file.\n"
      "                           The default device file is '/dev/watchdog'\n"
      " -i  --interval <interval> Change the watchdog interval time\n");
 
   exit(exit_code);
}
 
int main(int argc, char **argv)
{
   int fd;         /* File descriptor for watchdog */
   int interval;      /* Watchdog timeout interval (in secs) */
   int bootstatus;      /* Watchdog last boot status */
   char *dev;      /* Watchdog default device file */
 
   int next_option;   /* getopt iteration var */
   int kick_watchdog;   /* kick_watchdog command (int, as returned by getchar()) */
 
   /* Init variables */
   dev = WATCHDOGDEV;
   interval = 0;
   kick_watchdog = 0;
 
   /* Parse options if any */
   do {
      next_option = getopt_long(argc, argv, short_options,
                 long_options, NULL);
      switch (next_option) {
      case 'h':
         print_usage(stdout, argv[0], EXIT_SUCCESS);
      case 'd':
         dev = optarg;
         break;
      case 'i':
         interval = atoi(optarg);
         break;
      case '?':   /* Invalid options */
         print_usage(stderr, argv[0], EXIT_FAILURE);
      case -1:   /* Done with options */
         break;
      default:   /* Unexpected stuffs */
         abort();
      }
   } while (next_option != -1);
 
   /* Once the watchdog device file is open, the watchdog will be activated by
      the driver */
   fd = open(dev, O_RDWR);
   if (-1 == fd) {
      fprintf(stderr, "Error: %s\n", strerror(errno));
      exit(EXIT_FAILURE);
   }
 
   /* If user wants to change the watchdog interval */
   if (interval != 0) {
      fprintf(stdout, "Set watchdog interval to %d\n", interval);
      if (ioctl(fd, WDIOC_SETTIMEOUT, &interval) != 0) {
         fprintf(stderr,
            "Error: Set watchdog interval failed\n");
         exit(EXIT_FAILURE);
      }
   }
 
   /* Display current watchdog interval */
   if (ioctl(fd, WDIOC_GETTIMEOUT, &interval) == 0) {
      fprintf(stdout, "Current watchdog interval is %d\n", interval);
   } else {
      fprintf(stderr, "Error: Cannot read watchdog interval\n");
      exit(EXIT_FAILURE);
   }
 
   /* Check if last boot is caused by watchdog: the driver reports this
      with the WDIOF_CARDRESET flag */
   if (ioctl(fd, WDIOC_GETBOOTSTATUS, &bootstatus) == 0) {
      fprintf(stdout, "Last boot is caused by : %s\n",
         (bootstatus & WDIOF_CARDRESET) ? "Watchdog" : "Power-On-Reset");
   } else {
      fprintf(stderr, "Error: Cannot read watchdog status\n");
      exit(EXIT_FAILURE);
   }
 
   /* There are two ways to kick the watchdog:
      - by writing any dummy value into watchdog device file, or
      - by using IOCTL WDIOC_KEEPALIVE
    */
   fprintf(stdout,
      "Use:\n"
      " <w> to kick through writing over device file\n"
      " <i> to kick through IOCTL\n" " <x> to exit the program\n");
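   do {
      int c = getchar();   /* int, so that EOF can be detected */

      if (c == EOF)
         c = 'x';   /* treat end of input as a request to exit */
      kick_watchdog = c;
      switch (kick_watchdog) {
      case 'w':
         if (write(fd, "w", 1) != 1)
            fprintf(stderr, "Error: %s\n", strerror(errno));
         else
            fprintf(stdout,
               "Kick watchdog through writing over device file\n");
         break;
      case 'i':
         if (ioctl(fd, WDIOC_KEEPALIVE, NULL) != 0)
            fprintf(stderr, "Error: %s\n", strerror(errno));
         else
            fprintf(stdout, "Kick watchdog through IOCTL\n");
         break;
      case 'x':
         fprintf(stdout, "Goodbye !\n");
         break;
      case '\n':   /* ignore the newline that follows each command */
         break;
      default:
         fprintf(stdout, "Unknown command\n");
         break;
      }
   } while (kick_watchdog != 'x');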
 
   /* The 'V' value needs to be written into watchdog device file to indicate
      that we intend to close/stop the watchdog. Otherwise, debug message
      'Watchdog timer closed unexpectedly' will be printed
    */
   if (write(fd, "V", 1) != 1)
      fprintf(stderr, "Error: %s\n", strerror(errno));
   /* Closing the watchdog device will deactivate the watchdog
      (unless the kernel was built with the nowayout option). */
   close(fd);
   return EXIT_SUCCESS;
}

Setup Watchdog

Systemd watchdog service

Systemd provides full support for hardware watchdogs (as exposed in /dev/watchdog to userspace), as well as supervisor (software) watchdog support.

To make use of the hardware watchdog it is sufficient to set the RuntimeWatchdogSec= option in /etc/systemd/system.conf.


  #RuntimeWatchdogSec=0

It defaults to 0 (i.e. no hardware watchdog use). Set it to a value like 20s and the watchdog is enabled.
Note that systemd will send a ping to the hardware at half the specified interval, i.e. every 10s.

Note that the hardware watchdog device (/dev/watchdog) is single-user only. That means that you can either enable this functionality in systemd, or use a separate external watchdog daemon.

Enable Watchdog

The watchdog daemon service is disabled by default. Use the commands below to enable and start it.

$ sudo systemctl enable watchdog.service
$ sudo systemctl start watchdog.service

Check Watchdog Service


$ systemctl status watchdog.service 

Setup Watchdog Timeout

The default timeout is 15 seconds. To change it, edit the watchdog-timeout setting in /etc/watchdog.conf.
Then restart the service:

$ sudo systemctl restart watchdog.service

Test the watchdog

  • Trigger a kernel crash. This crashes the system without first unmounting file systems or syncing disks attached to the system:
    
    $ sudo -i
    # echo c > /proc/sysrq-trigger 

  • Kill the watchdog daemon. This prevents it from feeding the watchdog, so the system will reboot after the timeout set in /etc/watchdog.conf. Use a signal the daemon cannot handle, such as SIGKILL, because a daemon that shuts down cleanly may close the device properly and stop the watchdog.

Enable the software watchdog logic for a service

  • The service unit file must include the WatchdogSec= setting. systemd passes the configured interval to the service in the WATCHDOG_USEC environment variable:
    [Unit]
    Description=My Little Daemon
    Documentation=man:mylittled(8)
    
    [Service]
    ExecStart=/usr/bin/mylittled
    WatchdogSec=30s
    Restart=on-failure
    StartLimitInterval=5min
    StartLimitBurst=4
    StartLimitAction=reboot-force
    
  • The service should then call sd_notify(0, "WATCHDOG=1") every half of that interval.
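
The service side of this can be sketched in C without linking against libsystemd. In a real service, sd_notify() from libsystemd does this work for you; the code below is a hand-rolled illustration of the underlying notification protocol (one datagram sent to the AF_UNIX socket named in $NOTIFY_SOCKET), and the names notify_systemd and ping_interval_usec are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Sends one state string (such as "WATCHDOG=1") to the socket named by
   $NOTIFY_SOCKET. Returns 1 if sent, 0 if NOTIFY_SOCKET is unset,
   -1 on error. */
int notify_systemd(const char *state)
{
   const char *path = getenv("NOTIFY_SOCKET");
   struct sockaddr_un addr;
   size_t len;
   ssize_t sent;
   int fd;

   if (path == NULL || path[0] == '\0')
      return 0;
   len = strlen(path);
   if (len >= sizeof(addr.sun_path))
      return -1;

   memset(&addr, 0, sizeof(addr));
   addr.sun_family = AF_UNIX;
   memcpy(addr.sun_path, path, len);
   if (addr.sun_path[0] == '@')
      addr.sun_path[0] = '\0';   /* '@' marks an abstract socket name */

   fd = socket(AF_UNIX, SOCK_DGRAM, 0);
   if (fd < 0)
      return -1;
   sent = sendto(fd, state, strlen(state), 0, (struct sockaddr *)&addr,
                 (socklen_t)(offsetof(struct sockaddr_un, sun_path) + len));
   close(fd);
   return (sent == (ssize_t)strlen(state)) ? 1 : -1;
}

/* Returns half of $WATCHDOG_USEC in microseconds, or 0 if the variable
   is unset or not a number (no service watchdog configured). */
unsigned long long ping_interval_usec(void)
{
   const char *s = getenv("WATCHDOG_USEC");
   char *end;
   unsigned long long usec;

   if (s == NULL || *s == '\0')
      return 0;
   usec = strtoull(s, &end, 10);
   if (*end != '\0')
      return 0;
   return usec / 2;
}
```

The service's main loop would call notify_systemd("WATCHDOG=1") once per ping_interval_usec() microseconds, and only while its own work is making progress, so that a hung service stops pinging and gets restarted.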
