Linux Watchdog
Linux Kernel Watchdog Explained
The Linux kernel watchdog monitors whether a system is still running and resets it if it hangs. The watchdog module is specific to the hardware or chip being used.
It is useful for systems that are mission critical and need the ability to reboot themselves without human intervention.
A misconfigured watchdog can cause problems such as:
- Endless reboot loops
- File corruption due to hard resets
- Unpredictable random reboots
To use a watchdog, you need two things:
- A watchdog driver
- A watchdog device node (/dev/watchdog). You can recreate it if it doesn't exist:
$ sudo mknod /dev/watchdog c 10 130
Watchdog Module
Watchdog functionality on the hardware side sets up a timer that times out after a predetermined period. The watchdog software then periodically refreshes the hardware timer. If the software stops refreshing, then after the predetermined period, the timer performs a hardware reset of the device.
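To make the timer behaviour concrete, here is a small software model of it. This is only an illustration: the struct and function names are invented for this sketch, real hardware counts on its own, and time is passed in explicitly (in seconds, assumed to only move forward) so the logic is easy to follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a hardware watchdog timer. */
struct wdt_model {
    unsigned timeout;    /* predetermined period, in seconds */
    unsigned last_kick;  /* time of the most recent refresh */
};

/* Arm the timer at time `now`. */
void wdt_start(struct wdt_model *w, unsigned timeout, unsigned now)
{
    assert(w != NULL);
    w->timeout = timeout;
    w->last_kick = now;
}

/* Refresh the timer: the countdown starts again from `now`. */
void wdt_kick(struct wdt_model *w, unsigned now)
{
    assert(w != NULL);
    w->last_kick = now;
}

/* True once a full period has passed without a refresh; this is the
   point at which real hardware would reset the machine. */
bool wdt_expired(const struct wdt_model *w, unsigned now)
{
    assert(w != NULL);
    return now - w->last_kick >= w->timeout;
}
```

With a 60 second period started at time 0, the model has not expired at 59 seconds and has expired at 60. A refresh at 50 seconds moves the deadline to 110 seconds.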
In order for a watchdog timer to be functional, the motherboard manufacturer has to use the chip’s watchdog functionality.
Also, you need the right watchdog kernel module to be loaded in your Linux system. Different chips use different modules.
For example:
- Intel chipsets might use the “iTCO_wdt” module
- HP hardware might use “hpwdt”
- IBM mainframes might use “vmwatchdog”
- Xen VM might use “xen_wdt”
Who refreshes the timer depends on whether /dev/watchdog is open:
- If no application opens the /dev/watchdog file, the kernel takes care of resetting the watchdog. The watchdog module is a timer; it does not appear as a dedicated kernel thread but is handled by the soft IRQ thread.
- If an application opens this device file, it becomes responsible for the watchdog and resets it by writing to the file. The application (typically a watchdog daemon) keeps writing to /dev/watchdog periodically. This is called “kicking” or “feeding” the watchdog.
If the system fails to kick or feed the watchdog, then after a while the system is hard reset by the hardware watchdog.
Watchdog Daemon
The watchdog daemon opens /dev/watchdog and keeps writing to it often enough to keep the kernel from resetting, at least once per minute. It can do this in one of two ways:
- Write any character to /dev/watchdog, except ‘V’, which is reserved for stopping the watchdog.
- Issue the WDIOC_KEEPALIVE ioctl.
After a minute of inactivity the watchdog hardware will cause the reset.
The daemon can also test system resources such as memory usage. If a test fails, the daemon stops refreshing the watchdog, and the watchdog then resets the machine.
The watchdog program is such a daemon:
- The watchdog program writes to /dev/watchdog every ten seconds.
- If the device is opened but not written to within a minute, the machine will reboot.
- This feature is available when the kernel is built with "software watchdog" support (standard in Debian kernels) or if the machine is equipped with a hardware watchdog. The kernel software watchdog's ability to reboot will depend on the state of the machine and interrupts.
- The watchdog tool itself runs several health checks and acts appropriately if the system is not in good shape.
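The idea of a health check that gates the refresh can be sketched as follows. This is not the watchdog daemon's actual code: the function names and the choice of the MemAvailable field from /proc/meminfo are assumptions made for illustration. The point is only that a failed check withholds the write, so the timer is left to expire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: extract the "MemAvailable: N kB" value from
   /proc/meminfo-style text. Returns -1 if the field is missing. */
long parse_mem_available_kb(const char *meminfo)
{
    const char *p;
    long kb;

    assert(meminfo != NULL);
    p = strstr(meminfo, "MemAvailable:");
    if (p == NULL || sscanf(p, "MemAvailable: %ld", &kb) != 1)
        return -1;
    return kb;
}

/* Hypothetical helper: refresh the watchdog on `fd` only when the
   memory check passes. Returns 1 if a refresh byte was written,
   0 if it was withheld, -1 if the write failed. */
int kick_if_healthy(int fd, long avail_kb, long min_kb)
{
    if (avail_kb < 0 || avail_kb < min_kb)
        return 0;  /* unhealthy: let the timer run out */
    /* Any byte other than 'V' counts as a refresh. */
    return write(fd, "\0", 1) == 1 ? 1 : -1;
}
```

A daemon loop would read /proc/meminfo into a buffer, call parse_mem_available_kb() on it, and pass the result to kick_if_healthy() together with a threshold of its choosing.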
Starting and Stopping Watchdog
The watchdog is automatically started once you open /dev/watchdog. After the watchdog daemon starts:
- It puts itself into the background and then tries all checks specified in its configuration file in turn.
- Between each two tests it will write to the kernel device to prevent a reset.
- After finishing all tests, watchdog goes to sleep for a configurable interval that defaults to 1 second, to make sure it triggers the device early enough.
- The kernel driver expects a write to the watchdog device every minute.
To stop the watchdog:
- Write the character V to /dev/watchdog, so the driver knows the stop is intentional and not an accidental close.
- Close the /dev/watchdog file.
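The refresh and clean-stop steps can be sketched as two small C functions. The names wd_feed and wd_stop are hypothetical, and the file descriptor is assumed to come from open("/dev/watchdog", O_WRONLY). Whether closing really stops the timer depends on the driver; a driver built with the "no way out" option keeps running regardless.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: refresh the watchdog by writing one byte.
   Any byte except 'V' will do. Returns 0 on success, -1 on failure. */
int wd_feed(int fd)
{
    assert(fd >= 0);
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}

/* Hypothetical helper: stop the watchdog on purpose. The 'V' byte
   tells the driver that the close which follows is intentional.
   Returns 0 on success, -1 if the write or the close failed. */
int wd_stop(int fd)
{
    assert(fd >= 0);
    if (write(fd, "V", 1) != 1) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A daemon would call wd_feed() from its main loop and wd_stop() once during an orderly shutdown.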
Test the Linux watchdog
Testing the hardware watchdog
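If you want to test whether the hardware watchdog is working, open the device from a root shell (the redirection is performed by the shell, so running only cat under sudo is not enough):
$ sudo sh -c 'cat >> /dev/watchdog'
Press “Enter” twice and wait. (The prompt will not come back.)
After a while (depending on your kernel’s settings), the system should perform a hard reboot, because the device file has been opened but nothing keeps writing to it.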
Example code to control the watchdog
- Set the watchdog timeout: use the WDIOC_SETTIMEOUT ioctl.
- Get the current watchdog timeout: use the WDIOC_GETTIMEOUT ioctl.
- Check whether the last boot was caused by the watchdog or by a power-on reset: use the WDIOC_GETBOOTSTATUS ioctl.
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>   /* ioctl() */
#include <fcntl.h>
#include <unistd.h>      /* write(), close() */
#include <getopt.h>
#include <string.h>
#include <errno.h>
#include <linux/watchdog.h>

#define WATCHDOGDEV "/dev/watchdog"

static const char *const short_options = "hd:i:";
static const struct option long_options[] = {
    {"help", 0, NULL, 'h'},
    {"dev", 1, NULL, 'd'},
    {"interval", 1, NULL, 'i'},
    {NULL, 0, NULL, 0},
};

static void print_usage(FILE *stream, char *app_name, int exit_code)
{
    fprintf(stream, "Usage: %s [options]\n", app_name);
    fprintf(stream,
            " -h --help Display this usage information.\n"
            " -d --dev <device_file> Use <device_file> as watchdog device file.\n"
            " The default device file is '/dev/watchdog'\n"
            " -i --interval <interval> Change the watchdog interval time\n");
    exit(exit_code);
}

int main(int argc, char **argv)
{
    int fd;             /* File descriptor for the watchdog device */
    int interval;       /* Watchdog timeout interval (in secs) */
    int bootstatus;     /* Watchdog last boot status */
    char *dev;          /* Watchdog device file */
    int next_option;    /* getopt iteration var */
    int kick_watchdog;  /* Command read from stdin (int so EOF fits) */

    /* Init variables */
    dev = WATCHDOGDEV;
    interval = 0;
    kick_watchdog = 0;

    /* Parse options if any */
    do {
        next_option = getopt_long(argc, argv, short_options, long_options, NULL);
        switch (next_option) {
        case 'h':
            print_usage(stdout, argv[0], EXIT_SUCCESS);  /* does not return */
            break;
        case 'd':
            dev = optarg;
            break;
        case 'i':
            interval = atoi(optarg);
            break;
        case '?':  /* Invalid options */
            print_usage(stderr, argv[0], EXIT_FAILURE);  /* does not return */
            break;
        case -1:   /* Done with options */
            break;
        default:   /* Unexpected stuffs */
            abort();
        }
    } while (next_option != -1);

    /* Once the watchdog device file is open, the watchdog will be
       activated by the driver */
    fd = open(dev, O_RDWR);
    if (-1 == fd) {
        fprintf(stderr, "Error: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }

    /* If user wants to change the watchdog interval */
    if (interval != 0) {
        fprintf(stdout, "Set watchdog interval to %d\n", interval);
        if (ioctl(fd, WDIOC_SETTIMEOUT, &interval) != 0) {
            fprintf(stderr, "Error: Set watchdog interval failed\n");
            exit(EXIT_FAILURE);
        }
    }

    /* Display current watchdog interval */
    if (ioctl(fd, WDIOC_GETTIMEOUT, &interval) == 0) {
        fprintf(stdout, "Current watchdog interval is %d\n", interval);
    } else {
        fprintf(stderr, "Error: Cannot read watchdog interval\n");
        exit(EXIT_FAILURE);
    }

    /* Check if last boot is caused by watchdog */
    if (ioctl(fd, WDIOC_GETBOOTSTATUS, &bootstatus) == 0) {
        fprintf(stdout, "Last boot is caused by : %s\n",
                (bootstatus != 0) ? "Watchdog" : "Power-On-Reset");
    } else {
        fprintf(stderr, "Error: Cannot read watchdog status\n");
        exit(EXIT_FAILURE);
    }

    /* There are two ways to kick the watchdog:
       - by writing any dummy value into watchdog device file, or
       - by using IOCTL WDIOC_KEEPALIVE */
    fprintf(stdout,
            "Use:\n"
            " <w> to kick through writing over device file\n"
            " <i> to kick through IOCTL\n"
            " <x> to exit the program\n");
    do {
        kick_watchdog = getchar();
        switch (kick_watchdog) {
        case 'w':
            if (write(fd, "w", 1) != 1)
                fprintf(stderr, "Error: %s\n", strerror(errno));
            fprintf(stdout, "Kick watchdog through writing over device file\n");
            break;
        case 'i':
            ioctl(fd, WDIOC_KEEPALIVE, NULL);
            fprintf(stdout, "Kick watchdog through IOCTL\n");
            break;
        case EOF:  /* stdin closed: treat it like 'x' */
            kick_watchdog = 'x';
            /* fall through */
        case 'x':
            fprintf(stdout, "Goodbye !\n");
            break;
        case '\n':  /* ignore the newline that follows each command */
            break;
        default:
            fprintf(stdout, "Unknown command\n");
            break;
        }
    } while (kick_watchdog != 'x');

    /* The 'V' value needs to be written into watchdog device file to
       indicate that we intend to close/stop the watchdog. Otherwise,
       debug message 'Watchdog timer closed unexpectedly' will be printed */
    if (write(fd, "V", 1) != 1)
        fprintf(stderr, "Error: %s\n", strerror(errno));

    /* Closing the watchdog device will deactivate the watchdog. */
    close(fd);
    return EXIT_SUCCESS;
}
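The value returned by WDIOC_GETBOOTSTATUS is a bit mask of WDIOF_* flags from <linux/watchdog.h>, not a plain yes or no, and which flags a driver reports varies. As a rough sketch, a finer-grained decoder than the "!= 0" test could look like the following. The function names and the returned strings are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Hypothetical helper: turn a boot-status bit mask into a short label. */
const char *boot_cause(int flags)
{
    if (flags & WDIOF_CARDRESET)
        return "watchdog reset";
    if (flags & WDIOF_OVERHEAT)
        return "thermal (overheat) reset";
    if (flags & (WDIOF_POWERUNDER | WDIOF_POWEROVER))
        return "power fault";
    if (flags == 0)
        return "no watchdog event recorded";
    return "other status flag set";
}

/* Hypothetical helper: ask the driver behind `fd` for its boot status.
   Returns NULL if the ioctl fails or is not supported. */
const char *query_boot_cause(int fd)
{
    int flags = 0;

    assert(fd >= 0);
    if (ioctl(fd, WDIOC_GETBOOTSTATUS, &flags) != 0)
        return NULL;
    return boot_cause(flags);
}
```

In the program above, query_boot_cause(fd) could replace the bootstatus test once the device is open. The same caveat applies: opening the device arms the watchdog.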
Setup Watchdog
Systemd watchdog service
Systemd provides full support for hardware watchdogs (as exposed in /dev/watchdog to userspace), as well as supervisor (software) watchdog support. To make use of the hardware watchdog, it is sufficient to set the RuntimeWatchdogSec= option in /etc/systemd/system.conf.
#RuntimeWatchdogSec=0
It defaults to 0 (i.e. no hardware watchdog use). Set it to a value like 20s and the watchdog is enabled.
Note that systemd will send a ping to the hardware at half the specified interval, i.e. every 10s.
Note that the hardware watchdog device (/dev/watchdog) is single-user only. That means that you can either enable this functionality in systemd, or use a separate external watchdog daemon.
Enable Watchdog
The watchdog service is disabled by default. You can use the commands below to enable and start it.
$ sudo systemctl enable watchdog.service
$ sudo systemctl start watchdog.service
Check Watchdog Service
$ systemctl status watchdog.service
Setup Watchdog Timeout
The default timeout is 15 seconds. To change it, edit the watchdog-timeout option in /etc/watchdog.conf, then restart the service:
$ sudo systemctl restart watchdog.service
Test the watchdog
- Trigger a kernel crash. This crashes the system without first unmounting file systems or syncing the disks attached to the system.
$ sudo -i
# echo c > /proc/sysrq-trigger
Enable the software watchdog logic for a service
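- The service unit file must set WatchdogSec=. systemd passes the configured interval to the service in the WATCHDOG_USEC environment variable (in microseconds), and the service has to send a "WATCHDOG=1" keep-alive notification within each interval: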
[Unit]
Description=My Little Daemon
Documentation=man:mylittled(8)

[Service]
ExecStart=/usr/bin/mylittled
WatchdogSec=30s
Restart=on-failure
StartLimitInterval=5min
StartLimitBurst=4
StartLimitAction=reboot-force
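On the service side, the daemon has to tell systemd regularly that it is alive. Most programs use sd_notify() from libsystemd for this. The sketch below instead speaks the underlying notification protocol directly, so that it needs no extra library: it sends the datagram "WATCHDOG=1" to the socket named in NOTIFY_SOCKET. The function names are hypothetical, and pinging at half of WATCHDOG_USEC follows the usual recommendation rather than any requirement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: half of WATCHDOG_USEC, in microseconds.
   Returns 0 if the variable is unset or not a plain number. */
unsigned long long ping_interval_usec(void)
{
    const char *s = getenv("WATCHDOG_USEC");
    char *end;
    unsigned long long usec;

    if (s == NULL || *s == '\0')
        return 0;
    usec = strtoull(s, &end, 10);
    if (*end != '\0')
        return 0;
    return usec / 2;
}

/* Hypothetical helper: send one message (for example "WATCHDOG=1")
   to the socket in NOTIFY_SOCKET. Returns 0 on success, -1 otherwise. */
int notify_systemd(const char *msg)
{
    const char *path = getenv("NOTIFY_SOCKET");
    struct sockaddr_un addr;
    size_t plen;
    ssize_t n;
    int fd;

    assert(msg != NULL);
    if (path == NULL || (path[0] != '/' && path[0] != '@'))
        return -1;
    plen = strlen(path);
    if (plen >= sizeof(addr.sun_path))
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, path, plen);
    if (path[0] == '@')
        addr.sun_path[0] = '\0';  /* '@' marks an abstract socket */

    fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    n = sendto(fd, msg, strlen(msg), 0, (struct sockaddr *)&addr,
               (socklen_t)(offsetof(struct sockaddr_un, sun_path) + plen));
    close(fd);
    return n == (ssize_t)strlen(msg) ? 0 : -1;
}
```

With WatchdogSec=30s, WATCHDOG_USEC is 30000000, so the main loop of mylittled would call notify_systemd("WATCHDOG=1") roughly every 15 seconds, ideally only after its own internal checks have passed.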