Kernel Debugging


Debug Hacks


Basics


#4 Core Dump of a Process


The default action of certain signals is to cause a process to terminate and produce a core dump file, a disk file containing an image of the process's memory at the time of termination.
This image can be used in a debugger (e.g., gdb(1)) to inspect the state of the program at the time that it terminated.

A typical shell environment disables core file generation:

$ ulimit -c
0
-c: The maximum size of core files created.

To allow core files to be generated:

$ ulimit -c unlimited
You can see a process’s limits by running cat /proc/PID/limits.
Use the following code for testing:

#include <string.h>

int main(){
 char *ptr=NULL;

 *ptr=0;

}
Compile and run it to generate a core file:

$ gcc -g seg.c -o ./test

$ ./test
Segmentation fault (core dumped)

$ file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from './test', real uid: 1000, effective uid: 1000, real gid: 1000, effective gid: 1000, execfn: './test', platform: 'x86_64'

$ gdb -c core ./test
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
...
Reading symbols from ./test...(no debugging symbols found)...done.
[New LWP 7209]
Core was generated by `./test'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055a3a2a8060a in main () at seg.c:6
6  *ptr=0;
(gdb) 
Note that the program must be compiled with -g so that gdb has the debug information it needs to map the crash address to a source line.
The gdb list command (l 6) shows the source code around the crash line:

(gdb) l 6
1 #include <string.h>
2 
3 int main(){
4  char *ptr=NULL;
5 
6  *ptr=0;
7  
8 }


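By default, the core file is created in the current working directory. For large software, however, it is hard to tell which working directory a program was running in, so it is better to have a dedicated directory for core files; this also makes it possible to control how much space they take.
Linux supports an alternate syntax for the /proc/sys/kernel/core_pattern file. If the first character of this file is a pipe symbol (|), then the remainder of the line is interpreted as a program to be executed.
Ubuntu desktop ships with Apport preinstalled. Apport is an error-collection system: when an application crashes or hits a bug, it warns the user with a pop-up and asks whether to submit a crash report.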
Apport uses /proc/sys/kernel/core_pattern to directly pipe the core dump into apport:

|/usr/share/apport/apport %p %s %c %d %P

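  • %p: PID of the dumped process, as seen in the PID namespace in which the process resides
  • %s: number of the signal causing the dump
  • %c: core file size soft resource limit of the crashing process
  • %d: dump mode (the same value that prctl(2) PR_GET_DUMPABLE returns)
  • %P: PID of the dumped process, as seen in the initial PID namespace
  • %e: program name
  • %h: hostname
  • %t: time of the dump, in seconds since the Epoch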
Apport logs to /var/log/apport.log. By default, it ignores crashes from binaries that aren’t part of an Ubuntu package. I didn’t feel like trying to convince Apport to give me my core dumps, so I ended up overriding Apport and setting kernel.core_pattern directly:

sysctl -w kernel.core_pattern=/tmp/core-%e.%p.%h.%t

The core dump file generated: /tmp/core-test.9950.jerry-Latitude-E6410.1577308679

The Linux-specific /proc/[pid]/coredump_filter file can be used to control which memory segments are written to the core dump file. This file is provided only if the kernel was built with the CONFIG_ELF_CORE configuration option. The value in the file is a bit mask of memory mapping types (see mmap(2)):
  • bit 0 Dump anonymous private mappings.
  • bit 1 Dump anonymous shared mappings.
  • bit 2 Dump file-backed private mappings.
  • bit 3 Dump file-backed shared mappings.
  • bit 4 Dump ELF headers.
  • bit 5 Dump private huge pages.
  • bit 6 Dump shared huge pages.
  • bit 7 Dump private DAX pages.
  • bit 8 Dump shared DAX pages.
By default, the following bits are set: 0, 1, 4, 5 (binary 110011, 0x33). If you do not want large shared memory mappings to be dumped, change the mask. For example, to dump only anonymous private mappings:

echo 1 > /proc/[pid]/coredump_filter

#5 GDB Basic 1: Trace

  • Build the program with the debug option -g
  • gcc -Wall -O2 -g: the -g option produces debugging information in the operating system's native format (stabs, COFF, XCOFF, or DWARF). -Werror treats warnings as errors.
  • Start gdb
  • 
    $ gdb program
    
    
  • Set breakpoint
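  • When you run a program under gdb, it executes until it hits a breakpoint and stops. At each breakpoint you can display the data you want to inspect, set other breakpoints, or continue execution.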
    
    break position
    
    where position can be one of the following:
    • function name
    • line number
    • filename:function_name
    • filename:line_number
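    • +offset
    • offset lines after the line where execution is currently stopped
    • -offset
    • offset lines before the line where execution is currently stopped
    • *address
    If position is omitted, the breakpoint is set at the next instruction to be executed.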
    
    (gdb) b main
    Breakpoint 1 at 0x5fe: file seg.c, line 4.
    (gdb) b 4
    Note: breakpoint 1 also set at pc 0x5fe.
    Breakpoint 2 at 0x5fe: file seg.c, line 4.
    
    
    Show the breakpoints that are currently set:
    
    (gdb) info break
    
  • Delete breakpoint
  • It is often necessary to eliminate a breakpoint, watchpoint, or catchpoint once it has done its job and you no longer want your program to stop there. This is called deleting the breakpoint.
    • clear
    • Delete any breakpoints at the next instruction to be executed in the selected stack frame (see section Selecting a frame). When the innermost frame is selected, this is a good way to delete a breakpoint where your program just stopped.
    • clear function
    • clear filename:function
    • Delete any breakpoints set at entry to the function function.
    • clear linenum
    • clear filename:linenum
    • Delete any breakpoints set at or within the code of the specified line.
    • delete [breakpoints] [range...]
    • Delete the breakpoints, watchpoints, or catchpoints of the breakpoint ranges specified as arguments. If no argument is specified, delete all breakpoints (GDB asks confirmation, unless you have set confirm off). You can abbreviate this command as d.
  • run
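  • run starts the program and pauses when it reaches a breakpoint
    
    run program_parameters
    
    Setting a breakpoint at main and running to it is common practice; the start command does both in one step. There are two ways to execute a single line of source code:
    • next
    • A function call counts as one line (execution does not step into the function)
    • step
    • Execution steps into the function when a function call is encountered
    To execute a single assembly instruction, use nexti or stepi. continue resumes execution from the current position until the next breakpoint is hit. You can specify how many more times to ignore the breakpoint at the current location:
    
    continue count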
    
  • Set breakpoint with condition
  • 
    break position if cond
    
    Evaluate the expression cond each time the breakpoint is reached, and stop only if the value is nonzero -- that is, if cond evaluates as true.
  • Backtrace: Displays the call trace for the currently selected thread.
  • When your program has stopped, the first thing you need to know is where it stopped and how it got there. Each time your program performs a function call, information about the call is generated. That information includes:
    • the location of the call in your program
    • the arguments of the call
    • the local variables of the function being called
    The information is saved in a block of data called a stack frame. The stack frames are allocated in a region of memory called the call stack. When your program is started, the stack has only one frame, that of the function main. This is called the initial frame or the outermost frame. Each time a function is called, a new frame is made. Each time a function returns, the frame for that function invocation is eliminated.
    
    A backtrace is a summary of how your program got where it is. Each line in the backtrace shows the frame number and the function name. The backtrace also shows the source file name and line number, as well as the arguments to the function. For example:
    
    #include <stdio.h>
    
    void call_2()
    {
        printf("hello 2\n");
    }
    
    void call_1()
    {
        printf("hello 1\n");
        call_2();
    }
    
    int main()
    {
        printf("hello\n");
        call_1();
    
        return 0;
    }
    
    
    Set the breakpoint in call_2() then backtrace:
    
    (gdb) b call_2
    Breakpoint 1 at 0x63e: file stack.c, line 5.
    (gdb) r
    Starting program: /home/jerry/test/a.out 
    hello
    hello 1
    
    Breakpoint 1, call_2 () at stack.c:5
    5     printf("hello 2\n");
    (gdb) bt
    #0  call_2 () at stack.c:5
    #1  0x0000555555554667 in call_1 () at stack.c:11
    #2  0x0000555555554684 in main () at stack.c:17
    
    
  • Set variables
  • 
    set variable variable=expression
    
  • print: Examining Data
  • print evaluates and prints the value of an expression of the language your program is written in.
    
    print expr
    
    
    info registers prints the names and values of all registers except floating-point registers (in the selected stack frame). To print a register:
    
    (gdb) p $registerName
    
    You can specify the display format with p/format. Formats:
    • x
    • hex
    • d
    • decimal
    • c
    • character
    • s
    • string
    • t
    • binary
    • a
    • address
    You can use the command x (for "examine") to examine memory:
    
    (gdb) x $pc
    0x55555555466e <main+4>: 0xaf3d8d48
    
    
    Besides, you can disassemble the content of the address
    
    (gdb) x/i $pc
    => 0x55555555466e <main+4>: lea    0xaf(%rip),%rdi        # 0x555555554724
    
    In general, we want to dump and interpret a range of memory:
    
     x/NFU addr
    
    • N: the repeat count, that is, how many units to display
    • F: the format of the displayed result: one of the formats used by print, `s' (null-terminated string), or `i' (machine instruction). The default is `x' (hexadecimal) initially.
    • U: the unit size: b (bytes), h (halfwords, two bytes), w (words, four bytes; this is the initial default), or g (giant words, eight bytes).
    For example:
    
    (gdb) x/5i $pc
    => 0x55555555466e <main+4>: lea    0xaf(%rip),%rdi        # 0x555555554724
    
    You can use gdb to disassemble a specified function or a fragment of memory:
    
    disassemble
    disassemble [Function]
    disassemble [Address]
    disassemble [Start],[End]
    disassemble [Function],+[Length]
    disassemble [Address],+[Length]
    disassemble /m [...]
    disassemble /r [...]
    
  • watchpoint
  • You can use a watchpoint to stop execution whenever the value of an expression changes, without having to predict a particular place where this may happen.
    • Set a watchpoint for an expression. GDB will break when expr is written into by the program and its value changes.
    • 
      watch expr
      
    • Set a watchpoint that will break when the value of expr is read by the program.
    • 
      rwatch expr
      
    • Set a watchpoint that will break when expr is either read or written into by the program.
    • 
      awatch expr
      
    • This command prints a list of watchpoints, breakpoints, and catchpoints; it is the same as info break.
    • 
      info watchpoints
      
  • Generate core file
  • Generate a core file for the program currently being debugged:
    
    generate-core-file
    
    Generate a core file for a running process:
    
    $ gcore PID
    

#6 GDB Basic 2

  • Debugging an already-running process
  • 
    attach process-id
    
    This command attaches to a running process, one that was started outside GDB. (info files shows your active targets.) The first thing GDB does after arranging to debug the specified process is to stop it. You can examine and modify an attached process with all the GDB commands that are ordinarily available when you start processes with run.
    
    When you have finished debugging the attached process, you can use the detach command to release it from GDB control. After the detach command, that process and GDB become completely independent once more. If you exit GDB while you have an attached process, you detach that process; if you use the run command, you kill that process.
    
    Attaching lets you use backtrace to find out why a process is waiting or stuck in an infinite loop.
  • Break conditions
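  • A condition is just a Boolean expression in your programming language. For example:
    
     if ( node == 0 ) 
    
    Stop at the breakpoint only when the condition holds:
    
      break position if condition
    
    Add a condition to an existing breakpoint, identified by its number:
    
      condition #break condition
    
    Remove the condition from a breakpoint:
    
      condition #break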
    
  • Breakpoint command lists
  • You can give any breakpoint (or watchpoint or catchpoint) a series of commands to execute when your program stops due to that breakpoint. Specify a list of commands for breakpoint number bnum:
    
      commands #break
      ... command-list ...
      end
    

#8 Intel Architecture Basic

#9 Stack

The primary purpose of a call stack is to store return addresses. When a subroutine is called, the location (address) of the instruction at which the calling routine can later resume needs to be saved somewhere.

A call stack is composed of stack frames. These are machine-dependent and ABI-dependent data structures containing subroutine state information. Each stack frame corresponds to a call to a subroutine which has not yet terminated with a return. For example, if a subroutine named DrawLine() is currently running, having been called by a subroutine DrawSquare(), the top part of the call stack holds the frame for DrawLine(), followed by the frame for DrawSquare().

The stack frame at the top of the stack is for the currently executing routine. The stack frame usually includes at least the following items (in push order):
  • the arguments (parameter values) passed to the routine (if any)
  • the return address back to the routine's caller (e.g. in the DrawLine() stack frame, an address into DrawSquare()'s code)
  • space for the local variables of the routine (if any).

Chapter 3 Prepare for Kernel Debug

Decode Oops Messages

Bug hunting: Kernel bug reports often come with a stack dump; depending on the severity of the issue, they may also contain the word Oops. The following file, oopsdemo.c, is used to generate an Oops:

#include <linux/init.h>
#include <linux/module.h>
MODULE_LICENSE("Dual BSD/GPL");

static int init_oopsdemo(void) {
  *((int *) 0x00 ) = 0x123456;
  return 0;
}

static void cleanup_oopsdemo(void){

}

module_init(init_oopsdemo);
module_exit(cleanup_oopsdemo);

Makefile:

# If KERNELRELEASE is defined, we've been invoked in the kernel build system
ifneq ($(KERNELRELEASE),)
obj-m := oopsdemo.o
# Otherwise we were called directly from the command line
else
KERNELDIR ?= /lib/modules/$(shell uname -r)/build
PWD  := $(shell pwd)
default:
        $(MAKE) -C $(KERNELDIR) M=$(PWD) modules
endif

Load the oopsdemo.ko module:


[  618.291854] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  618.291864] #PF error: [WRITE]
[  618.291866] PGD 0 P4D 0 
[  618.291877] Oops: 0002 [#1] SMP PTI

[  618.291884] CPU: 0 PID: 7275 Comm: insmod Tainted: G           OE     5.0.0-23-generic #24~18.04.1-Ubuntu
[  618.291886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  618.291897] RIP: 0010:init_oopsdemo+0x8/0x20 [oopsdemo]
[  618.291903] Code: Bad RIP value.
[  618.291904] RSP: 0018:ffffafad41b03c70 EFLAGS: 00010246
[  618.291906] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  618.291908] RDX: 000000000000f1f9 RSI: 00000000006000c0 RDI: ffffffffc0529000
[  618.291909] RBP: ffffafad41b03ce8 R08: ffff95dbfda27080 R09: ffff95dbfd401900
[  618.291910] R10: ffffe40e01e88580 R11: ffff95dbfffae000 R12: ffffffffc0529000
[  618.291911] R13: ffff95dbb560eea0 R14: ffffafad41b03e78 R15: ffffffffc052b000
[  618.291913] FS:  00007f889cc40540(0000) GS:ffff95dbfda00000(0000) knlGS:0000000000000000
[  618.291914] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  618.291916] CR2: ffffffffc0528fde CR3: 0000000078fd6000 CR4: 00000000000006f0
[  618.291920] Call Trace:
[  618.291946]  ? do_one_initcall+0x4a/0x1c9
[  618.291960]  ? _cond_resched+0x19/0x40
[  618.291966]  ? kmem_cache_alloc_trace+0x42/0x1c0
[  618.291971]  do_init_module+0x5f/0x216
[  618.291974]  load_module+0x19f6/0x20a0
[  618.291978]  __do_sys_finit_module+0xfc/0x120
[  618.291979]  ? __do_sys_finit_module+0xfc/0x120
[  618.291982]  __x64_sys_finit_module+0x1a/0x20
[  618.291984]  do_syscall_64+0x5a/0x120
[  618.291987]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  618.291990] RIP: 0033:0x7f889c756839
[  618.291993] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48
[  618.291995] RSP: 002b:00007fff7a2e7ad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  618.291997] RAX: ffffffffffffffda RBX: 00005556adb9d780 RCX: 00007f889c756839
[  618.291998] RDX: 0000000000000000 RSI: 00005556ad454d2e RDI: 0000000000000003
[  618.292000] RBP: 00005556ad454d2e R08: 0000000000000000 R09: 00007f889ca29000
[  618.292001] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
[  618.292002] R13: 00005556adb9ff70 R14: 0000000000000000 R15: 0000000000000000
[  618.292004] Modules linked in: oopsdemo(OE+) snd_hda_codec_generic ledtrig_audio snd_hda_intel crct10dif_pclmul snd_hda_codec crc32_pclmul ghash_clmulni_intel snd_hda_core joydev aesni_intel snd_hwdep aes_x86_64 snd_pcm qxl snd_seq_midi crypto_simd snd_seq_midi_event ttm snd_rawmidi cryptd snd_seq glue_helper drm_kms_helper snd_seq_device snd_timer snd drm fb_sys_fops soundcore syscopyarea sysfillrect sysimgblt input_leds serio_raw qemu_fw_cfg mac_hid sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid hid psmouse virtio_blk virtio_net net_failover failover i2c_piix4 pata_acpi floppy
[  618.292057] CR2: 0000000000000000
[  618.292064] ---[ end trace dfd9153d646f12aa ]---
[  618.292068] RIP: 0010:init_oopsdemo+0x8/0x20 [oopsdemo]
[  618.292072] Code: Bad RIP value.
[  618.292074] RSP: 0018:ffffafad41b03c70 EFLAGS: 00010246
[  618.292075] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  618.292082] RDX: 000000000000f1f9 RSI: 00000000006000c0 RDI: ffffffffc0529000
[  618.292084] RBP: ffffafad41b03ce8 R08: ffff95dbfda27080 R09: ffff95dbfd401900
[  618.292085] R10: ffffe40e01e88580 R11: ffff95dbfffae000 R12: ffffffffc0529000
[  618.292087] R13: ffff95dbb560eea0 R14: ffffafad41b03e78 R15: ffffffffc052b000
[  618.292088] FS:  00007f889cc40540(0000) GS:ffff95dbfda00000(0000) knlGS:0000000000000000
[  618.292090] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  618.292091] CR2: ffffffffc0528fde CR3: 0000000078fd6000 CR4: 00000000000006f0

  • Oops: error code [#n]
  • This is the error code value in hex. Each bit has a significance of its own:
    • bit 0
    • 0 means no page found, 1 means a protection fault
    • bit 1
    • 0 means read, 1 means write
    • bit 2
    • 0 means kernel, 1 means user-mode
    [#n] is the number of times the Oops occurred. In the log above, 0002 [#1] therefore means a write in kernel mode to an address for which no page was found, and it is the first Oops.
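  • CPU:
  • the CPU on which the error occurred
  • PID:
  • the ID of the process that was running when the error occurred
  • Comm:
  • the command name of that process (insmod in the log above)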
  • Tainted:
  • The Tainted flag is defined in kernel/panic.c:
    • P
    • Proprietary module has been loaded.
    • F
    • Module has been forcibly loaded.
    • S
    • SMP with a CPU not designed for SMP.
    • R
    • User forced a module unload.
    • M
    • System experienced a machine check exception.
    • B
    • System has hit bad_page.
    • U
    • Userspace-defined naughtiness.
    • A
    • ACPI table overridden.
    • W
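    • Taint on warning.
    The log above shows G, O, and E, which are not in this list: G means that no proprietary module has been loaded, O that an out-of-tree module has been loaded, and E that an unsigned module has been loaded.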
  • Call Trace
  • Oops reports end with a stack dump and (possibly lengthy) stack backtrace showing what caused the kernel to reach the point it is at. In this case, a kernel stack backtrace alongside register contents and other pertinent information is printed to the system console and also recorded to the system logs. Such stack traces provide enough information to identify the line inside the Kernel’s source code where the bug happened.
Normally the Oops text is read from the kernel buffers by klogd and handed to syslogd which writes it to a syslog file, typically /var/log/messages (depends on /etc/syslog.conf). On systems with systemd, it may also be stored by the journald daemon, and accessed by running journalctl command. If klogd dies, you can run:

 dmesg > file
or,

cat /proc/kmsg > file
If the machine has crashed so badly that you cannot enter commands or the disk is not available, you need to get the Oops by:
  • Hand copy the text from the screen/terminal
  • Use Kdump
To find the bug’s location:
  • objdump
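  • To debug a kernel, use objdump and look for the hex offset from the crash output to find the corresponding line of code or assembly. In the Oops above, RIP is init_oopsdemo+0x8/0x20, that is, offset 0x8 into a function that is 0x20 bytes long; in the disassembly below, offset 8 is the movl instruction that writes 0x123456 to address 0x0.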
    
    $ objdump -r -S -l --disassemble oopsdemo.o
    
    oopsdemo.o:     file format elf64-x86-64
    
    
    Disassembly of section .text:
    
    0000000000000000 <init_oopsdemo>:
    init_oopsdemo():
       0: e8 00 00 00 00        callq  5 <init_oopsdemo+0x5>
       1: R_X86_64_PC32 __fentry__-0x4
       5: 55                    push   %rbp
       6: 31 c0                 xor    %eax,%eax
       8: c7 04 25 00 00 00 00  movl   $0x123456,0x0
       f: 56 34 12 00 
      13: 48 89 e5              mov    %rsp,%rbp
      16: 5d                    pop    %rbp
      17: c3                    retq   
      18: 0f 1f 84 00 00 00 00  nopl   0x0(%rax,%rax,1)
      1f: 00 
    
    0000000000000020 <cleanup_oopsdemo>:
    cleanup_oopsdemo():
      20: e8 00 00 00 00        callq  25 <cleanup_oopsdemo+0x5>
       21: R_X86_64_PC32 __fentry__-0x4
      25: 55                    push   %rbp
      26: 48 89 e5              mov    %rsp,%rbp
      29: 5d                    pop    %rbp
      2a: c3                    retq  
    
  • gdb
  • Using gdb is usually easier, but the kernel must be compiled with debug info by enabling the kernel option CONFIG_DEBUG_INFO. To check whether the running kernel was compiled with CONFIG_DEBUG_INFO:
    
    $ grep CONFIG_DEBUG_INFO /boot/config-`uname -r`
    CONFIG_DEBUG_INFO=y
    
    
    On a kernel compiled with CONFIG_DEBUG_INFO, you can simply copy the EIP value (RIP on x86-64) from the Oops, then use GDB to translate it to human-readable form:
    
    $ gdb vmlinux
    (gdb) l *0xc021e50e
    
    

#18 Linux Magic System Request Key

It is a ‘magical’ key combination that the kernel responds to regardless of whatever else it is doing, unless it is completely locked up. To enable the magic SysRq key, build the kernel with:

  CONFIG_MAGIC_SYSRQ=y
/proc/sys/kernel/sysrq controls the functions allowed to be invoked via the SysRq key:
  • 0
  • disable sysrq completely
  • 1
  • enable all functions of sysrq
  • bitmask value
  • 
      2 =   0x2 - enable control of console logging level
      4 =   0x4 - enable control of keyboard (SAK, unraw)
      8 =   0x8 - enable debugging dumps of processes etc.
     16 =  0x10 - enable sync command
     32 =  0x20 - enable remount read-only
     64 =  0x40 - enable signalling of processes (term, kill, oom-kill)
    128 =  0x80 - allow reboot/poweroff
    256 = 0x100 - allow nicing of all RT tasks
    
To use the magic SysRq key, write a command character to /proc/sysrq-trigger. e.g.:

echo m > /proc/sysrq-trigger
To enable all functions:

$ sudo -i
# echo 1 > /proc/sys/kernel/sysrq
# echo m > /proc/sysrq-trigger
# dmesg
...
[ 8575.330056] sysrq: SysRq : Show Memory
[ 8575.330067] Mem-Info:
...

The following command keys are available for Alt+SysRq+commandkey:
‘k’ – Kill all processes running on the current virtual console.
‘s’ – Attempt to sync all mounted file systems.
‘b’ – Immediately reboot the system, without unmounting partitions or syncing.
‘c’ – Generate a crash dump.
‘e’ – Send the SIGTERM signal to all processes except init.
‘m’ – Output current memory information to the console.
‘i’ – Send the SIGKILL signal to all processes except init.
‘r’ – Switch the keyboard from raw mode (the mode used by programs such as X11) to XLATE mode.
‘t’ – Output a list of current tasks and their information to the console.
‘u’ – Remount all mounted file systems in read-only mode.
‘o’ – Shut down the system immediately.
‘p’ – Print the current registers and flags to the console.
‘0-9’ – Set the console log level, controlling which kernel messages are printed to the console.
‘f’ – Call oom_kill to kill a process that is using a lot of memory.
‘h’ – Display help. Any key other than those listed above also prints help.

Use Kdump to get kernel crash dump

Kdump is a kernel crash dumping mechanism that allows you to save the contents of the system’s memory for later analysis. It relies on kexec, which can be used to boot a Linux kernel from the context of another kernel, bypassing the BIOS and preserving the contents of the first kernel’s memory that would otherwise be lost.

The kernel crash dump utility is installed with the following command:

sudo apt install linux-crashdump

The kdump mechanism will be enabled during the installation. Kdump utilizes two kernels:
  • System kernel
  • A normal kernel that is booted with special kdump-specific flags. We need to tell the system kernel to reserve some amount of physical memory where the dump capture kernel will be loaded.
  • Dump capture kernel
  • Once a kernel crash happens, the kernel crash handler uses the kexec mechanism to boot the dump capture kernel. Note that the memory of the system kernel is left untouched and is accessible from the dump capture kernel as it was at the moment of the crash. Once the dump capture kernel is booted, the user can use the file /proc/vmcore to access the memory of the crashed system kernel.
In real production environments, the system kernel and the dump capture kernel are different: the system kernel needs many features and is compiled with many kernel options and drivers, while the dump capture kernel aims to be minimal and to take as little memory as possible.
  1. Compiling the dump capture kernel
  2. To build the kernel, edit the kernel config (or config.x86_64) file and enable the following configuration options:
    • CONFIG_DEBUG_INFO=y
    • CONFIG_CRASH_DUMP=y
    • CONFIG_PROC_VMCORE=y
    • Then check the config of the built kernel, e.g. /boot/config-5.0.0-23-generic.
  3. Setup kdump kernel
  4. To reserve memory for the dump capture kernel, edit your bootloader configuration and add the
    
      crashkernel=64M
    
    boot option to the system kernel you just installed. Taking GRUB as an example: on an installed system, GRUB loads the /boot/grub/grub.cfg configuration file at each boot. The grub.cfg file is generated, and the generation process can be influenced by a variety of options in /etc/default/grub and scripts in /etc/grub.d/:
    • /etc/default/grub
    • 
      GRUB_TIMEOUT_STYLE=menu
      GRUB_CMDLINE_LINUX="crashkernel=128M"
      
      
    • /etc/grub.d/
    Remember to always regenerate the main configuration file by running 'update-grub' after making changes to /etc/default/grub and/or files in /etc/grub.d/. After rebooting, make sure "crashkernel=128M" is present on the kernel command line:
    
    $ cat /proc/cmdline
    BOOT_IMAGE=/boot/vmlinuz-5.0.0-37-generic root=UUID=900ed0fb-963a-4962-a941-2f6056f43d9e ro crashkernel=128M quiet splash vt.handoff=1
    
    
    You can also set the amount of reserved memory to be variable, depending on the total amount of installed memory:
    
    crashkernel=(range1):(size1),(range2):(size2)
    
    For example, "crashkernel=512M-2G:64M,2G-:128M" reserves 64 MB of memory if the total amount of system memory is between 512 MB and 2 GB, and 128 MB if the total amount of system memory is 2 GB or more. To place the reserved memory at a specific offset, use the following syntax:
    
    crashkernel=128M@16M
    
    
    This reserves 128 MB of memory starting at 16 MB (physical address 0x01000000)
  5. Configuring the kdump type
  6. When a kernel crash is captured, the core dump can be either:
    • stored as a file in a local file system
    • written directly to a device
    • sent over a network using the NFS (Network File System) or SSH (Secure Shell) protocol
To confirm that the kernel dump mechanism is enabled, check that:
  1. the crashkernel boot parameter is present
  2. 
    cat /proc/cmdline
    
  3. the requested memory area for the kdump kernel is reserved
  4. 
    $ dmesg | grep -i crash
    
    
  5. display the current config
  6. 
    $ kdump-config show
    DUMP_MODE:        kdump
    USE_KDUMP:        1
    KDUMP_SYSCTL:     kernel.panic_on_oops=1
    KDUMP_COREDIR:    /var/crash
    crashkernel addr: 0x
       /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-5.0.0-37-generic
    kdump initrd: 
       /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-5.0.0-37-generic
    current state:    ready to kdump
    
    kexec command:
      /sbin/kexec -p --command-line="BOOT_IMAGE=/boot/vmlinuz-5.0.0-37-generic root=UUID=900ed0fb-963a-4962-a941-2f6056f43d9e ro quiet splash vt.handoff=1 systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
    
    
Testing the crash dump mechanism:

$ sudo -s
# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger

After the system has rebooted normally, you will find the kernel crash dump file and related subdirectories in the /var/crash directory:

drwxr-sr-x 2 root whoopsie  4096  一   7 13:44 202001071344
-rw-r--r-- 1 root whoopsie   298  一   7 13:44 kexec_cmd
-rw-r----- 1 root whoopsie 17439  一   7 13:44 linux-image-5.0.0-37-generic-202001071344.crash

Using the crash utility

To determine the cause of the system crash, you can use the crash utility, which provides an interactive prompt very similar to the GNU Debugger (GDB). This utility allows you to interactively analyze a running Linux system as well as a core dump created by netdump, diskdump, xendump, or kdump. Install the crash package:

sudo apt-get install crash
Run the crash utility:

crash vmlinux  /var/crash/(timestamp)/vmcore

Chapter 6 Debug Tips

#56 Surviving the Linux OOM Killer

When your Linux machine runs out of memory, the kernel calls the Out of Memory (OOM) killer to free some memory. The Linux kernel gives each running process a score called oom_score, which shows how likely the process is to be terminated when available memory is low. The oom_score of a process can be found in the /proc directory, as /proc/PID/oom_score. The OOM killer uses oom_score_adj (-1000 to 1000) to adjust the final calculated score. To check whether any of your processes have been OOM-killed:

grep -i kill /var/log/syslog
( -i, --ignore-case )

Hacking: The Art of Exploitation, 2nd Edition

by Jon Erickson

Chapter 0x200. PROGRAMMING

0x250. Getting Your Hands Dirty

firstprog.c:

#include <stdio.h>
int main() {
  int i;
  for(i=0; i < 10; i++) {
    puts("Hello, world!\n");
  }
  return 0;
}

0x251. The Bigger Picture

The GNU development tools include a program called objdump, which can be used to examine compiled binaries.

$ objdump -D a.out | grep -A20 main.:
000000000000063a <main>:
 63a: 55                    push   %rbp
 63b: 48 89 e5              mov    %rsp,%rbp
 63e: 48 83 ec 10           sub    $0x10,%rsp
 642: c7 45 fc 00 00 00 00  movl   $0x0,-0x4(%rbp)
 649: eb 10                 jmp    65b <main+0x21>
 64b: 48 8d 3d a2 00 00 00  lea    0xa2(%rip),%rdi        # 6f4 <_IO_stdin_used+0x4>
 652: e8 b9 fe ff ff        callq  510 
 657: 83 45 fc 01           addl   $0x1,-0x4(%rbp)
 65b: 83 7d fc 09           cmpl   $0x9,-0x4(%rbp)
 65f: 7e ea                 jle    64b <main+0x11>
 661: b8 00 00 00 00        mov    $0x0,%eax
 666: c9                    leaveq 
 667: c3                    retq   
 668: 0f 1f 84 00 00 00 00  nopl   0x0(%rax,%rax,1)
 66f: 00 

0000000000000670 <__libc_csu_init>:
 670: 41 57                 push   %r15
 672: 41 56                 push   %r14
 674: 49 89 d7              mov    %rdx,%r15
The output is piped into grep with the command-line option "-A20" to display only the 20 lines after the match for the regular expression main.:. The same code can be shown in Intel syntax by providing an additional command-line option, "-M intel", to objdump:

$ objdump -M intel -D a.out | grep -A20 main.:
000000000000063a <main>: 
 63a: 55                    push   rbp
 63b: 48 89 e5              mov    rbp,rsp
 63e: 48 83 ec 10           sub    rsp,0x10
 642: c7 45 fc 00 00 00 00  mov    DWORD PTR [rbp-0x4],0x0
 649: eb 10                 jmp    65b <main+0x21>
 64b: 48 8d 3d a2 00 00 00  lea    rdi,[rip+0xa2]        # 6f4 <_IO_stdin_used+0x4>
 652: e8 b9 fe ff ff        call   510 
 657: 83 45 fc 01           add    DWORD PTR [rbp-0x4],0x1
 65b: 83 7d fc 09           cmp    DWORD PTR [rbp-0x4],0x9
 65f: 7e ea                 jle    64b <main+0x11>
 661: b8 00 00 00 00        mov    eax,0x0
 666: c9                    leave  
 667: c3                    ret    
 668: 0f 1f 84 00 00 00 00  nop    DWORD PTR [rax+rax*1+0x0]
 66f: 00 

0000000000000670 <__libc_csu_init>:
 670: 41 57                 push   r15
 672: 41 56                 push   r14
 674: 49 89 d7              mov    r15,rdx

0x252. The x64 Processor


$ gdb -q ./a.out
Reading symbols from ./a.out...done.
(gdb) break main
Breakpoint 1 at 0x642: file firstprog.c, line 4.
(gdb) run
Starting program: /home/jerry/test/a.out 

Breakpoint 1, main () at firstprog.c:4
4   for(i=0; i < 10; i++) {
(gdb) info registers
rax            0x55555555463a 93824992233018
rbx            0x0 0
rcx            0x555555554670 93824992233072
rdx            0x7fffffffdd78 140737488346488
rsi            0x7fffffffdd68 140737488346472
rdi            0x1 1
rbp            0x7fffffffdc80 0x7fffffffdc80
rsp            0x7fffffffdc70 0x7fffffffdc70
r8             0x7ffff7dd0d80 140737351847296
r9             0x7ffff7dd0d80 140737351847296
r10            0x2 2
r11            0x3 3
r12            0x555555554530 93824992232752
r13            0x7fffffffdd60 140737488346464
r14            0x0 0
r15            0x0 0
rip            0x555555554642 0x555555554642 
eflags         0x202 [ IF ]
cs             0x33 51
ss             0x2b 43
ds             0x0 0
es             0x0 0
fs             0x0 0
gs             0x0 0

There are sixteen 64-bit general-purpose registers in x86-64. By convention,
  • %rax is used to store a function’s return value, if it exists and is no more than 64 bits long.
  • %rbx, %rbp, and %r12-r15 are callee-save registers: a called function that uses them must restore their original values before returning, so the caller can rely on them being preserved across function calls.
  • %rsp is used as the stack pointer, a pointer to the topmost element in the stack.
  • %rdi, %rsi, %rdx, %rcx, %r8, and %r9 are used to pass the first six integer or pointer parameters to called functions. Additional parameters (or large parameters such as structs passed by value) are passed on the stack.
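
As a rough illustration of the convention above, the following C sketch uses a hypothetical function, sum7, that takes seven integer parameters. It assumes the System V AMD64 calling convention used on Linux and a call that the compiler does not inline; the comments note where each argument would travel under those assumptions.

```c
/* Hypothetical example; the name sum7 does not come from the text.
 * Under the System V AMD64 convention (assuming the call is not inlined):
 *   a -> %rdi, b -> %rsi, c -> %rdx, d -> %rcx, e -> %r8, f -> %r9
 *   g -> passed on the stack (seventh integer argument)
 * The long result is returned in %rax. */
long sum7(long a, long b, long c, long d, long e, long f, long g)
{
    return a + b + c + d + e + f + g;
}
```

Compiling a caller of sum7 without optimization and disassembling it with objdump, as done earlier for main, should show the first six values being loaded into those registers before the call. The exact instructions depend on the compiler and its options.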

0x253. Assembly Language


White Paper: Red Hat Crash Utility

Abstract

Crash is a tool for interactively analyzing the state of the Linux system while it is running, or after a kernel crash has occurred and a core dump has been created by the netdump, diskdump, LKCD, kdump, xendump or kvmdump facilities. It is loosely based on the SVR4 UNIX crash command, but has been significantly enhanced by completely merging it with the gdb debugger. The crash utility is designed to be independent of Linux version dependencies.

Prerequisites

The crash utility has the following prerequisites:
  • kernel object file: a vmlinux kernel object file that was built with the -g C flag.
  • memory image: a kernel crash dump file generated by any of the supported dump facilities, or live system memory accessed via /dev/mem (or /dev/crash). If no dump file argument is given on the crash command line, live system memory is used by default. Examining a live system requires root privileges.
  • platform processor types: the crash utility is actively developed and tested on the x86, x86_64, ia64, ppc64, arm, s390 and s390x processors.
  • Linux kernel versions: the crash utility is backwards-compatible with older kernel versions.

Installation


$ sudo apt-get install crash

Invocation

When crash is run on a dumpfile, at least two arguments are always required:
  • The kernel object filename, for example vmlinux
  • The dumpfile name, for example vmcore

pr_debug()

 “Kernel hacking” -> kobject debugging [CONFIG_DEBUG_KOBJECT]
    Some files call pr_debug(), which is ordinarily an empty macro that discards
    its arguments at compile time.  To enable debugging output, build the
    appropriate file with -DDEBUG by adding
      CFLAGS_[filename].o := -DDEBUG
    to the makefile.
    For example, to see all attempts to spawn a usermode helper (such as
    /sbin/hotplug), add to lib/Makefile the line:
        CFLAGS_kobject_uevent.o := -DDEBUG
    Then boot the new kernel, do something that spawns a usermode helper, and
    use the "dmesg" command to view the pr_debug() output.
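
The compile-time switch described above can be sketched in userspace C. This is only an illustration of the mechanism, not the kernel's actual definition: my_pr_debug and spawn_helper are hypothetical names, and the message goes to stderr rather than to the kernel log.

```c
#include <stdio.h>

/* Userspace stand-in for the kernel's pr_debug(); my_pr_debug and
 * spawn_helper are hypothetical names, not kernel APIs. */
#ifdef DEBUG
#define my_pr_debug(...) fprintf(stderr, __VA_ARGS__)
#else
/* Without -DDEBUG the arguments are discarded at compile time
 * and never evaluated. */
#define my_pr_debug(...) ((void)0)
#endif

/* Pretends to spawn a helper program; always returns 0 (success). */
int spawn_helper(const char *path)
{
    my_pr_debug("spawning helper %s\n", path);
    (void)path; /* avoids an unused-parameter warning when DEBUG is off */
    return 0;
}
```

Built with -DDEBUG on the compiler command line, a call such as spawn_helper("/sbin/hotplug") prints the message; built without it, the same call prints nothing.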
Debugging Support in the Kernel
Except where specified otherwise, all of these options are found under the “kernel hacking” menu in whatever kernel configuration tool you prefer. Note that some of these options are not supported by all architectures.
CONFIG_DEBUG_KERNEL
This option just makes other debugging options available; it should be turned on but does not, by itself, enable any features.
CONFIG_DEBUG_SLAB
This crucial option turns on several types of checks in the kernel memory allocation functions; with these checks enabled, it is possible to detect a number of memory overrun and missing initialization errors. Each byte of allocated memory is set to 0xa5 before being handed to the caller and then set to 0x6b when it is freed. If you ever see either of those “poison” patterns repeating in output from your driver (or often in an oops listing), you’ll know exactly what sort of error to look for. When debugging is enabled, the kernel also places special guard values before and after every allocated memory object; if those values ever get changed, the kernel knows that somebody has overrun a memory allocation, and it complains loudly. Various checks for more obscure errors are enabled as well.

CONFIG_DEBUG_PAGEALLOC
Full pages are removed from the kernel address space when freed. This option can slow things down significantly, but it can also quickly point out certain kinds of memory corruption errors.
CONFIG_DEBUG_SPINLOCK
With this option enabled, the kernel catches operations on uninitialized spinlocks and various other errors (such as unlocking a lock twice).
CONFIG_DEBUG_SPINLOCK_SLEEP
This option enables a check for attempts to sleep while holding a spinlock. In fact, it complains if you call a function that could potentially sleep, even if the call in question would not sleep.
CONFIG_INIT_DEBUG
Items marked with __init (or __initdata) are discarded after system initialization or module load time. This option enables checks for code that attempts to access initialization-time memory after initialization is complete.
CONFIG_DEBUG_INFO
This option causes the kernel to be built with full debugging information included. You’ll need that information if you want to debug the kernel with gdb. You may also want to enable CONFIG_FRAME_POINTER if you plan to use gdb.
CONFIG_MAGIC_SYSRQ
Enables the “magic SysRq” key. We look at this key in the section “System Hangs,” later in this chapter.
CONFIG_DEBUG_STACKOVERFLOW
CONFIG_DEBUG_STACK_USAGE
These options can help track down kernel stack overflows. A sure sign of a stack overflow is an oops listing without any sort of reasonable back trace. The first option adds explicit overflow checks to the kernel; the second causes the kernel to monitor stack usage and make some statistics available via the magic SysRq key.
CONFIG_KALLSYMS
This option (under “General setup/Standard features”) causes kernel symbol information to be built into the kernel; it is enabled by default. The symbol information is used in debugging contexts; without it, an oops listing can give you a kernel traceback only in hexadecimal, which is not very useful.
CONFIG_IKCONFIG
CONFIG_IKCONFIG_PROC
These options (found in the “General setup” menu) cause the full kernel configuration state to be built into the kernel and to be made available via /proc. Most kernel developers know which configuration they used and do not need these options (which make the kernel bigger). They can be useful, though, if you are trying to debug a problem in a kernel built by somebody else.

CONFIG_ACPI_DEBUG
Under “Power management/ACPI.” This option turns on verbose ACPI (Advanced Configuration and Power Interface) debugging information, which can be useful if you suspect a problem related to ACPI.
CONFIG_DEBUG_DRIVER
Under “Device drivers.” Turns on debugging information in the driver core, which can be useful for tracking down problems in the low-level support code. We’ll look at the driver core in Chapter 14.

CONFIG_SCSI_CONSTANTS
This option, found under “Device drivers/SCSI device support,” builds in information for verbose SCSI error messages. If you are working on a SCSI driver, you probably want this option.

CONFIG_INPUT_EVBUG
This option (under “Device drivers/Input device support”) turns on verbose logging of input events. If you are working on a driver for an input device, this option may be helpful. Be aware of the security implications of this option, however: it logs everything you type, including your passwords.

CONFIG_PROFILING
This option is found under “Profiling support.” Profiling is normally used for system performance tuning, but it can also be useful for tracking down some kernel hangs and related problems.
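
To make the poisoning and guard values described under CONFIG_DEBUG_SLAB more concrete, here is a minimal userspace sketch of the same idea. It is not the kernel's implementation: dbg_alloc, dbg_guards_intact, dbg_free, the block layout, and the guard byte 0xcc are all assumptions made for illustration. Only the 0xa5 and 0x6b patterns come from the text.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define POISON_ALLOC 0xa5 /* fill pattern for fresh memory (from the text) */
#define POISON_FREE  0x6b /* fill pattern for freed memory (from the text) */
#define GUARD_BYTE   0xcc /* hypothetical guard value, not the kernel's */
#define GUARD_LEN    16

/* Block layout (an assumption of this sketch):
 * [header holding size][front guard][payload][rear guard] */
static unsigned char *dbg_base(void *payload)
{
    return (unsigned char *)payload - 2 * GUARD_LEN;
}

/* Allocates n bytes, poisons them, and surrounds them with guards. */
void *dbg_alloc(size_t n)
{
    unsigned char *base = malloc(3 * GUARD_LEN + n);
    if (base == NULL)
        return NULL;
    memcpy(base, &n, sizeof n);                      /* remember the size */
    memset(base + GUARD_LEN, GUARD_BYTE, GUARD_LEN); /* front guard */
    memset(base + 2 * GUARD_LEN, POISON_ALLOC, n);   /* poisoned payload */
    memset(base + 2 * GUARD_LEN + n, GUARD_BYTE, GUARD_LEN); /* rear guard */
    return base + 2 * GUARD_LEN;
}

/* Returns 1 if both guards are untouched, 0 if something overran them. */
int dbg_guards_intact(void *payload)
{
    unsigned char *base = dbg_base(payload);
    size_t n;
    memcpy(&n, base, sizeof n);
    for (size_t i = 0; i < GUARD_LEN; i++) {
        if (base[GUARD_LEN + i] != GUARD_BYTE)
            return 0;
        if (base[2 * GUARD_LEN + n + i] != GUARD_BYTE)
            return 0;
    }
    return 1;
}

/* Poisons the payload and releases the block. An optimizing compiler
 * may drop the memset because the memory is freed right after. */
void dbg_free(void *payload)
{
    unsigned char *base = dbg_base(payload);
    size_t n;
    memcpy(&n, base, sizeof n);
    memset(payload, POISON_FREE, n);
    free(base);
}
```

A buffer returned by dbg_alloc(4) reads back as 0xa5 in every byte, and writing one byte past its end makes dbg_guards_intact report 0, which is the kind of overrun the kernel option is meant to catch.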
We will revisit some of the above options as we look at various ways of tracking down kernel problems. But first, we will look at the classic debugging technique: print statements.
printk messages carry one of eight loglevels, listed here in order of decreasing severity:
KERN_EMERG
An emergency condition; the system is probably dead.
KERN_ALERT
A problem that requires immediate attention.
KERN_CRIT
A critical condition.
KERN_ERR
An error.
KERN_WARNING
A warning.
KERN_NOTICE
A normal, but perhaps noteworthy, condition.
KERN_INFO
An informational message.
KERN_DEBUG
A debug message, typically superfluous.
If the priority is less than the integer variable console_loglevel, the message is delivered to the console one line at a time (nothing is sent unless a trailing newline is provided).
The variable console_loglevel is initialized to DEFAULT_CONSOLE_LOGLEVEL.
It is also possible to read and modify the console loglevel using the text file /proc/sys/kernel/printk.
The file hosts four integer values: the current loglevel, the default level for messages that lack an explicit loglevel, the minimum allowed loglevel, and the boot-time default loglevel.
Writing a single value to this file changes the current loglevel to that value; thus, for example, you can cause all kernel messages to appear at the console by simply entering:
# echo 8 > /proc/sys/kernel/printk
It should now be apparent why the hello.c sample had the KERN_ALERT markers; they are there to make sure that the messages appear on the console.
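
The loglevel rule and the four-value file format described above can be sketched in a few lines of userspace C. The struct and function names here are hypothetical, and the parser works on a string so that it can be tried without reading /proc/sys/kernel/printk.

```c
#include <stdio.h>

/* Hypothetical holder for the four integers in /proc/sys/kernel/printk. */
struct printk_levels {
    int current_level;   /* current console loglevel */
    int default_message; /* level for messages without an explicit one */
    int minimum;         /* lowest value the console loglevel may take */
    int boot_default;    /* boot-time default console loglevel */
};

/* Parses text such as "4 4 1 7"; returns 0 on success, -1 otherwise. */
int parse_printk(const char *text, struct printk_levels *out)
{
    int n = sscanf(text, "%d %d %d %d",
                   &out->current_level, &out->default_message,
                   &out->minimum, &out->boot_default);
    return n == 4 ? 0 : -1;
}

/* A message reaches the console when its priority number is lower than
 * the console loglevel (lower numbers are more severe). */
int reaches_console(int msg_priority, int console_loglevel)
{
    return msg_priority < console_loglevel;
}
```

With a console loglevel of 4, a KERN_ALERT message (priority 1) reaches the console and a KERN_DEBUG message (priority 7) does not. After raising the level to 8, every priority from 0 to 7 passes the check.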
The symbol information built in by CONFIG_KALLSYMS costs memory, but it is worth the memory use during not only development but also deployment. The configuration option CONFIG_KALLSYMS_ALL additionally stores the symbolic name of all symbols, not only functions. This is generally needed only by specialized debuggers. The CONFIG_KALLSYMS_EXTRA_PASS option causes the kernel build process to make a second pass over the kernel’s object code. It is useful only when debugging kallsyms itself.
Thanks to kernel preemption, the kernel has a central atomicity counter. The kernel can be set such that if a task sleeps while atomic, or even does something that might sleep, the kernel prints a warning and provides a back trace. Potential bugs that are detectable include calling schedule() while holding a lock, issuing a blocking memory allocation while holding a lock, or sleeping while holding a reference to per-CPU data. This debugging infrastructure catches a lot of bugs and is highly recommended. The following options make the best use of this feature:
    CONFIG_PREEMPT=y
    CONFIG_DEBUG_KERNEL=y
    CONFIG_KALLSYMS=y
    CONFIG_DEBUG_SPINLOCK_SLEEP=y
Most architectures define BUG() and BUG_ON() as illegal instructions, which result in the desired oops. You normally use these routines as assertions, to flag situations that should not happen:
    if (bad_thing)
            BUG();
Or even better:
    BUG_ON(bad_thing);
A more critical error is signaled via panic(). A call to panic() prints an error message and then halts the kernel. Obviously, you want to use it only in the worst of situations:
    if (terrible_thing)
            panic("terrible_thing is %ld!\n", terrible_thing);
Sometimes, you just want a simple stack trace issued on the console to help you in debugging. In those cases, dump_stack() is used. It simply dumps the contents of the registers and a function back trace to the console:
    if (!debug_check) {
            printk(KERN_DEBUG "provide some information...\n");
            dump_stack();
    }
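
dump_stack() exists only inside the kernel, but a loosely similar back trace can be produced in a userspace program. The sketch below assumes glibc, whose <execinfo.h> provides backtrace() and backtrace_symbols_fd(); the function name user_dump_stack is hypothetical. Unlike the kernel routine, it prints only the call chain, not register contents.

```c
#include <execinfo.h> /* backtrace(), backtrace_symbols_fd(): glibc-specific */
#include <unistd.h>   /* STDERR_FILENO */

#define MAX_FRAMES 32

/* Hypothetical userspace analogue of dump_stack(): writes the current
 * call chain to stderr and returns the number of frames captured. */
int user_dump_stack(void)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);

    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    return n;
}
```

Linking with -rdynamic generally lets glibc print function names instead of bare addresses for symbols in the executable itself.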
