Linux Server Tips

Linux OS installation, configuration



init, systemd


The init process is the first process run by the kernel at the end of the bootstrap procedure.

In the old days:

init usually reads system-dependent initialization files from /etc/rc/* and brings the system to the state defined in the /etc/inittab file.
You can use the init command to reread this file or change to a new runlevel.
For example, to change from your current runlevel to runlevel 1 (the single-user mode), you can use the init 1 command.


"Runlevels" are an obsolete way to start and stop groups of services used in SysV init.
systemd provides a compatibility layer that maps runlevels to targets, and associated binaries like runlevel.
Mapping between runlevels and systemd targets:

       ┌─────────┬───────────────────┐
       │Runlevel │ Target            │
       ├─────────┼───────────────────┤
       │0        │ poweroff.target   │
       ├─────────┼───────────────────┤
       │1        │ rescue.target     │
       ├─────────┼───────────────────┤
       │2, 3, 4  │ multi-user.target │
       ├─────────┼───────────────────┤
       │5        │ graphical.target  │
       ├─────────┼───────────────────┤
       │6        │ reboot.target     │
       └─────────┴───────────────────┘
The command runlevel prints the previous and current SysV runlevel if they are known.

$ runlevel
N 5

The two runlevel characters are separated by a single space character. If a runlevel cannot be determined, N is printed instead. If neither can be determined, the word "unknown" is printed.


  /sbin/init -> /lib/systemd/systemd
systemd is a system and service manager for Linux operating systems. When run as the first process on boot (as PID 1), it acts as the init system that brings up and maintains userspace services.

When run as a system instance, systemd interprets the configuration file /etc/systemd/system.conf and the files in system.conf.d directories; when run as a user instance, systemd interprets the configuration file /etc/systemd/user.conf and the files in user.conf.d directories.
$ tree /etc/systemd -L 2
/etc/systemd
├── journald.conf
├── logind.conf
├── network
├── resolved.conf
├── system
│   ├── bluetooth.target.wants
│   ├── brltty.service -> /dev/null
│   ├── cloud-final.service.wants
│   ├── dbus-fi.w1.wpa_supplicant1.service -> /lib/systemd/system/wpa_supplicant.service
│   ├── dbus-org.bluez.service -> /lib/systemd/system/bluetooth.service
│   ├── dbus-org.freedesktop.Avahi.service -> /lib/systemd/system/avahi-daemon.service
│   ├── dbus-org.freedesktop.ModemManager1.service -> /lib/systemd/system/ModemManager.service
│   ├── dbus-org.freedesktop.nm-dispatcher.service -> /lib/systemd/system/NetworkManager-dispatcher.service
│   ├── dbus-org.freedesktop.resolve1.service -> /lib/systemd/system/systemd-resolved.service
│   ├── dbus-org.freedesktop.thermald.service -> /lib/systemd/system/thermald.service
│   ├── default.target.wants
│   ├── display-manager.service -> /lib/systemd/system/gdm3.service
│   ├── display-manager.service.wants
│   ├── final.target.wants
│   ├── getty.target.wants
│   ├── graphical.target.wants
│   ├── libvirt-bin.service -> /lib/systemd/system/libvirtd.service
│   ├── multi-user.target.wants
│   ├── network-online.target.wants
│   ├── oem-config.service.wants
│   ├── paths.target.wants
│   ├── printer.target.wants
│   ├── snap-core18-1279.mount
│   ├── snap-core18-1288.mount
│   ├── snap-core-8213.mount
│   ├── snap-core-8268.mount
│   ├── snap-gnome\x2d3\x2d26\x2d1604-97.mount
│   ├── snap-gnome\x2d3\x2d26\x2d1604-98.mount
│   ├── snap-gnome\x2d3\x2d28\x2d1804-110.mount
│   ├── snap-gnome\x2d3\x2d28\x2d1804-91.mount
│   ├── snap-gnome\x2dcalculator-536.mount
│   ├── snap-gnome\x2dcalculator-544.mount
│   ├── snap-gnome\x2dcharacters-367.mount
│   ├── snap-gnome\x2dcharacters-375.mount
│   ├── snap-gnome\x2dlogs-73.mount
│   ├── snap-gnome\x2dlogs-81.mount
│   ├── snap-gnome\x2dsystem\x2dmonitor-111.mount
│   ├── snap-gnome\x2dsystem\x2dmonitor-123.mount
│   ├── snap-gtk\x2dcommon\x2dthemes-1313.mount
│   ├── snap-gtk\x2dcommon\x2dthemes-1353.mount
│   ├── snap-vlc-1049.mount
│   ├── snap-vlc-1397.mount
│   ├── sockets.target.wants
│   ├── spice-vdagentd.target.wants
│   ├── sshd.service -> /lib/systemd/system/ssh.service
│   ├── sysinit.target.wants
│   ├── syslog.service -> /lib/systemd/system/rsyslog.service
│   ├── teamviewerd.service
│   └── timers.target.wants
├── system.conf
├── timesyncd.conf
├── user
│   └── default.target.wants
└── user.conf

  • The old init scripts started tasks one by one, in sequence, so even services with no dependencies had to wait for one another. Since today's hosts and operating systems almost all support multi-core architectures, there is no reason independent services cannot start at the same time. systemd starts services in parallel, so you will notice the system boots faster.
  • systemd is managed with the single systemctl command.
  • systemd can check service dependencies itself: if service B is built on top of service A, systemd automatically starts service A for you.
  • Just as SysV init has its runlevels, systemd collects many functions into so-called targets. A target describes the construction of an operating environment and groups many daemons, so activating a target means starting all of its daemons.
In the past, to create a system service you had to write a corresponding bash shell script under /etc/init.d/.
The main command to inspect and control systemd is systemctl. It can be used to examine the system state and manage the system and its services.
  • Analyzing the system state
    • Show system status
    • 
      $ systemctl status
      
    • List running units
    • 
      $ systemctl
      
      The available unit files can be seen in /usr/lib/systemd/system/ and /etc/systemd/system/ (the latter takes precedence).
    • List failed units
    • 
      $ systemctl --failed
      
    • List installed unit files
    • 
      $ systemctl list-unit-files
      
  • Using units
  • Units can be, for example, services (.service), mount points (.mount), devices (.device) or sockets (.socket). The following list describes the types of units available to systemd:
    • .service
    • A service unit describes how to manage a service or application on the server. This will include how to start or stop the service, under which circumstances it should be automatically started, and the dependency and ordering information for related software.
    • .socket
    • A socket unit file describes a network or IPC socket, or a FIFO buffer that systemd uses for socket-based activation. These always have an associated .service file that will be started when activity is seen on the socket that this unit defines.
    • .device
    • A unit that describes a device that has been designated as needing systemd management by udev or the sysfs filesystem. Not all devices will have .device files. Some scenarios where .device units may be necessary are for ordering, mounting, and accessing the devices.
    • .mount
    • This unit defines a mountpoint on the system to be managed by systemd. These are named after the mount path, with slashes changed to dashes. Entries within /etc/fstab can have units created automatically.
    • .automount
    • An .automount unit configures a mountpoint that will be automatically mounted. These must be named after the mount point they refer to and must have a matching .mount unit to define the specifics of the mount.
    • .swap
    • This unit describes swap space on the system. The name of these units must reflect the device or file path of the space.
    • .target
    • A target unit is used to provide synchronization points for other units when booting up or changing states. They also can be used to bring the system to a new state. Other units specify their relation to targets to become tied to the target’s operations.
    • .path
    • This unit defines a path that can be used for path-based activation. By default, a .service unit of the same base name will be started when the path reaches the specified state. This uses inotify to monitor the path for changes.
    • .timer
    • A .timer unit defines a timer that will be managed by systemd, similar to a cron job for delayed or scheduled activation. A matching unit will be started when the timer is reached.
    • .snapshot
    • A .snapshot unit is created automatically by the systemctl snapshot command. It allows you to reconstruct the current state of the system after making changes. Snapshots do not survive across sessions and are used to roll back temporary states.
    • .slice
    • A .slice unit is associated with Linux Control Group nodes, allowing resources to be restricted or assigned to any processes associated with the slice. The name reflects its hierarchical position within the cgroup tree. Units are placed in certain slices by default depending on their type.
    • .scope
    • Scope units are created automatically by systemd from information received from its bus interfaces. These are used to manage sets of system processes that are created externally.
    When using systemctl, you generally have to specify the complete name of the unit file, including its suffix, for example sshd.socket. If you do not specify the suffix, systemctl will assume .service. For example, netctl and netctl.service are equivalent.
    • Start a unit immediately
    • # systemctl start unit
    • Stop a unit immediately
    • # systemctl stop unit
    • Restart a unit
    • # systemctl restart unit
    • Ask a unit to reload its configuration
    • # systemctl reload unit
    • Show the status of a unit, including whether it is running or not
    • $ systemctl status unit
    • Check whether a unit is already enabled or not
    • $ systemctl is-enabled unit
    • Enable a unit to be started on bootup
    • # systemctl enable unit
    • Enable a unit to be started on bootup and Start immediately
    • # systemctl enable --now unit
    • Disable a unit to not start during bootup
    • # systemctl disable unit
    • Show the manual page associated with a unit (this has to be supported by the unit file)
    • $ systemctl help unit
    • Reload systemd manager configuration, scanning for new or changed units
    • # systemctl daemon-reload

Writing unit files
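
A systemd unit file is INI-style text. A minimal sketch of a custom service unit is shown below; the name myapp.service and the binary /usr/local/bin/myapp are hypothetical:

$ sudo vi /etc/systemd/system/myapp.service
[Unit]
Description=My example application (hypothetical)
After=network.target

[Service]
ExecStart=/usr/local/bin/myapp
Restart=on-failure

[Install]
WantedBy=multi-user.target

After creating or editing a unit file, reload the manager configuration and enable the unit:

# systemctl daemon-reload
# systemctl enable --now myapp.service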


Linux Server Hacks


Hack#2 Console Logging without password


Using GRUB to invoke bash,
  • Select the appropriate boot entry in the GRUB menu and press e to edit the line.
  • Select the kernel line and press e again to edit it.
  • Append init=/bin/bash at the end of line.
  • init=
    
    Run specified binary instead of /sbin/init as init process.
  • Press Ctrl-X to boot (this change is only temporary and will not be saved to your menu.lst). After booting you will be at the bash prompt.
  • Your root file system is mounted as readonly now, so remount it as read/write
  • mount -n -o remount,rw /
  • Use the passwd command to create a new root password.
  • Mount filesystems manually
  • /etc/fstab is not processed because init never ran. Run "mount -a" to mount everything listed in /etc/fstab, or mount the necessary filesystems manually.
  • Reboot by typing reboot -f and do not lose your password again!

Hack#5 n>&m


Each open file in Linux has a corresponding file descriptor associated with it.
The Bourne shell operator n>&m rearranges the file descriptors.
It makes descriptor n refer to the same open file as descriptor m.
Ex.,
  • output including error
  • 
    $ ls test* test.none
    ls: cannot access 'test.none': No such file or directory
    test.c  testc.cpp  test.cpio.gz  test.cpp  testelf.c  test.sh    
        
  • re-direct the standard output
  • 
    $ ls test* test.none 1>'./stdout'
    ls: cannot access 'test.none': No such file or directory
    $ cat ./stdout
    test.c
    testc.cpp
    test.cpio.gz
    test.cpp
    testelf.c
    test.sh
        
  • re-direct the standard error to the standard output
  • 
    $ ls test* test.none 1>'./stdout' 2>&1
    $ cat ./stdout
    ls: cannot access 'test.none': No such file or directory
    test.c
    testc.cpp
    test.cpio.gz
    test.cpp
    testelf.c
    test.sh
        
        
    The standard error is redirected to the standard output, which has already been redirected to a file.
POSIX definition of Redirection Operator:
In the shell command language, a token that performs a redirection function.
Redirections are processed in the order they appear, from left to right.
It is one of the following symbols:
  • command < file.txt
  • Gives input to a command.
  • command <> file.txt
  • Opens the file for both reading and writing on standard input. If the file doesn't exist, it will be created.
  • command > file.txt
  • Directs the output of a command into a file.
  • command >| file.txt
  • Does the same as >, but overwrites the target even if the shell's noclobber option is set.
  • command << WORD
  • A here document.
    
     command << WORD
         Text
     WORD    	
        
    Everything up to the terminating WORD (the Text) becomes the input to the command.
  • command >> file.txt
  • Does the same as >, except that if the target file exists, the new data are appended.
  • command <&num
  • Duplicates file descriptor num as the standard input.
  • command >&num
  • Duplicates file descriptor num as the standard output.
  • command <<- WORD
  • Like <<, but leading tab characters are stripped from the input lines and the delimiter, so the here document can be indented.
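Because redirections are processed left to right, their order matters. Reusing the ls example above: writing 2>&1 before the file redirection duplicates the terminal (the current standard output), so the error message still appears on screen:

$ ls test* test.none 2>&1 1>'./stdout'
ls: cannot access 'test.none': No such file or directory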

Linux disk and filesystem management


/etc/fstab


The file fstab contains descriptive information about the filesystems the system can mount.
Each filesystem is described on a separate line. Fields on each line are separated by tabs or spaces.

(file system) (mount point)   (type)  (options)       (dump)  (pass)

  • file system
  • the block device (/dev/xxx) or remote filesystem (host:dir) to be mounted.
    LABEL=label or UUID=uuid may be given instead of a device name.
  • mount point
  • directory
  • type
  • filesystem type
  • options
  • It is formatted as a comma-separated list of options.
  • dump
  • Defaults to zero (don't dump) if not present. Used by dump to determine which filesystems need to be dumped; this field is rarely used today.
  • pass
  • This field is used by fsck to determine the order in which filesystem checks are done at boot time. The root filesystem should be specified with 1. Other filesystems should have 2. Defaults to zero (don't fsck) if not present.

/etc/fstab is a list of filesystems to be mounted at boot time.
/etc/mtab is a list of currently mounted filesystems.

How to determine/find UUID of a partition?


In Linux, a UUID (Universally Unique Identifier) identifies media more accurately and reliably. Identifying media via /dev/hdXY or /dev/sdXY is not a good method, because the device order may differ between boots, so it is no longer preferred, especially in fstab or the GRUB config.

libuuid has been part of the util-linux-ng package since version 2.15.1 and is installed by default on Linux systems.
The UUIDs generated by this library can be reasonably expected to be unique within a system, and unique across all systems.
UUIDs are represented as 32 hexadecimal (base 16) digits, displayed in five groups separated by hyphens, in the form 8-4-4-4-12 for a total of 36 characters (32 alphanumeric characters and four hyphens).
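
The uuidgen utility, also shipped with util-linux, prints a fresh UUID in exactly this format (the value below is illustrative):

$ uuidgen
e40486c6-84d5-4f2f-b99c-032281799c9d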


How to find UUIDs of my hard disk partitions?


~$ tree /dev/disk/
/dev/disk/
├── by-id
│   ├── ata-TEAC_DVD-ROM_DV18SA_10091725083237 -> ../../sr0
│   ├── ata-WDC_WD2500BEKT-75A25T0_WD-WXQ1A80V7620 -> ../../sda
│   ├── ata-WDC_WD2500BEKT-75A25T0_WD-WXQ1A80V7620-part1 -> ../../sda1
│   ├── wwn-0x50014ee655d0b10e -> ../../sda
│   └── wwn-0x50014ee655d0b10e-part1 -> ../../sda1
├── by-partuuid
│   └── abfa7e81-01 -> ../../sda1
├── by-path
│   ├── pci-0000:00:1f.2-ata-1 -> ../../sda
│   ├── pci-0000:00:1f.2-ata-1-part1 -> ../../sda1
│   └── pci-0000:00:1f.2-ata-2 -> ../../sr0
└── by-uuid
    └── 3db7ffaf-51bc-4f72-a09d-5ec2f3904c08 -> ../../sda1

$ sudo blkid
[sudo] password for jerry: 
/dev/loop0: TYPE="squashfs"
/dev/loop1: TYPE="squashfs"
/dev/loop2: TYPE="squashfs"
/dev/loop3: TYPE="squashfs"
/dev/loop4: TYPE="squashfs"
/dev/loop5: TYPE="squashfs"
/dev/loop6: TYPE="squashfs"
/dev/loop7: TYPE="squashfs"
/dev/sda1: UUID="3db7ffaf-51bc-4f72-a09d-5ec2f3904c08" TYPE="ext4" PARTUUID="abfa7e81-01"
/dev/loop8: TYPE="squashfs"
/dev/loop9: TYPE="squashfs"
/dev/loop10: TYPE="squashfs"
/dev/loop11: TYPE="squashfs"
/dev/loop12: TYPE="squashfs"

$ cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
# / was on /dev/sda1 during installation
UUID=3db7ffaf-51bc-4f72-a09d-5ec2f3904c08 /               ext4    errors=remount-ro 0       1
/swapfile                                 none            swap    sw              0       0


How to generate a new UUID for a partition?


tune2fs allows the system administrator to adjust various tunable filesystem parameters on Linux ext2, ext3, or ext4 filesystems.

tune2fs [ -l ] [ -c max-mount-counts ] [ -e errors-behavior ] [ -f ] [ -i interval-between-checks ]
       [ -I new_inode_size ] [ -j ] [ -J journal-options ] [ -m reserved-blocks-percentage ]
       [ -o [^]mount-options[,...] ] [ -r reserved-blocks-count ] [ -u user ] [ -g group ]
       [ -C mount-count ] [ -E extended-options ] [ -L volume-label ] [ -M last-mounted-directory ]
       [ -O [^]feature[,...] ] [ -Q quota-options ] [ -T time-last-checked ] [ -U UUID ]
       [ -z undo_file ] device

The device specifier can either be a filename (i.e., /dev/sda1), or a LABEL or UUID specifier: "LABEL=volume-label" or "UUID=uuid". (i.e., LABEL=home or UUID=e40486c6-84d5-4f2f-b99c-032281799c9d).


~$ sudo tune2fs -U random /dev/sda1

tune2fs 1.44.1 (24-Mar-2018)
The UUID may only be changed when the filesystem is unmounted.

After changing the UUID, we should update /etc/fstab to use the new value.
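
A sketch of the full procedure, assuming the target filesystem is /dev/sdb1 and that it is currently unmounted (for the root filesystem, boot from a live system first):

$ sudo umount /dev/sdb1
$ sudo tune2fs -U random /dev/sdb1
$ sudo blkid /dev/sdb1    # confirm the new UUID before editing /etc/fstab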

LinuxServer hacks

Hack#8 Immutable Files in ext2/ext3


Linux provides us the access control by file and directory permissions on three levels: user, group and other. These file permissions provide the basic level of security and access control.

The umask utility is used to control the file-creation mode mask, which determines the initial value of file permission bits for newly created files. Because umask affects the current shell execution environment, it is usually implemented as built-in command of a shell.

$ umask
0022
$ umask -S
u=rwx,g=rx,o=rx

Linux also has advanced access control features like ACLs (Access Control Lists) and attributes. Attributes define properties of files.
a: append only
c: compressed
d: no dump
e: extent format
i: immutable
j: data journalling
s: secure deletion
t: no tail-merging
u: undeletable
A: no atime updates
C: no copy on write
D: synchronous directory updates
S: synchronous updates
T: top of directory hierarchy
For example, a file with the i attribute cannot be modified:
it cannot be deleted or renamed, no link can be created to it, and no data can be written to it.
When set, this attribute prevents even the superuser from erasing or changing the contents of the file.

Some Linux-native filesystems support several attributes that you can adjust with the chattr command; use the lsattr command to display the attributes of a file.
  • chattr
  • 
     chattr [-RVf] [-+=AacDdijsTtSu] [-v version] files...
    
  • lsattr
  • 
     lsattr [ -RVadv ] [ files...  ]
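A quick demonstration of the immutable attribute, using a hypothetical file important.conf (the exact lsattr output varies by filesystem):

$ sudo chattr +i important.conf
$ lsattr important.conf
----i---------e---- important.conf
$ sudo rm important.conf
rm: cannot remove 'important.conf': Operation not permitted
$ sudo chattr -i important.conf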
    

Hack#11 Finding and Eliminating setuid/setgid Binaries


Linux uses a combination of bits to store the permissions of a file. We can change the permissions using the chmod command, which essentially changes the ‘r’, ‘w’ and ‘x’ characters associated with the file.

$ chmod u=rwx filename
$ chmod go=rx filename
$ chmod g+w foobar
$ chmod a-w foobar
Note, a: all; use this instead of typing "ugo".

Further, the ownership of files depends on the uid (user ID) and the gid (group ID) of the creator. When we launch a process, it runs with the uid (effective user ID) and gid (effective group ID) of the user who launched it.
  • The setuid bit
  • When the setuid bit is set on an executable, it runs with the permissions of the file's owner instead of those of the user who launched it. To locate the setuid bit, look for an 's' instead of an 'x' in the owner's executable bit of the file permissions. An example of an executable with setuid permission is passwd,
    
    $ ls -l /usr/bin/passwd
    -rwsr-xr-x 1 root root 59640  三  23  2019 /usr/bin/passwd
    
    
    This means that passwd executes with the permissions of its owner, root. If a vulnerable program runs with root privileges, an attacker could gain root access to the system through it. To find all setuid files:
    
    find /usr/bin -perm -u+s -type f -print | xargs ls -ld
    
    
  • The setgid bit
  • The setgid bit affects both files and directories. When set on a file, it executes with the privileges of the group that owns the file. When the bit is set on a directory, files created in that directory receive the same group as the parent directory. This is used for directory sharing. To locate the setgid bit, look for an 's' in the group section of the file permissions. To find all setgid files:
    
    find /usr/bin -perm -g+s -type f -print | xargs ls -ld
    
    
  • The sticky bit
  • The sticky bit has no effect on files. When used on a directory, all the files inside the directory will be modifiable only by their owners, the directory owner, and the root user. Without the sticky bit set, any user with write and execute permissions on the directory can rename or delete contained files, regardless of the files' owners. Typically, it is set on the /tmp directory to prevent ordinary users from deleting or moving other users' files. The sticky bit is identifiable by a t reported where normally the executable x bit is shown, in the "other" section. A lowercase t implies that the executable bit is also present; otherwise you would see a capital T.
    drwxrwxrwt  22 root root       4096  一   1 12:35 tmp
    
    

Therefore,
  • setuid ==> personal file access permission forwarding
  • setgid ==> group file access permission forwarding sharing + parent group directory sharing
  • sticky bit ==> directory sharing without file sharing

File Permissions: ACLs


ACLs are a second level of discretionary permissions that may override the standard ugo/rwx ones. ACLs are used to define more fine-grained discretionary access control for files and directories.
An ACL consists of entries specifying access permissions on an associated object. ACLs can be configured per user, per group or via the effective rights mask.
  • Enabling ACLs in the Filesystem
  • The file system must be mounted with ACLs turned on. To check if it is on,
    
    $ sudo tune2fs -l /dev/sda1 | grep acl
    Default mount options:    user_xattr acl
    
    
    the partition(s) on which you want to enable ACL can be set in /etc/fstab.
    
     ...
     UUID=07aebd28-24e3-cf19-e37d-1af9a23a45d4    /home    ext4   defaults,acl   0   2
     ...
    
  • Set ACL by user
  • List the ACLs for a given file or directory.
    
    $ sudo touch testAcl.txt
    $ ll testAcl.txt
    -rw-r--r-- 1 root root 0  一  12 14:33 testAcl.txt
    
    # confirm settings
    $ getfacl testAcl.txt
    # file: testAcl.txt
    # owner: root
    # group: root
    user::rw-
    group::r--
    other::r--
    
    # try to write with "jerry"
    $ echo "acl" >  testAcl.txt
    bash: testAcl.txt: Permission denied
    
    # try to read with "jerry"
    $ cat  testAcl.txt
    
    # set write for "jerry"
    $ sudo setfacl -m u:jerry:w  testAcl.txt
    $ ll testAcl.txt
    -rw-rw-r--+ 1 root root 0  一  12 14:33 testAcl.txt
    $ getfacl testAcl.txt
    # file: testAcl.txt
    # owner: root
    # group: root
    user::rw-
    user:jerry:-w-
    group::r--
    mask::rw-
    other::r--
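
    To undo such entries later, setfacl removes a single entry with -x, or strips all extended ACL entries with -b (continuing with the same testAcl.txt):

    # remove jerry's entry only
    $ sudo setfacl -x u:jerry testAcl.txt
    # remove all extended ACL entries and the mask
    $ sudo setfacl -b testAcl.txt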
    
    

Hack#12 Make sudo Work hard


sudo allows a permitted user to execute a command as the superuser or another user.
sudo is a setuid program.

$ ls -l /usr/bin/sudo
-rwsr-xr-x 1 root root 149080  十  11 02:32 /usr/bin/sudo

The sudo command works in conjunction with security policies; the default security policy is sudoers, which is configured via the /etc/sudoers file.
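
The sudoers file should always be edited with visudo, which validates the syntax before saving. A sketch of a rule granting a hypothetical user jerry the right to run one command as root:

$ sudo visudo

# a hypothetical rule inside /etc/sudoers:
jerry   ALL=(ALL) /usr/bin/apt-get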

Hack#16 Fun with /proc


The /proc filesystem contains a representation of the kernel's live process table.
The directories with numeric names contain information about each process running on the system; the number corresponds to the PID.
Take a look at the structure for each process, some are useful:
  • cwd
  • exe
  • cmdline
  • environ
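
For example, the current shell's own entry can be inspected via its PID ($$); cmdline and environ are NUL-separated, so tr makes them readable:

$ ls -l /proc/$$/cwd             # symlink to the current working directory
$ tr '\0' ' ' < /proc/$$/cmdline; echo
$ tr '\0' '\n' < /proc/$$/environ | head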

Hack#18 Manage System resources Per Process


Bash provides a utility "ulimit".
“ulimit” isn’t a separate binary. It’s embedded into the shell itself.

$ ulimit
unlimited
With no options, ulimit reports the soft limit on file size (the -f resource), which is "unlimited" here.
To get the report in details,

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 15274
max locked memory       (kbytes, -l) 16384
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 15274
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

All the limits that’ll be applicable to the users are defined in:

/etc/security/limits.conf
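
Each line in limits.conf has the form <domain> <type> <item> <value>. A sketch that caps the number of open files for a hypothetical user jerry:

# /etc/security/limits.conf
jerry    soft    nofile    4096
jerry    hard    nofile    8192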

Partitioning


Partitioning a hard drive divides the available space into sections that can be accessed independently. An entire drive may be allocated to a single partition, or multiple ones for cases such as dual-booting, maintaining a swap partition, or to logically separate data such as audio and video files.

The required information is stored in a partition table scheme such as MBR or GPT.


MBR


The Master Boot Record (MBR) is the first 512 bytes of a storage device. It contains an operating system bootloader and the storage device's partition table. It plays an important role in the boot process under BIOS systems.

Note: The MBR is not located in a partition; it is located at the first sector(usually 512 bytes) of the device (physical offset 0), preceding the first partition.

  • bootstrap code
  • The first 440 bytes of MBR are the bootstrap code area. The bootstrap code can be backed up, restored from backup or erased using dd.
  • partition table
  • There are 3 types of partitions:
    • Primary
    • Primary partitions can be bootable and are limited to four partitions per disk
    • Extended
    • A hard disk can contain no more than one extended partition. The extended partition also counts as a primary partition, so a disk with an extended partition can hold at most three other primary partitions.
      • Logical
      • An extended partition can be subdivided into multiple logical partitions.
Each partition table entry describes its partition in two ways: the cylinder/head/sector (CHS) notation and the logical block addressing (LBA) notation. CHS notation is no longer used, because its 24-bit limit only covers 8 GB of disk space. A 32-bit LBA value covers up to 2 TB on drives with 512-byte sectors; for large-capacity drives with 4K sectors, the limit rises to 16 TB. Another drawback of the MBR is that the disk can hold at most four partition entries; if more than four partitions are needed, a special partition called an extended partition must be created. The extended partition structure has no form of error detection, so corruption is hard to locate.

GPT


The GUID partition table (GPT) partitioning scheme was introduced by Intel as part of an effort to introduce more modern firmware to generic PC hardware.
GPT is part of the Unified Extensible Firmware Interface (UEFI) specification; it uses globally unique identifiers (GUIDs), or UUIDs in the Linux world, to define partitions and partition types.
Compared with MBR, the GPT (Globally Unique Identifier Partition Table, or GUID Partition Table) partitioning style supports up to 128 partitions, with each partition as large as 18 EB (exabytes).
The high-level summary of the block layout used by GPT:

       ┌─────────────────────┬───────────────────────────────────┐
       │Block                │ Description                       │
       ├─────────────────────┼───────────────────────────────────┤
       │0                    │ Protective MBR                    │
       ├─────────────────────┼───────────────────────────────────┤
       │1                    │ Partition Table Header (primary)  │
       ├─────────────────────┼───────────────────────────────────┤
       │2 through 2+b-1      │ Partition Entry Array (primary)   │
       ├─────────────────────┼───────────────────────────────────┤
       │2+b through n-2-b    │ partition data                    │
       ├─────────────────────┼───────────────────────────────────┤
       │n-2-b+1 through n-2  │ Partition Entry Array (backup)    │
       ├─────────────────────┼───────────────────────────────────┤
       │n-1                  │ Partition Table Header (backup)   │
       └─────────────────────┴───────────────────────────────────┘
(Here n is the total number of blocks on the device and b is the size of the partition entry array in blocks.)


  • Protective MBR
  • At the start of a GPT disk there is a protective Master Boot Record (PMBR) to protect against GPT-unaware software. This protective MBR just like an ordinary MBR has a bootstrap code area which can be used for BIOS/GPT booting with boot loaders that support it. A GPT-unaware program sees the GPT disk as an MBR disk with a single, unknown partition.
  • Partition Table Header (primary)
  • A structure that defines various aspects of the disk:
    • a GUID to uniquely identify the disk
    • the starting block of the partition entry array
    • the size of each partition entry in that array
  • Partition Entry Array (primary)
  • An array of partition entries, each of which defines a partition (or is all zero, indicating that the entry is not in use). This array is treated as an array of bytes. The first partition entry starts at the first byte of the array, the next partition entry follows immediately after it, and so on. The size of these entries is given by a field in the partition table header. Each partition entry contains:
    • a GUID to uniquely identify the partition itself
    • a GUID to identify the partition type
    • the start and end block of the partition
    • the partition name
  • partition data
  • Partition Entry Array (backup)
  • Partition Table Header (backup)

Tools and Usages


  • Check for an existing partition
  • 
    $ sudo fdisk -l /dev/sdb
    Disk /dev/sdb: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: dos
    Disk identifier: 0x00027921
    
    Disklabel type indicates what partition table is applied: dos(for MBR) or gpt.
  • Create new partition table
  • 
    $ sudo parted /dev/sdb
    GNU Parted 3.2
    Using /dev/sdb
    Welcome to GNU Parted! Type 'help' to view a list of commands.
    
    • create a new GPT partition table
    • 
      (parted) mklabel gpt  
      
    • create a new MBR partition table (parted calls this label type msdos)
    • 
      (parted) mklabel msdos
      
  • Create new partitions
  • 
     mkpart part-type fs-type start end
    
    where
    • part-type
    • This is meaningful only for MBR partition tables:
      • primary
      • extended
      • logical
    • fs-type
    • This can be listed by entering help mkpart.
    • start
    • the beginning of the partition from the start of the device
    • end
    • the end of the partition from the start of the device
    Both start and end consist of a number followed by a unit:
    • s
    • sector (n bytes depending on the sector size, often 512)
    • MB
    • megabyte (1000000 bytes)
    • GB
    • gigabyte (1000000000 bytes)
    • %
    • percentage of the device (between 0 and 100)
    If you do not want to leave unused space in the device, make sure that each partition starts where the previous one ends.
    
    (parted) mkpart fat32 0GB 30GB
    (parted) print                                                            
    Model: Seagate Portable (scsi)
    Disk /dev/sdb: 1000GB
    Sector size (logical/physical): 512B/512B
    Partition Table: gpt
    Disk Flags: 
    
    Number  Start   End     Size    File system  Name   Flags
     1      1049kB  30.0GB  30.0GB               fat32
    
    (parted) mkpart ext4 30GB 250GB                                           
    (parted) print                                                           
    Model: Seagate Portable (scsi)
    Disk /dev/sdb: 1000GB
    Sector size (logical/physical): 512B/512B
    Partition Table: gpt
    Disk Flags: 
    
    Number  Start   End     Size    File system  Name   Flags
     1      1049kB  30.0GB  30.0GB               fat32
     2      30.0GB  250GB   220GB                ext4
    
    
    
  • delete a partition
  • 
    rm PartitionNumber
    
To get the partition information,

$ sudo blkid
/dev/sdb1: UUID="EC2E-4699" TYPE="vfat" PARTLABEL="fat32" PARTUUID="a884291d-6558-4acb-9c2e-8d56ba0cbf21"
/dev/sdb2: UUID="99349696-e669-4ec9-8f7c-475ea4e97102" TYPE="ext4" PARTLABEL="ext4" PARTUUID="ec6f13f8-45a5-4e74-9f91-fdd1741fc3b6"
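
The TYPE fields appear because filesystems were created on the new partitions after partitioning; a sketch of that step, assuming the two partitions created above:

$ sudo mkfs.vfat -F 32 /dev/sdb1
$ sudo mkfs.ext4 /dev/sdb2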


Booting Process


  • BIOS Booting
  • The BIOS firmware will be told which disk to boot the system from. It executes the bootloader it finds in the MBR of the specified disk, and that’s it. The firmware is no longer involved in booting. The BIOS firmware layer doesn’t really know what a bootloader is, or what an operating system is. All it can do is run the boot loader from a disk’s MBR.
    • bootloader is a Linux loader
    • GRUB does not fit in 440 bytes, the size of the Master Boot Record. Therefore, the bootstrap code that is loaded actually just parses the partition table, finds the /boot partition, and parses the filesystem information, it then loads Stage 2 GRUB. Stage 2 GRUB loads everything it needs, including the GRUB configuration, then presents a menu (or not, depending on user configuration). After a boot sequence is chosen, the Linux loader knows where the kernel file is and will load Linux kernel in RAM and execute it.
    • bootloader is not a Linux Loader
    • The bootstrap code loads the MBR of the active partition where the Linux loader is installed in.
    Both of the above methods load the kernel file through calls to the BIOS INT 13h service, which requires the Linux kernel file to be installed within the first 1024 cylinders. BIOS systems that are partitioned with GPT require a BIOS boot partition if GRUB is used as the bootloader.
  • UEFI Booting
  • UEFI stands for Unified Extensible Firmware Interface. It's a standard specification for the firmware interface on a computer. UEFI systems require an EFI system partition. The EFI system partition or ESP is a partition on a data storage device that is used by computers adhering to the UEFI. When a computer is booted, UEFI firmware loads files stored on the ESP to start installed operating systems and various utilities. The EFI system partition is formatted with a file system whose specification is based on the FAT file system and maintained as part of the UEFI specification. Both GPT- and MBR-partitioned disks can contain an EFI system partition, as UEFI firmware is required to support both partitioning schemes. UEFI provides backward compatibility with legacy systems by reserving the first block (sector) of the partition for compatibility code, the first sector of a partition is loaded into memory and execution is transferred to this code on legacy BIOS-based system. Many UEFI firmwares can boot a system just like a BIOS firmware would – they can look for an MBR on a disk, and execute the boot loader from that MBR, and leave everything subsequently up to that bootloader. Regular UEFI boot has several lists of possible boot entries, stored in UEFI config variables (normally in NVRAM), and boot order config variables stored alongside them. It allows for many different boot options, and a properly-defined fallback order. In many cases, you can even list and choose which OS / boot loader to use from the system boot menu (similar to the boot device menu implemented in many BIOSes). The boot sequence for UEFI consists of the following:
    • The boot order list is read from a globally defined NVRAM variable. Modifications to this variable are only guaranteed to take effect after the next platform reset. The boot order list defines a list of NVRAM variables that contain information about what is to be booted. Each NVRAM variable defines a name for the boot option that can be displayed to a user.
    • The variable also contains a pointer to the hardware device and to a file on that hardware device that contains the UEFI image to be loaded.
    • The variable might also contain paths to the OS partition and directory along with other configuration specific directories

Drive and partition backups with dd


dd is a command whose name is popularly said to stand for "data duplicator".
Note that dd copies "empty" space too, so if the partition is 200MB in size, even if it only contains 100MB of data, the output file will be 200MB in size.
  • One of the most typical use cases for the utility is the backup of the MBR.
  • To backup the MBR of /dev/sda disk,
    $ sudo dd if=/dev/sda bs=512 count=1 of=mbr.img
    
  • make an exact clone of the /boot partition to a backup file
  • 
    # dd if=/dev/sda1 of=/srv/boot.img
    
  • partitions can be restored from these backup copies
  • 
    # dd if=/srv/boot.img of=/dev/sda1
    
  • duplicate an existing drive to another
    • use fdisk to recreate the appropriately-sized partitions
    • 
      sfdisk -d /dev/sda | sfdisk /dev/sdb
      
    • make sure the partitions on /dev/sdb match those on /dev/sda
    • 
      # fdisk -l /dev/sda; fdisk -l /dev/sdb
      
    • copy the MBR from the first disk to the second
    • 
      # dd if=/dev/sda of=/dev/sdb bs=446 count=1
      
    • copy each partition
    • 
      # dd if=/dev/sda1 of=/dev/sdb1
      # dd if=/dev/sda2 of=/dev/sdb2
      

Understanding File System Superblock in Linux


Blocks in File System


The hard disk sector is the basic storage unit of the drive.
When a partition or disk is formatted, the sectors on the hard disk are first divided into small groups. These groups of sectors are called blocks. The block size of a file system is a software construct.
The Linux kernel performs all its operations on a file system using the file system's block size. The block size can never be smaller than the hard disk's sector size and will always be a multiple of the sector size. The Linux kernel also requires the file system block size to be smaller than or equal to the system page size. Use getconf to query system configuration variables:

$ getconf PAGE_SIZE
4096

The block size is something that can be specified when a user formats a partition using the command line parameters available.

mkfs -t ext3 -b 4096 /dev/sda1

The block size you select will impact the following things:
  • Maximum File Size
  • Maximum File System Size
  • Performance
The ext file system divides the entire space of the partition into equal-sized block groups. Block groups exist for ease of access during reads and writes. The number of blocks per group is fixed and cannot be changed.
The layout of a standard block group is approximately as follows: superblock, group descriptors, reserved GDT blocks, data block bitmap, inode bitmap, inode table, and then the data blocks.

For the special case of block group 0, the first 1024 bytes are unused, to allow for the installation of x86 boot sectors and other oddities. The superblock will start at offset 1024 bytes.

Superblock


The superblock records various information about the enclosing filesystem, such as block counts, inode counts, supported features, maintenance information, and more.
Similar to how inodes store the metadata of files, the superblock stores the metadata of the filesystem.
The superblock information of an existing file system can be viewed by using dumpe2fs,

$ sudo dumpe2fs -h /dev/sda1
[sudo] password for jerry: 
dumpe2fs 1.44.1 (24-Mar-2018)
Filesystem volume name:   
Last mounted on:          /
Filesystem UUID:          3db7ffaf-51bc-4f72-a09d-5ec2f3904c08
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              15269888
Block count:              61049344
Reserved block count:     3052467
Free blocks:              47122648
Free inodes:              14341481
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      1024
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Mon Jun 24 16:11:33 2019
Last mount time:          Tue Dec 31 16:02:30 2019
Last write time:          Tue Dec 31 16:02:23 2019
Mount count:              102
Maximum mount count:      -1
Last checked:             Mon Jun 24 16:11:33 2019
Check interval:           0 ()
Lifetime writes:          550 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:           256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
First orphan inode:       7837802
Default directory hash:   half_md4
Directory Hash Seed:      d9d5e4cd-c2d0-491c-88aa-1762b2295bb1
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0xcb19d304
Journal features:         journal_incompat_revoke journal_64bit journal_checksum_v3
Journal size:             1024M
Journal length:           262144
Journal sequence:         0x0019c036
Journal start:            131623
Journal checksum type:    crc32c
Journal checksum:         0x0f28cdd7


STORAGE ADMINISTRATION GUIDE

2.3. THE /PROC VIRTUAL FILE SYSTEM

The following /proc files are relevant in managing and monitoring system storage:
  • /proc/devices
  • Displays various character and block devices that are currently configured.
  • /proc/filesystems
  • Lists all file system types currently supported by the kernel.
  • /proc/mdstat
  • Contains current information on multiple-disk or RAID configurations on the system, if they exist.
  • /proc/mounts
  • Lists all mounts currently used by the system.
  • /proc/partitions
  • Contains partition block allocation information.


CHAPTER 18. USING THE MOUNT COMMAND

  • Listing Currently Mounted File Systems
  • 
    mount        
            
  • To display only the devices with a certain file system type
  • 
    mount -t ext4        
            
  • MOUNTING A FILE SYSTEM
  • 
    mount [option…] device directory        
            
    The device can be identified by:
    • a full path to a block device
    • /dev/sda3
    • a universally unique identifier
    • UUID=34795a28-ca6d-4fd8-a347-73671d0c19cb
    • a volume label
    • LABEL=home
  • SHARING MOUNTS
  • The mount command implements the --bind option that provides a means for duplicating certain mounts.
    
    mount --bind old_directory new_directory        
            
    This allows the file system under old_directory to be accessed via new_directory.
    To also duplicate the mounts nested under old_directory (its submounts), use
    
    mount --rbind old_directory new_directory        
            



6.4. BACKUP EXT2/3/4 FILE SYSTEMS

If the partition being backed up is an operating system partition, bootup your system into Single User Mode. Use dump to backup the contents of the partitions:

# dump -0uf /backup-files/sda1.dump /dev/sda1
Note:
  • If the system has been running for a long time, it is advisable to run e2fsck on the partitions before backup.
  • dump should not be used on a heavily loaded, mounted filesystem, as it could back up corrupted versions of files.
-level#
       The dump level (any integer). A level 0, full backup, specified by -0, guarantees that the entire file system is copied.

-f file
       Write the backup to file.

-u     Update the file /var/lib/dumpdates after a successful dump. The format of /var/lib/dumpdates is human-readable.

6.5. RESTORE AN EXT2/3/4 FILE SYSTEM

If you are restoring an operating system partition, bootup your system into Rescue Mode.
  • Format the destination partitions by using the mkfs command
  • Prepare the working directories.
  • 
    # mkdir /mnt/sda1
    # mount -t ext3 /dev/sda1 /mnt/sda1
    
  • Restore the data.
  • restore restores files or file systems from backups made with dump.
    
    # cd /mnt/sda1
    # restore -rf /backup-files/sda1.dump
    

12.2. FILESYSTEM-SPECIFIC INFORMATION FOR FSCK


The generic 'fsck' command will attempt to detect the filesystem type, or it will accept parameters specifying the type. 'e2fsck' is essentially a shortcut saying it's an ext2 filesystem. They all behave the same way and check the filesystem for errors.
fsck is simply a front end that calls the appropriate tool for the filesystem in question,

lrwxrwxrwx 1 root root      8  六  24  2019 /sbin/dosfsck -> fsck.fat
-rwxr-xr-x 1 root root 314080  九  27 02:01 /sbin/e2fsck
-rwxr-xr-x 1 root root  47232  八  23 07:47 /sbin/fsck
-rwxr-xr-x 1 root root  34928  八  23 07:47 /sbin/fsck.cramfs
lrwxrwxrwx 1 root root      6  九  27 02:01 /sbin/fsck.ext2 -> e2fsck
lrwxrwxrwx 1 root root      6  九  27 02:01 /sbin/fsck.ext3 -> e2fsck
lrwxrwxrwx 1 root root      6  九  27 02:01 /sbin/fsck.ext4 -> e2fsck
-rwxr-xr-x 1 root root  59472  一  25  2017 /sbin/fsck.fat
-rwxr-xr-x 1 root root  92264  八  23 07:47 /sbin/fsck.minix
lrwxrwxrwx 1 root root      8  六  24  2019 /sbin/fsck.msdos -> fsck.fat
lrwxrwxrwx 1 root root      8  六  24  2019 /sbin/fsck.vfat -> fsck.fat


If these filesystems encounter metadata inconsistencies while mounted, they record this fact in the filesystem superblock. If e2fsck finds that a filesystem is marked with such an error, it will perform a full check.

CHAPTER 13. PARTITIONS

parted is a program to manipulate disk partitions.

$ sudo parted -l 
[sudo] password for jerry: 
Model: ATA WDC WD2500BEKT-7 (scsi)
Disk /dev/sda: 250GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags: 

Number  Start   End    Size   Type     File system  Flags
 1      1049kB  250GB  250GB  primary  ext4         boot

CHAPTER 14. LVM (LOGICAL VOLUME MANAGER)

LVM2 can be used to gather existing storage devices into groups and allocate logical units from the combined space as needed. Physical volumes are regular storage devices; LVM writes a header to the device to allocate it for management. LVM combines physical volumes into storage pools known as volume groups.

A volume group can be sliced up into any number of logical volumes. Logical volumes are functionally equivalent to partitions on a physical disk, but with much more flexibility. In summary, LVM combines physical volumes into volume groups to unify the storage space available on a system; administrators can then segment the volume group into arbitrary logical volumes, which act as flexible partitions.

Each volume within a volume group is segmented into small, fixed-size chunks called extents. The extents on a physical volume are called physical extents, while the extents of a logical volume are called logical extents. A logical volume is simply a mapping that LVM maintains between logical and physical extents.

To use LVM, the first step is to scan the system for block devices that LVM can see and manage.

$ sudo lvmdiskscan | grep sd
  /dev/sda1   [     232.88 GiB] 
  /dev/sdb1   [     <27.94 GiB] 
  /dev/sdb2   [     204.89 GiB] 
  /dev/sdb3   [     <46.57 GiB] 

Warning: Make sure that you double-check that the devices you intend to use with LVM do not have any important data already written to them. Using these devices within LVM will overwrite the current contents.

We can mark these 2 physical partitions as physical volumes within LVM using the pvcreate command:

$ sudo pvcreate /dev/sdb2 /dev/sdb3
WARNING: ext4 signature detected on /dev/sdb2 at offset 1080. Wipe it? [y/n]: y
  Wiping ext4 signature on /dev/sdb2.
WARNING: ext4 signature detected on /dev/sdb3 at offset 1080. Wipe it? [y/n]: y
  Wiping ext4 signature on /dev/sdb3.
  Physical volume "/dev/sdb2" successfully created.
  Physical volume "/dev/sdb3" successfully created.
pvcreate initializes a physical volume (PV) so that it is recognized as belonging to LVM, and allows the physical volume to be used in a volume group (VG). A PV can be a disk partition, a whole disk, a meta device, or a loopback file. We can use pvs to display information about physical volumes,

$ sudo pvs
  PV         VG Fmt  Attr PSize   PFree  
  /dev/sdb2     lvm2 ---  204.89g 204.89g
  /dev/sdb3     lvm2 ---  <46.57g <46.57g

PVs can be removed:

$ sudo pvremove  /dev/sdb2 /dev/sdb3
  Labels on physical volume "/dev/sdb2" successfully wiped.
  Labels on physical volume "/dev/sdb3" successfully wiped.

Creating Volume Groups:

$ sudo vgcreate vg1 /dev/sdb2 /dev/sdb3
  Physical volume "/dev/sdb2" successfully created.
  Physical volume "/dev/sdb3" successfully created.
  Volume group "vg1" successfully created

We can see a brief summary of the volume group,

$sudo vgs
  VG  #PV #LV #SN Attr   VSize   VFree  
  vg1   2   0   0 wz--n- 251.45g 251.45g

Currently, the volume group has two physical volumes, zero logical volumes, and the combined capacity of the underlying devices. We can use the VG as a pool from which to allocate logical volumes. Use vgcreate to create a new VG on a PV, or vgextend to add a PV to an existing VG. To create logical volumes, we use the lvcreate command:
  • pass in the volume group to pull from
  • name the logical volume with the -n option
  • specify the size with the -L option
We can create the first three logical volumes like this:

$ sudo lvcreate -L 10G -n projects vg1
  Logical volume "projects" created.
$ sudo lvcreate -L 5G -n www vg1
  Logical volume "www" created.
$ sudo lvcreate -L 20G -n db vg1
  Logical volume "db" created.

We can see the logical volumes and their relationship to the volume group,

$ sudo vgs -o +lv_size,lv_name
  VG  #PV #LV #SN Attr   VSize   VFree   LSize  LV      
  vg1   2   3   0 wz--n- 251.45g 216.45g 10.00g projects
  vg1   2   3   0 wz--n- 251.45g 216.45g  5.00g www     
  vg1   2   3   0 wz--n- 251.45g 216.45g 20.00g db   

Now, we can allocate the rest of the space in the volume group to the “workspace” volume using the -l flag,

$ sudo lvcreate -l 100%FREE -n workspace vg1
  Logical volume "workspace" created.

$ sudo vgs -o +lv_size,lv_name
  VG  #PV #LV #SN Attr   VSize   VFree LSize   LV       
  vg1   2   4   0 wz--n- 251.45g    0   10.00g projects 
  vg1   2   4   0 wz--n- 251.45g    0    5.00g www      
  vg1   2   4   0 wz--n- 251.45g    0   20.00g db       
  vg1   2   4   0 wz--n- 251.45g    0  216.45g workspace

As you can see, the “vg1” volume group is completely allocated. The logical volume devices are available within the /dev directory just like other storage devices. You can access them in two places:

/dev/volume_group_name/logical_volume_name
/dev/mapper/volume_group_name-logical_volume_name
and format logical volumes with the Ext4 filesystem,

$ sudo mkfs.ext4 /dev/vg1/projects
$ sudo mkfs.ext4 /dev/vg1/www
$ sudo mkfs.ext4 /dev/vg1/db
$ sudo mkfs.ext4 /dev/vg1/workspace
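
Once formatted, a logical volume mounts like any other block device; a sketch using a hypothetical mount point:

$ sudo mkdir -p /mnt/projects
$ sudo mount /dev/vg1/projects /mnt/projects
$ df -h /mnt/projects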

Linux Server Hacks, Volume Two: Storage Management and Backups

#46 Create Flexible Storage with LVM

Logical volumes are filesystems that appear to be a single volume but are actually assembled from space allocated on multiple physical partitions. The size of a logical volume can exceed the size of any single physical storage device on your system, but it cannot exceed the sum of all their sizes.

Linux process management, user management and package management

Process Management


Every process has 6 or more IDs associated with it:
  • real user ID and real group ID
  • The real IDs of the executor. Only the superuser can change the real IDs.
  • effective user ID, effective group ID, supplementary group IDs
  • These determine file access permissions. If the set-user-ID or set-group-ID bit of a file is set, the effective user ID or effective group ID is set to that of the file's owner or group.
  • saved set-user-ID and saved set-group-ID
  • These are copied from the effective IDs by exec.

A process refers to a program in execution; it’s a running instance of a program.
The only way a new process is created by the kernel is when an existing process calls the fork() function.
The new process created by fork is called the child process.
Both the child and parent continue executing with the instruction that follows the call to fork. The child is a copy of its parent:
  • data
  • heap
  • stack
The parent and child do not share these portions of memory.
A fork is often followed by an exec.
The child process has the same environment as its parent; only the process ID differs. An executing program is identified by its process ID (PID) as well as its parent process's ID (PPID). You can use the pidof command to find the ID of a process:

$ pidof init
1

To find the process ID and parent process ID of the current shell, run:

$ echo $$
$ echo $PPID


An application that can only do one thing at a time is severely limited. This is where threads step in.

A process can have multiple threads.
Meaning threads will be part of a process (all threads of the same process will share same PID).

In Linux, processes and threads are almost the same. The major difference is that threads share the same virtual memory address space(not a copy). Processes run in separate virtual memory spaces.
A thread is a path of execution within a process. Threads share with other threads their code section, data section, and OS resources (like open files and signals). But, like process, a thread has its own program counter (PC), register set, and stack space.

The low level interface to create threads is the clone() system call. The higher level interface is pthread_create().

#include <pthread.h>

       int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                          void *(*start_routine) (void *), void *arg);

       Compile and link with -pthread.
The pthread_create() function starts a new thread in the calling process.
The new thread starts execution by invoking start_routine(); arg is passed as the sole argument of start_routine().

Why threads? Because communication between processes is not simple:
  • Inter-process communication involves overhead and is comparatively slow.
  • Context switching between threads is faster than switching between processes.
Although you can use the clone() system call to create a thread, it is recommended to use pthread_create() for portability reasons.
In Linux, thread IDs are reported as LWPs (Light Weight Processes); the corresponding ps column is also named LWP:

$ ps -efL
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
root         1     0     1  0    1 十二23 ?     00:00:10 /sbin/init splash
root         2     0     2  0    1 十二23 ?     00:00:00 [kthreadd]
...
jerry     2464     1  2464  0   33 十二23 tty2  00:11:23 /opt/google/chrome/chrome
jerry     2464     1  2472  0   33 十二23 tty2  00:00:00 /opt/google/chrome/chrome
jerry     2464     1  2479  0   33 十二23 tty2  00:00:00 /opt/google/chrome/chrome
jerry     2464     1  2480  0   33 十二23 tty2  00:00:00 /opt/google/chrome/chrome
jerry     2464     1  2483  0   33 十二23 tty2  00:00:00 /opt/google/chrome/chrome
jerry     2464     1  2484  0   33 十二23 tty2  00:04:37 /opt/google/chrome/chrome
jerry     2464     1  2485  0   33 十二23 tty2  00:00:00 /opt/google/chrome/chrome


A process group is a collection of one or more processes. Each process group can have a process group leader; the leader is identified by having a process ID equal to its process group ID.

A session is a collection of one or more process groups. A process establishes a new session by calling the setsid() function.

A controlling terminal is the terminal device(tty/pts):
  • A session can only have one controlling terminal.
  • The session leader that establishes the connection to the controlling terminal is called the controlling process.
  • The process groups within a session can be divided into a single foreground process group, and, one or more background process groups.

Foreground processes (also referred to as interactive processes) are initialized and controlled through a terminal session. Background processes (also referred to as non-interactive/automatic processes) – are processes not connected to a terminal; they don’t expect any user input. A new process is normally created when an existing process makes an exact copy of itself in memory by fork().

Job control allows us to start multiple jobs from a single terminal and controls which jobs can access the terminal and which jobs are to run in the background.
Job control requires:
  • A shell that supports job control
  • The terminal driver supports job control
  • Support for job control signals
To start a process in the background (non-interactive), use the & symbol. bash then assigns the command a job number. To see which jobs are currently in the background, use the jobs command; adding the -l option also lists each job's PID. The fg command brings a background job back to the foreground.
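A short illustration (the job number and PID shown are hypothetical):

$ sleep 300 &
[1] 12345
$ jobs -l
[1]+ 12345 Running                 sleep 300 &
$ fg %1
sleep 300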
pgrep looks through the currently running processes and lists the process IDs which match the selection criteria to stdout. All the criteria have to match. For example,

              $ pgrep -u root sshd
will only list the processes named sshd AND owned by root. On the other hand,

              $ pgrep -u root,daemon
will list the processes owned by root OR daemon. pkill will send the specified signal (by default SIGTERM) to each process matching a pattern.

              $ pkill chrom
The above kills all processes whose names contain "chrom", such as chrome. By contrast, "killall" needs an exact process name.

PROCESS STATE CODES:
Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state of a process:
D    uninterruptible sleep (usually IO)
R    running or runnable (on run queue)
S    interruptible sleep (waiting for an event to complete)
T    stopped by job control signal
t    stopped by debugger during tracing
W    paging (not valid since the 2.6.xx kernel)
X    dead (should never be seen)
Z    defunct ("zombie") process, terminated but not reaped by its parent
For BSD formats and when the stat keyword is used, additional characters may be displayed:
<    high-priority (not nice to other users)
N    low-priority (nice to other users)
L    has pages locked into memory (for real-time and custom IO)
s    is a session leader
l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
+    is in the foreground process group

Linux Server Hacks

Remove Unnecessary Services

Use "ps ax" to check if unnecessary services exists:
  • NFS
  • portmap, rpc.mountd, rpc.nfsd
  • Samba
  • smbd, nmbd
  • automount
  • All statically mount is setup via /etc/fstab.
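A quick way to look for these daemons (the pattern list is just an illustration):

$ ps ax | egrep 'rpc|smbd|nmbd|automount'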

User Management

  • Creating a User
  • adduser
  • Deleting, disabling account
    • passwd -l 'username'
    • Lock the password of the named account. This option disables a password by changing it to a value which matches no possible encrypted value (it adds a '!' at the beginning of the password).
    • userdel -r 'username'
    • With this option, files in the user's home directory will be removed along with the home directory itself and the user's mail spool.
  • Modify groups
  • The groupmod command modifies the definition of the specified GROUP by modifying the appropriate entry in the group database
  • Modify an account
    • add a user to a group
    • usermod -a -G GROUPNAME USERNAME
    • Gives information on all users
    • finger
deluser and delgroup remove users and groups from the system according to command-line options and configuration information in /etc/deluser.conf and /etc/adduser.conf. They are friendlier front ends to the userdel and groupdel programs.

Package Management

Debian/Ubuntu

dpkg

dpkg is a tool to install, build, remove and manage Debian packages. The primary and more user-friendly front-end for dpkg is aptitude.

aptitude

aptitude is a text-based interface to the Debian GNU/Linux package system. It allows the user to view the list of packages and to perform package management tasks such as installing, upgrading, and removing packages. Actions may be performed from a visual interface or from the command-line.

apt-get, apt

apt-get is the command-line tool for handling packages, and may be considered the user's "back-end" to other tools (aptitude, synaptic and wajig) using the APT library. apt provides a high-level command-line interface for the package management system. It is intended as an end-user interface and enables some options better suited for interactive usage by default. apt is less commonly used and differs from apt-get mostly in terms of output formatting.
  • update
  • update is used to resynchronize the package index files from their sources. The indexes of available packages are fetched from the location(s) specified in /etc/apt/sources.list.
  • upgrade
  • upgrade is used to install the newest versions of all packages currently installed on the system from the sources enumerated in /etc/apt/sources.list.
  • install
  • install is followed by one or more packages desired for installation or upgrading.
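A typical refresh-then-install session looks like this (the package name is only an example):

$ sudo apt update          # resynchronize the package index files
$ sudo apt upgrade         # install the newest versions of all installed packages
$ sudo apt install nginx   # install or upgrade a specific package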

Network skill

IP

Ip header

  • big endian
  • Packets are transmitted in the order: bits 0-7, 8-15, 16-23, 24-31. This is called big-endian byte ordering or network byte order. Endianness determines the order in which the bytes of a multi-byte value are placed in memory. For a 4-byte (32-bit) value at address 0x100, the bytes of "0x01234567" are stored as follows (a byte-order probe in C follows this list):
    • big-endian
    • The LSB is stored at the highest address.
      
      0x100: 01 23 45 67
      0x104: 
      
    • little-endian
    • The LSB is stored at the lowest address.
      
      0x100: 67 45 23 01
      0x104: 
      
  • fragmentation
  • The maximum length of an IP datagram is 65535 bytes (the total-length field is 16 bits); link layers with a smaller MTU force fragmentation, and some link layers pad short frames when necessary. The identification field is incremented by 1 each time a datagram is sent, which is useful for the fragmentation and reassembly process. The flags and fragment-offset fields are used for fragmentation.
  • TTL(time-to-live)
  • TTL sets an upper limit on the number of routers through which a datagram can pass. It is initialized by the sender to some value (commonly 32 or 64) and decremented by 1 by every router that handles the datagram. When this field reaches 0, the datagram is thrown away and the sender is notified with an ICMP message.
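As mentioned under the byte-ordering bullet above, a minimal probe in C shows the host's endianness (htonl() converts host byte order to network byte order):

    #include <stdio.h>
    #include <arpa/inet.h>

    int main(void)
    {
        unsigned int x = 0x01234567;
        unsigned char *p = (unsigned char *)&x;
        /* A little-endian host prints "67 45 23 01";
           a big-endian host prints "01 23 45 67". */
        printf("%02x %02x %02x %02x\n", p[0], p[1], p[2], p[3]);
        /* After htonl() the bytes are in network (big-endian) order, so on a
           little-endian host the value prints as 0x67452301. */
        printf("0x%08x\n", htonl(x));
        return 0;
    }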

IP Routing

The IP layer has a routing table in memory that it searches each time it receives a datagram. Each entry in the routing table contains the following information:
  • destination IP
  • IP of the next-hop router
  • flags
  • Specify if the destination IP is for a network or a host.
  • interface
  • Which network interface should be used for transmission
There is an entry labeled "default". If the destination IP doesn't match any other entry, packets are sent to the next-hop router of the default entry.

Subnet


       ┌──────┬───────────────────────────┬─────────────────┬───────────────────┐
       │Class │ Address range             │ Usable networks │ Hosts per network │
       ├──────┼───────────────────────────┼─────────────────┼───────────────────┤
       │A     │ 0.0.0.0~127.0.0.0         │ 126             │ 16,777,214        │
       ├──────┼───────────────────────────┼─────────────────┼───────────────────┤
       │B     │ 128.0.0.0~191.255.0.0     │ 16,383          │ 65,534            │
       ├──────┼───────────────────────────┼─────────────────┼───────────────────┤
       │C     │ 192.0.0.0~223.255.255.0   │ 2,097,152       │ 254               │
       ├──────┼───────────────────────────┼─────────────────┼───────────────────┤
       │D     │ 224.0.0.0~239.255.255.255 │                 │                   │
       ├──────┼───────────────────────────┼─────────────────┼───────────────────┤
       │E     │ 240.0.0.0~255.255.255.255 │                 │                   │
       └──────┴───────────────────────────┴─────────────────┴───────────────────┘
All IPv4 IP addresses can be divided into two major groups: public or private.
Private internal addresses are not routed on the Internet, and no traffic can be sent to them from the Internet; they are only supposed to work within the local network. Private addresses include IP addresses from the following classes:
  • A
  • Range from 10.0.0.0 to 10.255.255.255, 10.0.0.0/255.0.0.0
  • B
  • Range from 172.16.0.0 to 172.31.255.255, 172.16.0.0/255.240.0.0
  • C
  • Range from 192.168.0.0 to 192.168.255.255, 192.168.0.0/255.255.0.0

The host ID portion can be divided into a subnet ID and a host ID. The subnet mask is a 32-bit value containing "1" bits for the network ID and subnet ID, and "0" bits for the host ID. Therefore,
  • the IP address tells you which network class you are using
  • the subnet mask tells you the boundary between subnets and hosts
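Extracting the network part is a bitwise AND of the address and the mask; a minimal sketch (the address and /26 mask are just an illustration):

    #include <stdio.h>
    #include <arpa/inet.h>

    int main(void)
    {
        struct in_addr ip, mask, net;
        inet_aton("192.168.1.130", &ip);         /* example host address */
        inet_aton("255.255.255.192", &mask);     /* a /26 subnet mask */
        net.s_addr = ip.s_addr & mask.s_addr;    /* network = address AND mask */
        printf("network: %s\n", inet_ntoa(net)); /* prints 192.168.1.128 */
        return 0;
    }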

TCP



TCP services

TCP provides reliability:
  • the chunk of bytes that TCP passes to IP is called a segment.
  • TCP uses a timer to wait for an ACK to the segment it sent.
  • If an ACK is not received in time, the segment is re-transmitted.
  • TCP sends an ACK for the data it receives
  • When a new connection is being established, the SYN flag is turned on. The sequence number field contains the initial sequence number chosen by the host initiating the connection. The acknowledgment number contains the next sequence number that the sender of the ACK expects to receive (acknowledgment number = received sequence number + 1). TCP provides a full-duplex service, so each end of a connection must maintain a sequence number in each direction.
  • TCP maintains a checksum on its header and data
  • If a segment arrives with an invalid checksum, TCP discards it and does not ACK it.
  • TCP resequences the received data when necessary
  • IP datagrams can arrive out of order.
  • TCP must discard duplicate received data
  • TCP provides flow control
  • A receiving TCP only allows the other end to send as much data as its receive buffer can hold. Every time TCP receives a packet it needs to send an ACK, and this ACK message includes the 16-bit current receive window size field,
    
       rwnd_size = ReceiveBuffer - (LastByteReceived – LastByteReadByApplication)
    
    so the sender knows whether it can keep sending data. TCP uses a sliding window protocol to control the number of bytes it can send, which makes sure it never has more bytes in flight than the window advertised by the receiver. The sender always keeps this invariant:
    
      LastByteSent - LastByteAcked <= ReceiveWindowAdvertised
    
    In today's networks, this 16-bit window size (max. 65,535) is not enough to provide optimal traffic flow, so TCP options were introduced in RFC 1323 that let the TCP receive window be multiplied by a power-of-two scale factor. The specific function is called TCP window scaling, and it is advertised during the handshake. If one side or the other cannot support scaling, neither will use this function. The scale factor, or multiplier, is only sent in the SYN packets during the handshake and is used for the life of the connection. When the TCP sender receives an ACK with a zero-window message it starts the persist timer: TCP stops transmitting data and periodically sends a small packet to the receiver (usually called a ZeroWindowProbe in the RFCs), so the receiver has a chance to advertise a nonzero window size. The transmitting host SHOULD send the first zero-window probe when a zero window has existed for the retransmission timeout period, and SHOULD increase exponentially the timeout interval between successive probes.
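A quick worked example of the window arithmetic (all numbers are illustrative): with a 65,535-byte receive buffer, 20,000 bytes received so far, and 5,000 bytes already read by the application,

    rwnd_size = 65535 - (20000 - 5000) = 50535

so the receiver advertises a 50,535-byte window and the sender may have at most that many unacknowledged bytes in flight.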

TCP Connection Establishment and Termination


  • establishment: 3-way handshake
    1. the client sends SYN + ISN
    2. the server responds with its own SYN + server's ISN + ACK(client's ISN+1)
    3. the client ACK(server's ISN )
  • termination
  • Either end can send a FIN when it is done sending data. When a TCP receives a FIN, it sends back an ACK (received sequence number + 1). Therefore, termination needs 4 steps.

TCP Timers


TCP maintains the connection state internally and requires timers to keep track of events.
TCP requires several timers to maintain state on the transmit side of the protocol.
In Linux, when timers are initialized, they are given an associated function that is called when the timer goes off. Each timer function for TCP is passed a pointer to the sock structure. The timer uses the sock to know which connection it is dealing with. The timer functions can be found in the file linux/net/tcp_timer.c.
  • tcp_retransmit_timer()
  • TCP uses a time out timer for retransmission of lost segments. This is called when the retransmit timer expires, indicating that an expected acknowledgment was not received.
    • The sender starts a timeout timer after transmitting a TCP segment to the receiver.
      • If the sender receives an ACK before the timer goes off, it stops the timer.
      • If the sender does not receive any acknowledgment and the timer goes off, TCP retransmission occurs:
      • the sender retransmits the same segment and resets the timer.
    • The value of the timeout timer is dynamic and changes with the amount of traffic in the network.
  • tcp_probe_timer()
  • The zero window timer is set when this side of the connection sends out a zero window probe in response to a zero window advertisement from the peer. We arrive at this function because the timer expired before a response was received to the zero window probe.
  • tcp_delack_timer()
  • This is to minimize the number of separate ACKs that are sent. The receiver does not send an ACK as soon as it can. The delayed acknowledgment timer is set to the amount of time to hold the ACK waiting for outgoing data to be ready.
  • tcp_keepalive_timer()
  • TCP uses a keepalive timer to prevent long-idle TCP connections. If a client opens a TCP connection to a server, transfers some data, goes silent, and then crashes, the connection would otherwise remain open forever. TCP normally does not perform any keepalive function; keepalive polling is not part of the TCP specification. It was added outside the specification for the use of some TCP application-layer servers whose protocols don't do any connection polling themselves. For example, the telnet daemon sets the keepalive mode. (A per-socket sketch follows this list.)
    • Each time the server hears from the client, it resets the keepalive timer to 2 hours.
    • If the server does not hear from the client for 2 hours, it sends probe segments to the client.
    • These probe segments are sent at intervals of 75 seconds.
    • If the server receives no response after sending 10 probe segments, it assumes that the client is down.
    • Then the server terminates the connection automatically.
    In addition to providing the keepalive timeout function, it is also used as a SYN acknowledge timer by a socket in a listen state.
  • Time Wait Timer
  • TCP uses a time wait timer during connection termination.
    • Sender starts the time wait timer after sending the ACK for the second FIN segment.
    • It allows the final acknowledgment to be resent if it gets lost.
    • It prevents the just closed port from reopening again quickly to some other application.
    • It ensures that all the segments heading towards the just closed port are discarded.
    • The value of time wait timer is usually set to twice the lifetime of a TCP segment.
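Keepalive is enabled per socket, as noted in the keepalive bullet above. A minimal sketch, assuming Linux (the TCP_KEEP* socket options are Linux-specific), mirroring the 2-hour/75-second/10-probe behavior:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    int enable_keepalive(int fd)
    {
        int on = 1, idle = 7200, intvl = 75, cnt = 10;
        /* Turn on keepalive probing for this socket. */
        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
            return -1;
        /* Idle seconds before the first probe, seconds between probes, and how
           many unanswered probes are needed before the peer is declared dead. */
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
        return 0;
    }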

Firewall

A Packet-Filtering Firewall

A packet-filtering firewall consists of a list of acceptance and denial rules. The lists of rules defining what can come in and what can go out are called chains. A packet is matched against each rule in the list, one by one, until a match is found or the list is exhausted.

Choosing a Default Packet-filtering Policy

If the packet does not match any rule, the default policy for a chain is applied to the packet,
  • ACCEPT
  • means to let the packet through.
  • DROP
  • means to drop the packet on the floor.
  • RETURN
  • means stop traversing this chain and resume at the next rule in the previous (calling) chain.
In a security-conscious system, this policy usually tells the kernel to DROP the packet.

How Packets Traverse The Filters

The kernel starts with three lists of rules (chains) in the `filter' table: INPUT, OUTPUT and FORWARD.

Incoming                                          Outgoing
                                                     
   |                                                  ^
   |                                                  |
   v                                                  |
 ------------                                  --------------
|Pre-Routing |                                 |Post-routing |
 ------------                                  --------------
   |                                                  |
   |                        _____                     |
   |                       /     \                    |
   +---- -->[Routing ]--->|FORWARD|-------+--->-------+
            [Decision]     \_____/        ^
                 |                        |
                 v                       ____
                ___                     /    \
               /   \                   |OUTPUT|
              |INPUT|                   \____/
               \___/                      ^
                 |                        |
                 +----> Local Process ->--+

  • When a packet comes in (say, through the Ethernet card) the kernel first looks at the destination of the packet: this is called `routing'.
  • If it's destined for this box, the packet passes downwards in the diagram, to the INPUT chain. If it passes this, any processes waiting for that packet will receive it.
  • If the kernel does not have forwarding enabled, or it doesn't know how to forward the packet, the packet is dropped.
  • If forwarding is enabled, and the packet is destined for another network interface (if you have another one), then the packet goes rightwards on our diagram to the FORWARD chain. If it is ACCEPTed, it will be sent out.
  • A program running on the box can send network packets. These packets pass through the OUTPUT chain immediately: if it says ACCEPT, then the packet continues out to whatever interface it is destined for.
There are five netfilter hooks that programs can register with. As packets traverse the stack, they trigger the kernel modules registered with these hooks. The hooks represent various well-defined points in the networking stack:
  • NF_IP_PRE_ROUTING
  • This hook will be triggered by any incoming traffic very soon after entering the network stack. This hook is processed before any routing decisions have been made regarding where to send the packet.
  • NF_IP_LOCAL_IN
  • This hook is triggered after an incoming packet has been routed if the packet is destined for the local system.
  • NF_IP_FORWARD
  • This hook is triggered after an incoming packet has been routed if the packet is to be forwarded to another host.
  • NF_IP_LOCAL_OUT
  • This hook is triggered by any locally created outbound traffic as soon as it hits the network stack.
  • NF_IP_POST_ROUTING
  • This hook is triggered by any outgoing or forwarded traffic after routing has taken place and just before being put out on the wire.

Using iptables

iptables and ip6tables are used to set up, maintain, and inspect the tables of IPv4 and IPv6 packet filter rules in the Linux kernel. iptables uses tables to organize its rules; these tables classify rules according to the type of decision they are used to make. The names of the built-in chains mirror the names of the netfilter hooks they are associated with:
  • PREROUTING
  • Triggered by the NF_IP_PRE_ROUTING hook.
  • INPUT
  • Triggered by the NF_IP_LOCAL_IN hook.
  • FORWARD
  • Triggered by the NF_IP_FORWARD hook.
  • OUTPUT
  • Triggered by the NF_IP_LOCAL_OUT hook.
  • POSTROUTING
  • Triggered by the NF_IP_POST_ROUTING hook.
Each table has its chains. The 3 primary tables:
  • filter
  • The default table. The built-in chains:
    • INPUT
    • OUTPUT
    • FORWARD
    This contains the actual filtering rules.
  • nat
    • PREROUTING
    • OUTPUT
    • POSTROUTING
    This contains rules for address/port translation.
  • mangle
    • PREROUTING
    • OUTPUT
    This contains rules for changing packet-routing flags.
A firewall rule specifies criteria for a packet and a target. If the packet does not match, the next rule in the chain is examined; if it does match, then the next rule is specified by the value of the target, which can be
  • the name of a user-defined chain
  • one of the targets described in iptables-extensions(8)
  • one of the special values ACCEPT, DROP or RETURN.
SYNOPSIS:
  • Create a new user-defined chain by the given name.
  • iptables [-t table] -N chain
  • Flush the selected chain (all the chains in the table if none is given).
  • iptables [-t table] -F [chain [rulenum]] [options...]
  • Append one or more rules to the end of the selected chain (-A), check whether a rule exists (-C), or delete it (-D).
  • iptables [-t table] {-A|-C|-D} chain rule-specification
  • Set the default policy for the built-in (non-user-defined) chain to the given target (ACCEPT or DROP).
  • iptables [-t table] -P chain target
  • List all rules in the selected chain.
  • iptables [-t table] -L [chain [rulenum]] [options...] The list command can take additional options:
    • -n
    • List IP addresses and port numbers numerically rather than by name.
    • -v
    • List additional information such as counters.
    • --line-numbers
    • List the rule's position within the chain
    • -x
    • List exact values of counters
  • Insert one or more rules in the selected chain as the given rule number.
  • iptables [-t table] -I chain [rulenum] rule-specification
  • Replace a rule in the selected chain.
  • iptables [-t table] -R chain rulenum rule-specification
  • Delete one or more rules from the selected chain.
  • iptables [-t table] -D chain rulenum
  • Print all rules in the selected chain.
  • iptables [-t table] -S [chain [rulenum]]
  • Delete the optional user-defined chain specified.
  • iptables [-t table] -X [chain]
  • Rename the user-specified chain to the user-supplied name.
  • iptables [-t table] -E old-chain-name new-chain-name
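Putting a few of these together, a minimal stateful-firewall sketch (run as root; allowing only SSH is just an illustration, not a recommended policy; the state match is described below):

iptables -P INPUT DROP                                           # default policy: drop
iptables -A INPUT -i lo -j ACCEPT                                # allow loopback
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT # allow replies
iptables -A INPUT -p tcp --dport 22 --syn -j ACCEPT              # allow new SSH connections
iptables -L -n -v --line-numbers                                 # review the result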

rule-specification = parameter-1 option-1
                     ...
                     parameter-n option-n

A rule specification is composed of pairs of parameters and options that define what happens when a packet matches the rule.
Basic parameters:
  • -i [!] name
  • Name of an interface via which a packet was received (only for packets entering the INPUT, FORWARD and PREROUTING chains).
  • -o [!] name
  • Name of an interface via which a packet is going to be sent (for packets entering the FORWARD, OUTPUT and POSTROUTING chains).
  • -p [!] protocol
  • The protocol of the rule or of the packet to check. The specified protocol can be one of tcp, udp, icmp, or all, or it can be a numeric value, representing one of these protocols or a different one. A protocol name from /etc/protocols is also allowed.
  • -s [!] address[/mask]
  • Source specification. Address can be either a network name, a hostname, a network IP address (with /mask), or a plain IP address. The mask can be either a network mask or a plain number, specifying the number of 1's at the left side of the network mask. Thus, a mask of 24 is equivalent to 255.255.255.0.
  • -d [!] address[/mask]
  • Destination specification.
  • -j target
  • This specifies the target of the rule; i.e., what to do if the packet matches it. The target can be a user-defined chain, one of the special builtin targets, or an extension. If this option is omitted in a rule, then matching the rule will have no effect on the packet's fate, but the counters on the rule will be incremented.
Different network protocols provide specialized matching options which can be configured to match a particular packet using that protocol.
  • options are available for the TCP protocol (-p tcp)
    • --dport [!] port[:port]
    • Matches the destination port of the packet. Use either a network service name (such as www or smtp), a port number, or a range of port numbers.
    • --sport [!] port[:port]
    • Matches the source port of the packet.
    • [!] --syn
    • Only match TCP packets with the SYN bit set and the ACK,RST and FIN bits cleared. Such packets are used to request TCP connection initiation.
    • --tcp-flags [!] mask comp
    • The first argument is the flags which we should examine, written as a comma-separated list, and the second argument is a comma-separated list of flags which must be set. For example, "-p tcp --tcp-flags ACK,FIN,SYN SYN" is equivalent to "--syn".
    • --tcp-option [!] number
    • Matches if the given TCP option is set.
  • options are available for the UDP protocol (-p udp)
    • --dport [!] port[:port]
    • --sport [!] port[:port]
  • ICMP Protocol (-p icmp)
    • --icmp-type [!] typename
    • Matches the given ICMP type, by name or number.
Additional match options are also available through modules loaded by the iptables command.
To use a match option module, load the module by name using the -m option.
  • -m limit
  • Places limits on how many packets are matched to a particular rule. A rule using this extension will match until this limit is reached. It can be used in combination with the LOG target to give limited logging.
    • --limit rate
    • Maximum average matching rate: specified as a number, with an optional '/second', '/minute', '/hour', or '/day' suffix; the default is 3/hour.
    • --limit-burst number
    • Maximum initial number of packets to match: this number gets recharged by one every time the limit specified above is not reached, up to this number; the default is 5.
  • -m state
  • Enables access to the connection tracking state for this packet.
    
     --state state
    
    Where state is a comma separated list of the connection states to match:
    • ESTABLISHED
    • The matching packet is associated with other packets in an established connection.
    • INVALID
    • The matching packet cannot be tied to a known connection.
    • NEW
    • The matching packet is either creating a new connection or is part of a two-way connection not previously seen.
    • RELATED
    • The matching packet is starting a new connection related in some way to an existing connection.
  • -m mac
  • Enables hardware MAC address matching.
    • --mac-source [!] address
    • It must be of the form XX:XX:XX:XX:XX:XX. Note that this only makes sense for packets coming from an Ethernet device and entering the PREROUTING, FORWARD or INPUT chains.
  • -m mark
  • Enables matching on the netfilter mark value associated with a packet.
    • --mark value[/mask]
    • Matches packets with the given unsigned mark value which was set at some earlier point.
  • -m tos
  • This module matches the 8 bits of the Type of Service field in the IP header (i.e., including the precedence bits).
    • --tos value
    • The value can be a string or a numeric value.
Once a packet has matched a particular rule, the rule can direct the packet to a number of different targets that decide its fate and, possibly, take additional actions. Each chain has a default target, which is used if none of the rules on that chain match a packet or if none of the rules which match the packet specify a target.
The following are the standard targets:
  • -j user-defined-chain
  • This target passes the packet to the target user-defined-chain.
  • -j ACCEPT
  • Allows the packet to successfully move on to its destination or another chain.
  • -j DROP
  • Drops the packet without responding to the requester.
  • -j QUEUE
  • The packet is queued for handling by a user-space application.
  • -j RETURN
  • Stops checking the packet against rules in the current chain; the packet is returned to the calling chain to resume rule checking where it left off.
iptables can use extended target modules: the following are included in the standard distribution.
  • -j LOG
  • Logs all packets that match this rule. Since the packets are logged by the kernel, they can be read with dmesg or via syslogd; the /etc/syslog.conf file determines where these log entries are written. By default, they are placed in the /var/log/messages file. This is a "non-terminating target" (the packet is not dropped), i.e. rule traversal continues at the next rule. To specify the way in which logging occurs:
    • --log-level
    • Sets the priority level of a logging event. A list of priority levels can be found within the syslog.conf man page.
    • --log-ip-options
    • Logs any options set in the header of a IP packet.
    • --log-prefix
    • Places a string of up to 29 characters before the log line when it is written. This is useful for writing syslog filters for use in conjunction with packet logging.
    • --log-tcp-options
    • Logs any options set in the header of a TCP packet.
    • --log-tcp-sequence
    • Writes the TCP sequence number for the packet in the log.
  • -j REJECT
  • This is used to send back an error packet in response to the matched packet. The following option controls the nature of the error packet returned:
    
    --reject-with type
    
    where type:
    • icmp-net-unreachable
    • icmp-host-unreachable
    • icmp-port-unreachable
    • icmp-proto-unreachable
    • icmp-net-prohibited
    • icmp-host-prohibited
  • -j SNAT
  • This target is only valid in the nat table, in the POSTROUTING chain. It specifies that the source address of the packet should be modified (and all future packets in this connection will also be mangled), and rules should cease being examined. It takes one type of option:
    • --to-source ipaddr[-ipaddr][:port-port]
    • which can specify a single new source IP address, an inclusive range of IP addresses, and optionally, a port range. The source port is mapped to a free port if not assigned.
  • -j MASQUERADE
  • This target is only valid in the nat table, in the POSTROUTING chain. It should only be used with dynamically assigned IP (dialup) connections. ( if you have a static IP address, you should use the SNAT target.)
  • -j DNAT
  • This target is only valid in the nat table, in the PREROUTING and OUTPUT chains. It specifies that the destination address of the packet should be modified. It takes one type of option:
    • --to-destination ipaddr[-ipaddr][:port-port]
    • which can specify a single new destination IP address, an inclusive range of IP addresses, and optionally, a port range. If no port range is specified, then the destination port will never be modified. This feature is useful when you want to forward connections to internal servers that are not publicly visible.
  • -j REDIRECT
  • This target is only valid in the nat table, in the PREROUTING and OUTPUT chains. It redirects the packet to the machine itself by changing the destination IP to the primary address of the incoming interface. It takes one option:
    • --to-ports port[-port]
    • This specifies a destination port or range of ports to use: without this, the destination port is never altered.
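For instance, two common NAT setups, sketched with assumed names and addresses (eth0 facing the Internet, 192.168.1.10 an internal web server):

# masquerade outbound LAN traffic behind a dynamically assigned public IP
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# forward incoming web traffic to the internal server
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j DNAT --to-destination 192.168.1.10:8080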

DNS: The Domain Name System

DNS is a distributed database that is used by TCP/IP applications to map between hostnames and IP addresses. DNS provides the protocol for clients and servers to communicate with each other. From the application's point of view, access to the DNS is through a name resolver, which contacts one or more name servers to do the mapping. The resolver can be accessed through two library functions:
  • getaddrinfo()
  • gethostbyname() is deprecated.
  • getnameinfo()
  • gethostbyaddr() is deprecated.
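A minimal resolver sketch using getaddrinfo() (the hostname and port are just examples):

    #include <stdio.h>
    #include <string.h>
    #include <netdb.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct addrinfo hints, *res, *p;
        char buf[INET6_ADDRSTRLEN];

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;      /* both IPv4 and IPv6 */
        hints.ai_socktype = SOCK_STREAM;

        if (getaddrinfo("example.org", "80", &hints, &res) != 0)
            return 1;
        for (p = res; p != NULL; p = p->ai_next) {
            void *addr = (p->ai_family == AF_INET)
                ? (void *)&((struct sockaddr_in *)p->ai_addr)->sin_addr
                : (void *)&((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;
            printf("%s\n", inet_ntop(p->ai_family, addr, buf, sizeof(buf)));
        }
        freeaddrinfo(res);
        return 0;
    }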

DNS Basics

The DNS name space is a hierarchical tree, similar to a file system.
  • node
  • Every node has a label (max 63 characters)
  • root
  • The root is a node with a null label.
  • domain name
  • The domain name of any node is the list of labels: starting at the node, walking up to the root, using "." to separate labels.
  • FQDN(fully qualified domain name)
  • A domain name that ends with a "." is called an absolute domain name or FQDN.
For 3-character generic domains, .gov and .mil are used only in the U.S. Many countries form second-level domains beneath their 2-character country code: .com.tw and .edu.tw are used for commercial and educational organizations respectively in Taiwan. One entity maintains the top-level domains and delegates responsibility for specific zones to others. A zone is a subtree of the DNS tree. Once the authority for a zone is delegated, the person responsible for the zone must provide a primary name server for that zone and one or more secondary name servers. The main difference between a primary and a secondary is that the primary loads all the information for the zone from disk files, while a secondary obtains all of its information from the primary at regular intervals. When a name server does not contain the requested information, it must contact another name server. At a minimum, every name server must know how to contact the root name servers, and the root name servers know the name servers for all second-level domains.

DNS message format

DNS configuration

The file /etc/resolv.conf is now used only indirectly; the network manager handles the name-server configuration. To see the DNS servers for an interface:

nmcli device show interfacename | grep IP4.DNS

Linux Server Hacks, Volume Two: System Services

#20 Quick and Easy DHCP Setup

  • Installing a DHCP Server
  • Debian:
    
    apt-get install isc-dhcp-server
    
    Fedora:
    
    yum install dhcp
    
  • Configuring Simple DHCP Services
  • The configuration file for this service is /etc/dhcp/dhcpd.conf:
    
    # option definitions common to all supported networks...
    option domain-name "example.org";
    option domain-name-servers ns1.example.org, ns2.example.org;
    
    default-lease-time 600;
    max-lease-time 7200;
    option domain-name "isc.org";
    option domain-name-servers ns1.isc.org, ns2.isc.org;
    
    # The ddns-updates-style parameter controls whether or not the server will
    # attempt to do a DNS update when a lease is confirmed. We default to the
    # behavior of the version 2 packages ('none', since DHCP v2 didn't
    # have support for DDNS.)
    ddns-update-style none;
    
    # If this DHCP server is the official DHCP server for the local
    # network, the authoritative directive should be uncommented.
    #authoritative;
    
    # Use this to send dhcp log messages to a different log file (you also
    # have to hack syslog.conf to complete the redirection).
    #log-facility local7;
    
    # No service will be given on this subnet, but declaring it helps the 
    # DHCP server to understand the network topology.
    #subnet 10.152.187.0 netmask 255.255.255.0 {
    #}
    
    # This is a very basic subnet declaration.
    
    #subnet 10.254.239.0 netmask 255.255.255.224 {
    #  range 10.254.239.10 10.254.239.20;
    #  option routers rtr-239-0-1.example.org, rtr-239-0-2.example.org;
    #}
    
    # This declaration allows BOOTP clients to get dynamic addresses,
    # which we don't really recommend.
    
    #subnet 10.254.239.32 netmask 255.255.255.224 {
    #  range dynamic-bootp 10.254.239.40 10.254.239.60;
    #  option broadcast-address 10.254.239.31;
    #  option routers rtr-239-32-1.example.org;
    #}
    
    # A slightly different configuration for an internal subnet.
    #subnet 10.5.5.0 netmask 255.255.255.224 {
    #  range 10.5.5.26 10.5.5.30;
    #  option domain-name-servers ns1.internal.example.org;
    #  option domain-name "internal.example.org";
    #  option subnet-mask 255.255.255.224;
    #  option routers 10.5.5.1;
    #  option broadcast-address 10.5.5.31;
    #  default-lease-time 600;
    #  max-lease-time 7200;
    #}
    # Hosts which require special configuration options can be listed in
    # host statements.   If no address is specified, the address will be
    # allocated dynamically (if possible), but the host-specific information
    # will still come from the host declaration.
    
    #host passacaglia {
    #  hardware ethernet 0:0:c0:5d:bd:95;
    #  filename "vmunix.passacaglia";
    #  server-name "toccata.example.com";
    #}
    
    # Fixed IP addresses can also be specified for hosts.   These addresses
    # should not also be listed as being available for dynamic assignment.
    # Hosts for which fixed IP addresses have been specified can boot using
    # BOOTP or DHCP.   Hosts for which no fixed address is specified can only
    # be booted with DHCP, unless there is an address range on the subnet
    # to which a BOOTP client is connected which has the dynamic-bootp flag
    # set.
    #host fantasia {
    #  hardware ethernet 08:00:07:26:c0:a5;
    #  fixed-address fantasia.example.com;
    #}
    
    # You can declare a class of clients and then do address allocation
    # based on that.   The example below shows a case where all clients
    # in a certain class get addresses on the 10.17.224/24 subnet, and all
    # other clients get addresses on the 10.0.29/24 subnet.
    
    #class "foo" {
    #  match if substring (option vendor-class-identifier, 0, 4) = "SUNW";
    #}
    
    #shared-network 224-29 {
    #  subnet 10.17.224.0 netmask 255.255.255.0 {
    #    option routers rtr-224.example.org;
    #  }
    #  subnet 10.0.29.0 netmask 255.255.255.0 {
    #    option routers rtr-29.example.org;
    #  }
    #  pool {
    #    allow members of "foo";
    #    range 10.17.224.10 10.17.224.250;
    #  }
    #  pool {
    #    deny members of "foo";
    #    range 10.0.29.10 10.0.29.230;
    #  }
    #}
    
    
    man dhcpd.conf:
    • DHCP Global configuration
    • The basic configuration that we need in order to run a DHCP server:
      • default-lease-time
      • max-lease-time
      • INTERFACESv4="eth0"
      • Defines which interface the DHCP server should use to serve DHCP requests.
      • authoritative
    • Defining the Subnet
    • Each subnet may have its own router.
      
             subnet 204.254.239.0 netmask 255.255.255.224 {
               ( subnet-specific parameters... )
               range 204.254.239.10 204.254.239.30;
             }
      
      
      Note: do not assign fixed addresses that overlap with the pool you've configured in your subnet statement.
    • Defining a group of hosts
    • 
             group {
               ( group-specific parameters... )
               host zappo.test.isc.org {
                 ( host-specific parameters... )
               }
               host beppo.test.isc.org {
                 ( host-specific parameters... )
               }
               host harpo.test.isc.org {
                 ( host-specific parameters... )
               }
             }
      
  • Manage DHCP services
  • To verify that the service is running:
    
    sudo systemctl status isc-dhcp-server.service
    
    To start the DHCP service,
    
    sudo systemctl start isc-dhcp-server.service
    
    To stop the DHCP service,
    
    sudo systemctl stop isc-dhcp-server.service
    
    To restart the DHCP service,
    
    sudo systemctl restart isc-dhcp-server.service
    

#21 Integrate DHCP and DNS with Dynamic DNS Updates

If the DNS and DHCP servers are not in sync, a DHCP lease that hands out a new IP may cause name resolution problems. There are two solutions: statically assign addresses to your hosts, or use a tool (or script one yourself) to perform DNS updates. In more recent versions of DHCP and BIND, both services support a mechanism for performing dynamic DNS updates (defined in RFC 2136).
  • Generating session key
  • The two services will use a key to communicate with each other. The DHCP server uses this key to sign update requests sent to the DNS server, and the DNS server uses it to verify the signed requests from the DHCP server. BIND 9 comes with a utility to generate this key, called dnssec-keygen. (A combined sketch follows this list.)
  • Configuring the BIND Name Server
  • The next step is to configure BIND to allow updates from the DHCP server, using the key you just generated.
  • Configuring the ISC DHCP Server
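A hedged sketch of how the pieces fit together (the key name, secret, zone, and file names are all placeholders, not a tested configuration). Generate the key:

    dnssec-keygen -a HMAC-MD5 -b 128 -n USER DHCP_UPDATER

named.conf (BIND side):

    key DHCP_UPDATER {
      algorithm hmac-md5;
      secret "base64-secret-from-the-generated-key-file";
    };
    zone "example.org" {
      type master;
      file "example.org.db";
      allow-update { key DHCP_UPDATER; };
    };

dhcpd.conf (DHCP side; note that ddns-update-style changes from the earlier "none"):

    key DHCP_UPDATER {
      algorithm hmac-md5;
      secret "base64-secret-from-the-generated-key-file";
    }
    ddns-update-style interim;
    zone example.org. {
      primary 127.0.0.1;
      key DHCP_UPDATER;
    }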

Linux network troubleshooting tools

Cheat sheet

ip

ip - show / manipulate routing, network devices, interfaces and tunnels.

link layer

Network devices. Show their status:

$ ip link show
1: lo: <LOOPBACK> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <NO-CARRIER> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether 5c:26:0a:13:82:cf brd ff:ff:ff:ff:ff:ff
3: wlp2s0: <BROADCAST> mtu 1500 qdisc mq state UP mode DORMANT group default qlen 1000
    link/ether a4:4e:31:a6:78:64 brd ff:ff:ff:ff:ff:ff
		
Bring the interface up,

$ sudo ip link set eno1 up
		
We can use the -s flag with the ip command to print additional statistics about an interface,

$ ip -s link show
1: lo: <LOOPBACK> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    RX: bytes  packets  errors  dropped overrun mcast   
    6259323    66161    0       0       0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    6259323    66161    0       0       0       0       
2: eno1: <NO-CARRIER> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether 5c:26:0a:13:82:cf brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast   
    0          0        0       0       0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    0          0        0       0       0       0       
3: wlp2s0: <BROADCAST> mtu 1500 qdisc mq state UP mode DORMANT group default qlen 1000
    link/ether a4:4e:31:a6:78:64 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast   
    1373619334 1354333  0       0       0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    229249749  830716   0       0       0       0
		
For more advanced Layer 1 troubleshooting, the ethtool and wireshark utilities are excellent options.

data link layer

If your localhost can’t successfully resolve its gateway’s Layer 2 MAC address, then it won’t be able to send any traffic to remote networks.
We can check the entries in our ARP table with the ip neighbor command:

        $ ip neighbor show
192.168.0.1 dev wlp2s0 lladdr c0:a0:bb:ef:3d:d7 REACHABLE
fe80::c2a0:bbff:feef:3dd7 dev wlp2s0 lladdr c0:a0:bb:ef:3d:d7 router REACHABLE
		
If there were a problem with ARP, we would see a resolution failure here.
Linux caches ARP entries for a period of time; you can manually delete an entry, which forces a new ARP discovery process:

$ ip neighbor delete 192.168.0.1 dev wlp2s0
		

network/internet layer


$ ip -br address show
lo               UNKNOWN        127.0.0.1/8 ::1/128 
eno1             DOWN           
wlp2s0           UP             192.168.0.105/24 2001:b011:5003:14ec:2976:d29d:44a9:5cd6/64 2001:b011:5003:14ec:e136:5b5f:ebb4:4e4e/64 fe80::8205:ef53:5b3b:7756/64 
-br prints only basic information in a tabular format for better readability.
This option is currently only supported by ip addr show and ip link show commands.
The lack of an IP address can be caused by a local misconfiguration, such as an incorrect network interface config file, or it can be caused by problems with DHCP.
ping can be an easy way to tell if a host is alive and responding.
The next tool in the Layer 3 troubleshooting tool belt is the traceroute command.
Traceroute will send out one packet at a time, beginning with a TTL of one. Since the packet expires in transit, the upstream router sends back an ICMP Time-to-Live Exceeded packet.
Traceroute then increments the TTL to determine the next hop.
The list of gateways for different routes is stored in a routing table, which can be inspected and manipulated using ip route commands.
We can print the routing table:

$ ip route show
default via 192.168.0.1 dev wlp2s0 proto dhcp metric 600 
169.254.0.0/16 dev wlp2s0 scope link metric 1000 
192.168.0.0/24 dev wlp2s0 proto kernel scope link src 192.168.0.105 metric 600 
We can check the route for a specific prefix:

$ ip route get 10.0.0.0/8
10.0.0.0 via 192.168.0.1 dev wlp2s0 src 192.168.0.105 uid 1000 
    cache 
To delete a default route:

$ sudo route delete default gw 192.168.1.250 eth0
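The same deletion with the ip command itself (same assumed gateway and interface):

$ sudo ip route del default via 192.168.1.250 dev eth0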
The Domain Name System (DNS) translates human-readable names into IP addresses (and back).
A classic sign of DNS trouble is being able to connect to a remote host by IP address but not by its hostname.
Performing a quick nslookup on the hostname can tell us what happened.

transport layer

To find out which process is listening on a port:
  • netstat
  • 
    $ netstat -tulpn 
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name   
    		
  • ss
  • ss is used to dump socket statistics.
    It allows showing information similar to netstat.
    It can display more TCP and state information than other tools.
    When no option is used, ss displays a list of open non-listening sockets (e.g. TCP/UNIX/UDP) that have established connections.
    
    $ ss -tunlp4
    Netid State  Recv-Q  Send-Q    Local Address:Port    Peer Address:Port
    udp  UNCONN 0       0           224.0.0.251:5353         0.0.0.0:*     users:(("chrome",pid=2410,fd=49))  
    udp  UNCONN 0       0           224.0.0.251:5353         0.0.0.0:*     users:(("chrome",pid=2372,fd=322)) 
    udp  UNCONN 0       0           224.0.0.251:5353         0.0.0.0:*     users:(("chrome",pid=2410,fd=120)) 
    udp  UNCONN 0       0               0.0.0.0:5353         0.0.0.0:*                                        
    udp  UNCONN 0       0         127.0.0.53%lo:53           0.0.0.0:*                                        
    udp  UNCONN 0       0               0.0.0.0:68           0.0.0.0:*                                        
    udp  UNCONN 0       0               0.0.0.0:43241        0.0.0.0:*                                        
    udp  UNCONN 0       0               0.0.0.0:631          0.0.0.0:*                                        
    tcp  LISTEN 0       128           127.0.0.1:5939         0.0.0.0:*                                        
    tcp  LISTEN 0       128       127.0.0.53%lo:53           0.0.0.0:*                                        
    tcp  LISTEN 0       5             127.0.0.1:631          0.0.0.0:*                                        
    tcp  LISTEN 0       100           127.0.0.1:25           0.0.0.0:*                                        
    tcp  LISTEN 0       80            127.0.0.1:3306         0.0.0.0:* 
    		
    flags:
    • -t - Show TCP ports.
    • -u - Show UDP ports.
    • -n - Do not try to resolve hostnames.
    • -l - Show only listening ports.
    • -p - Show the processes that are using a particular socket.
    • -4 - Show only IPv4 sockets.
To test whether a remote server is listening,
  • TCP
  • 
    telnet ip port
    		
  • UDP
  • Install netcat:
    
    sudo apt-get install netcat
    		
    Usage:
    
    nc -u ip port
    		
Use nc to test connections:
  • Test on a TCP port
  • Server:
    
    nc -lvnp 1234
            
    Client:
    
    nc -vn 192.168.0.112 1234
            
  • Test on a UDP port
  • Server:
    
    nc -lvnup 1234
            
    Client:
    
    nc -vnu 192.168.0.112 1234
            
  • Perform TCP port scanning against a remote host
  • 
    nc -vnz -w 1 192.168.0.101 20-25
    		
  • Perform UDP port scanning against a remote host
  • 
    nc -vnzu 192.168.40.146 1-65535
    		
  • Stream a video file from a server so a client can watch it with a video player (e.g., mplayer)
  • Server:
    
    cat sample_video.avi | nc -l 1234
    		
    Client:
    
    nc 192.168.0.101 1234 | mplayer -vo x11 -cache 3000 -
    		
However, a much more powerful tool is nmap, which can scan the TCP and UDP ports a remote host is listening on.

$ nmap -v 192.168.122.1

PORT     STATE SERVICE
22/tcp   open  ssh
53/tcp   open  domain
80/tcp   open  http
443/tcp  open  https
8081/tcp open  blackice-icecap

A tcpdump Tutorial with Examples — 50 Ways to Isolate Traffic

Traffic can be dumped in different ways:
  • on an interface
  • 
    tcpdump -i eth0        
            
  • by ip
  • 
    tcpdump host 1.1.1.1        
            
  • by source and/or destination
  • 
    tcpdump src 1.1.1.1
    tcpdump dst 1.0.0.1
            
  • by network
  • 
    tcpdump net 1.2.3.0/24        
            
  • hex output
  • 
    tcpdump -X icmp        
            
  • a specific port
  • 
    tcpdump port 3389
    tcpdump src port 1025        
            
  • protocol
  • 
    tcpdump icmp        
            
  • ip6 traffic
  • 
    tcpdump ip6        
            
  • port ranges
  • 
    tcpdump portrange 21-23        
            
  • packet size
  • 
    tcpdump less 32
    tcpdump greater 64
    tcpdump <= 128        
            
  • reading / writing captures to a file
  •         
    tcpdump -w capture_file  
    tcpdump -r capture_file      
            
  • More options
    • -t
    • Don't print a timestamp on each dump line (repeat it, e.g. -tttt, for other timestamp formats).
    • -v
    • Verbose output (more v’s gives more output).
  • combine options
  • and, or , not
    
    tcpdump -nnvvS src 10.5.2.3 and dst port 3389
        	

Troubleshooting and Performance Tuning

System hangs, Linux kernel panic analysis, filesystem failures.

Linux Server Hacks, Volume Two: Troubleshooting and Performance

#69 Find Resource Hogs with Standard Commands

The first thing to do when debugging a resource problem is to log into the machine and run the top command:

Tasks: 368 total,   1 running, 240 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.7 us,  1.8 sy,  0.0 ni, 96.0 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem :  3960656 total,   490284 free,  2295076 used,  1175296 buff/cache
KiB Swap:  2097148 total,  1409676 free,   687472 used.  1323712 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                               
20446 root      20   0       0      0      0 I   6.0  0.0   0:03.42 kworker/u16:23-                       
 1780 jerry     20   0  567372  45480  32776 S   3.0  1.1  55:06.71 Xorg                                  
 1894 jerry     20   0 3970804 154548  34504 S   2.3  3.9  32:30.89 gnome-shell                           
 8825 jerry     20   0  753108  29704  18288 S   2.3  0.7   0:04.34 gnome-terminal-                       
 1211 mysql     20   0 1489996   4068      0 S   0.3  0.1   0:44.82 mysqld                                
 1932 jerry     20   0  362680   4988   2868 S   0.3  0.1   0:49.15 ibus-daemon                           
 2372 jerry     20   0 1352908 320352 116632 S   0.3  8.1  47:27.69 chrome                                
 4847 jerry     20   0  806508 128588  73200 S   0.3  3.2   0:29.62 chrome                                
12658 jerry     20   0 1272228 364528 122448 S   0.3  9.2   2:53.93 chrome                                
20092 root      20   0       0      0      0 I   0.3  0.0   0:01.78 kworker/0:0-eve                       
20743 jerry     20   0   51764   4412   3528 R   0.3  0.1   0:00.33 top                                   
    1 root      20   0  225852   6188   3684 S   0.0  0.2   0:36.19 systemd 
          
  • Tasks:
  • This shows the total number of tasks or threads, further classified as running, sleeping, stopped, or zombie. Only one process can run at a time on a single CPU; a "runnable" process is either currently running or on a run queue waiting to run. Most processes are in one of the following two states:
    • A process that is on the CPU (a running process with state R)
    • A process that is off the CPU (a not-running process)
      • A process that is not running appears in one of the following states:
      • Runnable state (R)
      • The scheduler keeps the process in the run queue (the list of ready-to-run processes maintained by the kernel). When the CPU is available, the process enters the Running state.
      • Sleeping state
      • A process enters a sleeping state when it needs a resource that is not currently available. When the resource the process is waiting on becomes available, the process is woken up, and the next time the scheduler gets a chance to schedule it, the scheduler puts the process in either the Running or the Runnable state. There are two types of sleep states:
        • Interruptible sleep state (S)
        • An interruptible sleep means the process is waiting either for a particular time slot or for a particular event to occur.
        • Uninterruptible sleep state (D)
        • An uninterruptible sleep is one that won't handle a signal right away; it wakes only as a result of a waited-upon resource becoming available or after a time-out occurs during that wait. The uninterruptible state is mostly used by device drivers waiting for disk or network I/O.
      • Defunct or Zombie state (Z)
      • Between the time when the process terminates and the time the parent reaps it, the child is in what is referred to as the Zombie state. The reason you cannot kill a Zombie process is that you cannot send it a signal: the process no longer exists.
    A common life cycle of a process is as follows:
    1. Born or forked
    2. Ready to run or runnable
    3. Running in user space or running in kernel space
    4. Blocked, Waiting, Sleeping, in an Interruptible sleep, or in an Uninterruptible sleep
    5. The process is sleeping, but it is present in main memory
    6. The process is sleeping, but it is present in secondary memory storage (swap space on disk)
    7. Terminated or stopped
    8. T = stopped by job control signal
      t = stopped by debugger during trace
      
  • %Cpu(s)
  • When a user initiates a process, the process starts working in user mode. When the kernel starts serving requests from a user-level process, the process enters kernel space. The top command's Cpu line shows the overall percentage of CPU time spent in user mode (us) and in system mode (sy); it shows CPU state percentages based on the interval since the last refresh. The labels are listed below (those for recent kernel versions first):
    • us, user : time running un-niced user processes
    • sy, system : time running kernel processes
    • ni, nice : time running niced user processes
    • id, idle : time spent in the kernel idle handler
    • If id is low, the CPU is working hard and doesn't have much excess capacity.
    • wa, IO-wait : time waiting for I/O completion
    • If wa is high, the CPU is ready to run, but is waiting on I/O access to complete (like fetching rows from a database table stored on the disk).
    • hi : time spent servicing hardware interrupts
    • si : time spent servicing software interrupts
    • st : time stolen from this vm by the hypervisor
    • Steal time is the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor; it is the amount of CPU time the guest virtual machine needs but is not given by the host. The VM kernel gets the steal metric from the hypervisor, but the hypervisor doesn't say which processes it is running; it just says: "I'm busy, and can't allocate any time to you." Large amounts of steal time indicate CPU contention, which can reduce a guest VM's performance. To relieve CPU contention, increase the guest VM's CPU priority or CPU quota, or run fewer guest VMs on the host. This data is supported by Xen and KVM virtual environments. A general rule of thumb: if steal time is greater than 10% for 20 minutes, the VM is likely running slower than it should. When this happens:
      1. Shut down the instance and move it to another physical server
      2. If steal time remains high, increase the CPU resources
      3. If steal time remains high, contact your hosting provider. Your host may be overselling physical servers.

    Test to watch the steal value change:

    1. start KVM
    2. 
      $ virsh start ubuntu18.04
      
    3. Try logging into the VM from the host
    4. 
      $ ssh jerry@192.168.122.145
      
      Run top then monitor the steal value:
      
      %Cpu(s):  0.3 us,  3.0 sy, 96.4 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.3 st
      
        PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                     
       1934 root      39  19  226624 115428  71280 R 99.7  5.7   8:02.91 unattended-upgr  
      
    5. Host status
      %Cpu(s): 26.2 us,  0.8 sy,  0.0 ni, 71.9 id,  0.7 wa,  0.0 hi,  0.3 si,  0.0 st
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                     
       3522 libvirt+  20   0 4802444 1.899g   4200 S 100.3 50.3  12:05.82 qemu-system-x86 
      
  • KiB Mem :
  • Status of physical memory: total, free, used and buff/cache. "used" includes all memory allocated by system processes plus other categories of memory: "buffers" and "cache". The Linux kernel attempts to use otherwise-unused memory to improve performance: disk data is cached in the "page cache", and buff/cache is the size of that cache. Your real used-memory fraction is therefore:
    
           ( used - buff/cache ) / total
    
    
  • KiB Swap:
  • Linux divides its physical RAM (random access memory) into chunks of memory called pages. Swapping is the process whereby a page of memory is copied to a preconfigured space on the hard disk, called swap space, to free up that page of memory. This line is mostly about virtual memory: total, free, used and avail (which is physical memory). The avail number is an estimate of the physical memory available for starting new applications without swapping.
Listed below are top's fields (columns) for available processes.
  • %CPU -- CPU Usage
  • The task's share of the elapsed CPU time since the last screen update. Press Shift+P to sort processes by CPU utilization.
  • %MEM
  • Memory Usage, simply RES divided by total physical memory
  • RES
  • Resident Memory Size (KiB). A subset of the virtual address space (VIRT) representing the non-swapped physical memory a task is currently using. It is also the sum of the RSan, RSfd and RSsh fields.
    • RSlk
    • Resident Locked Memory Size (KiB). A subset of resident memory (RES) which cannot be swapped out.
    • RSan
    • Resident Anonymous Memory Size (KiB). A subset of resident memory (RES) representing private pages not mapped to a file.
    • RSfd
    • Resident File-Backed Memory Size (KiB). A subset of resident memory (RES) representing the implicitly shared pages supporting program images and shared libraries. It also includes explicit file mappings, both private and shared.
    • RSsh
    • Resident Shared Memory Size (KiB). A subset of resident memory (RES) representing the explicitly shared anonymous shm*/mmap pages.
    • SHR
    • Shared Memory Size (KiB). A subset of resident memory (RES) that may be used by other processes. It will include shared anonymous pages and shared file-backed pages. It also includes private pages mapped to files representing program images and shared libraries.
  • SWAP
  • Swapped Size (KiB). The formerly resident portion of a task's address space written to the swap file when physical memory becomes over committed.
  • USED
  • Memory in Use (KiB). This field represents the non-swapped physical memory a task is using (RES) plus the swapped out portion of its address space (SWAP).
  • VIRT
  • Virtual Memory Size (KiB). The total amount of virtual memory used by the task. It includes everything in-use and/or reserved: all code, data and shared libraries plus pages that have been swapped out and pages that have been mapped but not used.
  • COMMAND -- Command Name or Command Line
  • Press 'c' in a running top to toggle between the program name and the full command line (absolute path) of each process.
  • PR
  • The kernel schedules tasks based on their priority, shown in the PR field. In the Linux kernel, priorities range from 0 to 139: 0 to 99 are real-time priorities and 100 to 139 are for user tasks.
  • NI
  • The nice value is a user-space priority hint, while PR is the actual priority used by the Linux kernel. The nice range is -20 to +19: -20 is the highest priority, 0 the default and +19 the lowest. For normal tasks, the relation between the nice value and the displayed priority is:
    PR = 20 + NI
    
    so PR values [0 - 39] map onto kernel priorities 100 to 139. To re-nice a process, press 'r'; top will then ask for the input:
    
    PID to renice [default pid = 20998]
    
    To change the priority of everything owned by user jerry (+19 is the lowest priority renice will set):
    
    renice 19 -u jerry
    
  • Display Specific User Process
  • 
    $ top -u jerry
    
    
  • Highlight Running Process in Top
  • Press 'z' in a running top to toggle color output; running processes are then highlighted (red by default).
  • Set ‘Screen Refresh Interval’ in Top
  • By default the screen refresh interval is 3.0 seconds; press 'd' in a running top to change it.
  • Kill running process in Top
  • You can kill a process by pressing 'k' in a running top and entering the PID of the process.
  • SORTING of task window
  • Shift + command:
    
                    command   sorted-field                  supported
                    A         start time (non-display)      No
                    M         %MEM                          Yes
                    N         PID                           Yes
                    P         %CPU                          Yes
                    T         TIME+                         Yes
    
  • Get a snapshot of Top (a combined batch-mode example follows this list)
  • 
    top -n 1 -b
    
  • Monitor only processes with specified process IDs
  • 
    $ top -p2100
    
    
  • Monitor memory in different mode
  • Press 'm' in a running top to cycle the memory summary lines through their special 4-way toggle (different display modes).
  • Add more fields
  • Press 'f' to open the list of fields, navigate to the field you would like, and press the space bar to select it.
  • Display threads
  • Press 'H' to show all threads:
    
    Threads: 1061 total,   1 running, 1007 sleeping,   0 stopped,   0 zombie
    
  • View the per-core load on a multi-core CPU
  • Press '1' to display the load of each individual core:
    
    %Cpu0  : 16.8 us,  2.3 sy,  0.0 ni, 79.2 id,  0.7 wa,  0.0 hi,  1.0 si,  0.0 st
    %Cpu1  : 12.8 us,  4.0 sy,  0.0 ni, 80.5 id,  1.3 wa,  0.0 hi,  1.3 si,  0.0 st
    %Cpu2  : 16.9 us,  4.7 sy,  0.0 ni, 76.7 id,  1.7 wa,  0.0 hi,  0.0 si,  0.0 st
    %Cpu3  : 18.9 us,  3.0 sy,  0.0 ni, 78.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem :  3960624 total,   447004 free,  2951536 used,   562084 buff/cache
    KiB Swap:  2097148 total,   882012 free,  1215136 used.   672356 avail Mem 
    
    
  • Show which core a process is using
  • In the top screen, press 'f' to enter "Fields Management for window", move the cursor to "P = Last Used Cpu (SMP)" and press the space bar to select it; you can then see which core each thread was last executed on:
    
      PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                P 
    32027 jerry     20   0 1262872 284452  76560 S 11.4  7.2   0:02.23 ThreadPoolForeg        2
    32146 jerry     20   0 1262872 284452  76560 S 10.4  7.2   0:00.86 ThreadPoolForeg        2
    32096 jerry     20   0 1262872 284452  76560 S 10.1  7.2   0:01.06 ThreadPoolForeg        1     
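
Several of the toggles and switches above combine well for scripting. A minimal sketch (assuming procps-ng top and an existing user jerry) that takes three batch-mode samples five seconds apart, sorted by CPU usage, keeping the first screenful:

$ top -b -n 3 -d 5 -u jerry -o %CPU | head -20

The -o switch sets the sort field up front, since interactive keys such as Shift+P are unavailable in batch mode.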
    
    

The next tool is vmstat. vmstat reports information about processes, memory, paging, block IO, traps, disks and cpu activity.


$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 915712 1730676  58020 577300    9   71   217   129 2001 1152 11  3 76 10  0

By default, vmstat produces a single report. You can supply a delay (in seconds) after which the output is updated, and an optional count of reports (see the example below).
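For example, to print five reports two seconds apart (the first line always shows averages since boot):

$ vmstat 2 5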
Analysis:
  • Swap
    • si: Amount of memory swapped in from disk (/s).
    • so: Amount of memory swapped to disk (/s).
  • IO
    • bi: Blocks received from a block device (blocks/s).
    • bo: Blocks sent to a block device (blocks/s).
  • System
    • in: The number of interrupts per second, including the clock.
    • cs: The number of context switches per second.
How do you identify which process is doing heavy IO in Linux?
Virtual memory is a memory management technique used by Linux that combines active RAM and inactive memory on the disk drive (hard disk/SSD) to form a large range of contiguous addresses.
A page fault occurs when a process accesses a page that is mapped into its virtual address space but not loaded in physical memory. The CPU's memory management unit raises an interrupt, and the routine that handles it is normally part of the operating system. If the OS decides the access is valid, it tries to page the relevant data in from the virtual-memory file on disk; if the access is not permitted, the OS usually terminates the process.
A major fault requires disk access; a minor fault does not (the page is already in memory, it just isn't yet mapped for that process).
You can get per-process page-fault statistics from ps or top.
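For example, a short sketch (1234 is a placeholder PID; pidstat comes from the sysstat package):

$ ps -o pid,min_flt,maj_flt,comm -p 1234   # minor/major page-fault counters
$ pidstat -d 2                             # per-process disk I/O, sampled every 2 s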

Disks

You can use pam_limits or the ulimit utility to keep users from going overboard after they log in to the system.
The df -h command shows disk usage/free statistics for all mounted filesystems. To find the identity of the disk hog under a directory (the largest consumers are listed last):

  du -s /home/* | sort -n
iostat - Report CPU statistics and IO statistics for devices and partitions.
When the command "iostat" is run without arguments, it generates a single report covering the time since the system booted; with an interval, each subsequent report covers the period since the previous report.
You can provide two optional parameters to change this (an example follows the parameter list):

  iostat [option] [interval] [count]

  • interval
  • parameter specifies the duration of time in seconds between each report
  • count
  • parameter allows you to specify the number of reports that are generated before iostat exits.
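For example, to print three reports two seconds apart (the first covers the period since boot):

  iostat 2 3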
The iostat command generates two types of reports,
  • the CPU Utilization report
  • This is similar to top.
  • the Device Utilization report
  • The device report provides statistics on a per physical device or partition basis.
    
    $ iostat -d
    Linux 5.0.0-37-generic (jerry-Latitude-E6410)  廿廿年一月四日  _x86_64_ (4 CPU)
    
    Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              12.65       208.85       180.56    3623705    3132796
    
    
    • Device
    • This column gives the device (or partition) name as listed in the /dev directory.
    • tps
    • Indicate the number of transfers per second that were issued to the device.
    • kB_read/s
    • the amount of data read from the device
    • kB_wrtn/s
    • the amount of data written to the device
    • kB_read
    • The total number of kilobytes read.
    • kB_wrtn
    • The total number of kilobytes written.

Bandwidth

The command lsof lists open files, including socket files; in the listing below, -P keeps port numbers numeric and the grep keeps only the IPv4/IPv6 entries.
$ lsof -U -Pi | grep IP
COMMAND     PID  USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME

chrome     2372 jerry   90u  IPv4            6179770      0t0     UDP jerry-Latitude-E6410:59666 
chrome     2372 jerry  205u  IPv4            5864118      0t0     UDP 224.0.0.251:5353 
chrome     2410 jerry   30u  IPv6            5861252      0t0     TCP 2001-b011-5003-14ec-2976-d29d-44a9-5cd6.dynamic-ip6.hinet.net:32836->th-in-xbc.1e100.net:5228 (ESTABLISHED)
chrome     2410 jerry   33u  IPv6            6180782      0t0     UDP 2001-b011-5003-14ec-2976-d29d-44a9-5cd6.dynamic-ip6.hinet.net:37080->tsa03s06-in-x0e.1e100.net:443 
chrome     2410 jerry   42u  IPv4            5865030      0t0     UDP 224.0.0.251:5353 
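
lsof can also filter directly with -i, which helps track bandwidth consumers down to a port or a peer; a couple of illustrative invocations:

$ sudo lsof -i TCP:443            # TCP connections involving port 443
$ sudo lsof -i @192.168.122.145   # connections to or from a given host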

#70 Reduce Restart Times with Journaling Filesystems

Computer systems can only successfully mount and use filesystems if they can be sure that all of the data structures in each filesystem are consistent. "consistency" means that:
  • all of the disk blocks that are actually used in some file or directory are marked as being in use
  • all deleted blocks aren’t linked to anything other than the list of free blocks
  • all directories in the filesystem actually have parent directories
  • ...
Each filesystem has its own version of fsck that understands and “does the right thing” for that particular filesystem. When filesystems are mounted, they are marked as being in use (“dirty”). When a system is shut down normally, all of its on-disk filesystems are marked as consistent (“clean”) when they are unmounted. When the system reboots, filesystems marked clean do not have to be checked before they are mounted, which saves a lot of time in the boot process. A journaling filesystem maintains a special file called a journal that is used to repair any inconsistencies caused by an improper shutdown of a computer. Journaling filesystems write metadata (i.e., data about files and directories) into the journal, which is flushed to disk before each command returns. If a given set of updates has not been fully committed to the filesystem, the system replays the journal to bring the filesystem back to its most recent consistent state. The most commonly used journaling filesystem for Linux has been the third extended filesystem (ext3); the newest is ext4.
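To check whether an existing ext filesystem carries a journal, one option (with /dev/sda1 as a placeholder partition) is:

$ sudo tune2fs -l /dev/sda1 | grep -o has_journal   # prints "has_journal" if the journal feature is enabled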

#71 Optimize Your System with sysctl

The files under /proc/sys/ are often collectively referred to as the sysctl interface, because they can be written to, and changes made to the files are picked up by the running kernel without rebooting. sysctl is also a command that allows administrators to easily configure kernel parameters at runtime. To display all currently available values:
$ sysctl -a
This returns many “key=value”-formatted records. The keys on the left are dotted representations of file paths under /proc/sys. For example, the setting for net.ipv4.ip_forward can be found in /proc/sys/net/ipv4/ip_forward. You can specify the key you want as an argument to sysctl:
  • read
  • $ sysctl net.ipv4.ip_forward
    net.ipv4.ip_forward = 0
    
  • write
  • $ sysctl -w net.ipv4.ip_forward=1
    net.ipv4.ip_forward = 1
    
If you want to make a change permanent, put your custom settings into the /etc/sysctl.conf file, as sketched below.
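For example (a sketch reusing the ip_forward key from above):

$ echo "net.ipv4.ip_forward = 1" | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p    # reload the settings from /etc/sysctl.conf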

#74 Profile Your Systems Using /proc

#!/bin/bash
echo ""
echo "#########BASIC SYSTEM INFORMATION########"
echo HOSTNAME: `cat /proc/sys/kernel/hostname`
echo DOMAIN: `cat /proc/sys/kernel/domainname`
echo KERNEL: `uname -r`
top -b | head -8
  • "top -b" starts top in Batch mode, which could be useful for sending output from top to other programs or to a file.
  • "head -8" prints the first 8 lines of each FILE to standard output.
echo "######## FILESYSTEM INFORMATION #########"
echo ""
echo "SUPPORTED FILESYSTEM TYPES:"
echo ----------------------
echo `cat /proc/filesystems | awk -F'\t' '{print $2}'`
echo ""
echo "MOUNTED FILESYSTEMS:"
echo ----------------------
cat /proc/mounts

#75 Kill Processes the Right Way

First use the ps -ef command to determine the process ID, then simply type:
$ kill <pid>
A "zombie process" is a child process that has "died" but has not yet been "reaped" by its parent (the parent must call wait() to read the child's exit status). Unlike normal processes, the kill command has no effect on a zombie process. When a child process terminates, the kernel keeps some information about it in the process table (including its exit status); the parent needs to read that exit status before the kernel removes the child's entry from the table. When a process is fully dead, all resources associated with it are deallocated so that they can be reused by other processes. The problem caused by zombie processes is that there is only one process table per system, and it has a limited number of unique process identifiers (PIDs); if too many entries accumulate in this table, the system cannot create new processes. To clear zombies, get the parent to reap them, or terminate the parent process so that init adopts and reaps the zombies. By looking at the column labeled PPID (parent process ID) in the output of "ps -ef", you can find the children of a given PID. Alternatively, you can attempt to signal all the processes within the same process group using killall.
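A short sketch for spotting zombies and their parents (<ppid> is a placeholder):

$ ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'   # list zombies; their stat field starts with Z
$ kill <ppid>   # if the parent never calls wait(), terminating it lets init adopt and reap the zombies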

Logfiles and Monitoring

#78 Avoid Catastrophic Disk Failure

ATA and SCSI drives have supported a standard mechanism for disk diagnostics called “Self Monitoring, Analysis, and Reporting Technology” (SMART), aimed at predicting hard drive failures. The smartmontools project (http://smartmontools.sourceforge.net) produces a SMART monitoring daemon called smartd and a command-line utility called smartctl, which can do most things on demand that the daemon does in the background periodically. To find the information of a hard drive:

$ sudo smartctl -i /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.0.0-37-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Black
Device Model:     WDC WD2500BEKT-75A25T0
Serial Number:    WD-WXQ1A80V7620
LU WWN Device Id: 5 0014ee 655d0b10e
Firmware Version: 01.01A01
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Dec  9 18:43:42 2019 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

To ask the drive about its overall health:

$ sudo smartctl -H /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.0.0-37-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

To get all information,

$ sudo smartctl -a /dev/sda
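
smartctl can also run the drive's built-in self-tests on demand; a typical short-test cycle:

$ sudo smartctl -t short /dev/sda     # start a short (~2 minute) self-test in the background
$ sudo smartctl -l selftest /dev/sda  # afterwards, display the self-test log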

#79 Monitor Network Traffic with MRTG

The Multi-Router Traffic Grapher provides a quick visual snapshot of network traffic, making it easy to find and resolve congestion. Each time you run MRTG, you’ll need to specify the location from which you want it to read the config file.

#86 Fine-Tune the syslog Daemon


dmesg is used to examine or control the kernel ring buffer. The default action is to display all messages from the kernel ring buffer.
klogd reads kernel messages either from /proc/kmsg or via the sys_syslog system call, and provides the kernel log data stream to another daemon called syslogd.

A system daemon, syslogd, listens for messages on a Unix domain socket named /dev/log. Based on classification information in the messages and its configuration file (usually /etc/syslog.conf), syslogd routes them in various ways. Some of the popular routings are:
  • Write to the system console
  • Mail to a specific user
  • Write to a log file (e.g. /var/log/messages)
  • Pass to another daemon
  • Discard
syslogd can handle messages from the kernel itself. But the kernel doesn't write to /dev/log; rather, another daemon (sometimes called "klogd") extracts messages from the kernel and passes them on to syslogd as any other process would.

Therefore, both streams of log information are directed to the same log file, /var/log/messages.
dmesg is most useful for capturing boot-time messages emitted before syslogd has started.

rsyslogd is derived from the sysklogd package. Support of both internet and unix domain sockets enables this utility to support both local and remote logging.


USING Security Enhanced Linux (SELinux)


Chapter 1. Getting started with SELinux


Discretionary access control (DAC) is a means of restricting access to objects based on the identity of subjects and/or groups to which they belong.
Mandatory access control (MAC) constrains the ability of a subject (or initiator) to access or generally perform some sort of operation on an object (or target).
In practice,
  • a subject is usually a process or thread;
  • objects are constructs such as files, directories, TCP/UDP ports, shared memory segments, IO devices, etc.
Any operation by any subject on any object is tested by the kernel against the set of authorization rules (aka policy) to determine if the operation is allowed.
For example: May a web server access files in users' home directories?
This enables system administrators to create comprehensive and fine-grained security policies, such as restricting specific applications to only viewing log files.
The main difference between MAC and DAC:
  • With mandatory access control (MAC)
  • The security policy is centrally controlled by a security policy administrator; users cannot override the policy, for example by granting access to files that would otherwise be restricted. These are organization-wide security policies: security administrators define a central policy that is guaranteed (in principle) to be enforced for all users.
  • with discretionary access control (DAC)
  • This allows users to make policy decisions and/or assign security attributes.

SELinux implements MAC.
Every process and system resource has a special security label called an SELinux label/context.

NOTE: Remember that SELinux policy rules are checked after DAC rules. SELinux policy rules are not used if DAC rules deny access first.

1.2. Benefits of running SELinux


SELinux provides the following benefits:
  • All processes and files are labeled.
  • Fine-grained access control.
  • SELinux policy is administratively-defined and enforced system-wide.

1.4. SELinux architecture and packages


Linux Security Modules (LSM) is a framework(interface) that allows the Linux kernel to support a variety of computer security models. LSM is built into the Linux kernel. SELinux and AppArmor are implementations of LSM. Both SELinux and AppArmor provide a set of tools to isolate applications from each other to protect the host system from being compromised.

Only a single LSM is allowed to be operational at a time.

The SELinux subsystem in the kernel is driven by a security policy which is controlled by the administrator and loaded at boot. All security-relevant, kernel-level access operations on the system are intercepted by SELinux and examined in the context of the loaded security policy.

By default, Ubuntu uses AppArmor rather than SELinux; the two are similar in performance, but AppArmor is popular for its simplicity.

AppArmor has to be disabled prior to installing SELinux to avoid any conflicts. Use the following instructions to disable AppArmor:

$ sudo systemctl status apparmor
● apparmor.service - AppArmor initialization
   Loaded: loaded (/lib/systemd/system/apparmor.service; enabled; vendor preset: enabled)
   Active: active (exited) since Sun 2020-01-12 10:17:46 CST; 29min ago
     Docs: man:apparmor(7)
           http://wiki.apparmor.net/
  Process: 477 ExecStart=/etc/init.d/apparmor start (code=exited, status=0/SUCCESS)
 Main PID: 477 (code=exited, status=0/SUCCESS)

 一  12 10:17:54 jerry-Latitude-E6410 apparmor[477]:  * Starting AppArmor profiles
 一  12 10:17:54 jerry-Latitude-E6410 apparmor[477]: Skipping profile in /etc/apparmor.d/disable: usr
 一  12 10:17:54 jerry-Latitude-E6410 apparmor[477]: Skipping profile in /etc/apparmor.d/disable: usr
 一  12 10:17:54 jerry-Latitude-E6410 apparmor[477]:    ...done.
 一  12 10:17:34 jerry-Latitude-E6410 systemd[1]: Starting AppArmor initialization...
 一  12 10:17:46 jerry-Latitude-E6410 systemd[1]: Started AppArmor initialization.

$ sudo systemctl stop apparmor

$ sudo systemctl disable apparmor


Note: SELinux is not well tested on Ubuntu and is not recommended there. Try a Red Hat-based distribution instead.

Because SELinux is implemented in the kernel, applications do not need to be specially written or rewritten to take advantage of it. Of course, if a program specifically checks for the SELinux error codes mentioned later, it may behave more gracefully. If SELinux blocks an action, it reports a standard (or at least conventional) "access denied"-type error to the application. However, many applications do not test the error codes returned by system calls, so they may not print any message explaining the problem, or may print a misleading one.

1.5. SELinux states and modes


SELinux can run in one of three modes:
  • Enforcing mode
  • It enforces the loaded security policy on the entire system.
  • Permissive mode
  • The security policy is loaded and access checks are performed, but operations are not actually denied; SELinux only emits access-denial entries in the logs.
  • Disabled mode
  • The policy is not enforced and persistent objects (such as files) are not labeled. This makes it difficult to enable SELinux in the future.
Use the setenforce utility to change between enforcing and permissive mode and getenforce to get the setting.

# getenforce
Enforcing

# setenforce 0
# getenforce
Permissive

# setenforce 1
# getenforce
Enforcing

To get the status of a system running SELinux:

$ sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   permissive
Mode from config file:          permissive
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Memory protection checking:     actual (secure)
Max kernel policy version:      31


Chapter 2. Changing SELinux states and modes


When enabling SELinux on systems that previously had it disabled, follow this procedure to avoid problems:
  • Enable SELinux in permissive mode.
  • Reboot your system.
  • Check for SELinux denial messages.
  • If there are no denials, switch to enforcing mode. For more information, see Changing to enforcing mode.

2.2.1. Changing to permissive mode


  • Modify /etc/selinux/config
  • SELINUX=permissive
  • Reboot the system

2.2.2. Changing to enforcing mode


  • Modify /etc/selinux/config
  • SELINUX=enforcing
  • Reboot the system

2.3. Disabling SELinux


  • Modify /etc/selinux/config
  • SELINUX=disabled
  • Reboot the system

2.4. Changing SELinux modes at boot time


On boot, you can set several kernel parameters to change the way SELinux runs:
  • enforcing=0
  • Cause the machine to boot in permissive mode. Using permissive mode might be the only option to detect a problem if your file system is too corrupted.
  • selinux=0
  • Cause the kernel to not load any part of the SELinux infrastructure.
  • autorelabel=1
  • Force the system to relabel.

Chapter 3. Configuring SELinux for applications and services with non-standard configurations


3.1 Customizing the SELinux policy for the Apache HTTP server in a non-standard configuration


Prerequisites: the Apache HTTP server is configured to listen on TCP port 3131. Modify /etc/httpd/conf/httpd.conf:

Listen 3131


Procedure:
  • Start the httpd service and check the status
  • 
    $ systemctl start httpd
    $ systemctl status httpd
    ...
       Status: "Running, listening on: port 3131"
    ...
    
    By default, SELinux writes its log to /var/log/audit/audit.log via the Linux Auditing System daemon 'auditd'. SELinux log entries are tagged with the keyword AVC, so programs such as grep can easily filter them from other messages. Check the log:
    
    $ journalctl | grep avc
     一  14 16:02:32 localhost.localdomain audit[2398]: AVC avc:  denied  { name_bind } for  pid=2398 comm="httpd" src=3131 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:unreserved_port_t:s0 tclass=tcp_socket permissive=1
    
    
  • Re-label
  • When enabling SELinux after it has been disabled, the whole filesystem needs to be relabeled. To relabel the entire filesystem automatically after the next reboot, run:
    
    # touch /.autorelabel
    # reboot 
    

semanage — SELinux Policy Management tool
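
For the port-3131 AVC denial shown above, the usual semanage fix is to add the port to the http_port_t type so httpd may bind to it (a sketch; semanage ships with the policycoreutils tools):

$ sudo semanage port -a -t http_port_t -p tcp 3131
$ sudo semanage port -l | grep http_port_t    # verify the new mapping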



Non-uniform memory access (NUMA)


Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing.
Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors).

Introduction


Hyper-Threading


A single physical CPU core with hyper-threading appears as two logical CPUs to an operating system.
Hyper-threading allows the two logical CPU cores to share physical execution resources. This can speed things up somewhat — if one virtual CPU is stalled and waiting, the other virtual CPU can borrow its execution resources.

Multiple Cores


A dual-core CPU has two central processing units, so it appears to the operating system as two CPUs. A CPU with two cores can run two different processes at the same time. Thus a single CPU socket holding one quad-core physical CPU presents the operating system with four CPUs.

Multiple CPUs


Adding additional CPUs to a computer requires a motherboard with multiple CPU sockets and additional hardware to connect those CPU sockets to the RAM and other resources.
Most multi-processor computers are considered Symmetric Multi-Processors(SMP) as each processor is equal and has equal access to all system resources (e.g., memory and I/O busses).
As SMP systems have increased their processor count, the system bus has increasingly become a bottleneck.

CPU Affinity


The ability in Linux to bind one or more processes to one or more processors is called CPU affinity.
The idea is to say “always run this process on processor one” or “run these processes on all processors but processor zero”.
CPU affinity is a scheduler property that "bonds" a process to a given set of CPUs on the system. The Linux scheduler will honor the given CPU affinity and the process will not run on any other CPUs.
taskset is used to set or retrieve the CPU affinity of a running process given its pid, or to launch a new command with a given CPU affinity.
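
A few illustrative taskset invocations (./worker and 1234 are placeholders):

$ taskset -c 0,2 ./worker    # launch ./worker pinned to CPUs 0 and 2
$ taskset -cp 1234           # show the CPU affinity of PID 1234
$ sudo taskset -cp 0-1 1234  # restrict PID 1234 to CPUs 0 and 1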

NUMA


Historically, all memory on AMD64 and Intel 64 systems was equally accessible by all CPUs. Known as Uniform Memory Access (UMA), access times were the same no matter which CPU performed the operation.
Non-Uniform Memory Access (NUMA) refers to multiprocessor systems whose memory is divided into multiple memory nodes. The access time of a memory node depends on the relative locations of the accessing CPU and the accessed node.
Thus, there are multiple physical regions of memory, but all memory is tied together into a single cache-coherent physical address space. The resulting system has the property such that for any given region of physical memory, some processors are closer to it than other processors. Conversely, for any processor, some memory is considered local (i.e., it is close to the processor) and other memory is remote.

To maximize performance on a NUMA platform, Linux must take into account the way the system resources are physically laid out. This includes information such as which CPUs are on which node, which range of physical memory is on each node, and what node an I/O bus is connected to. This type of information describes the topology of the system.


NUMA divides the system into several nodes; each processor and memory bank belongs to some node. When a processor accesses memory on its own node, access is fast; accessing memory on another node requires data transfer between nodes and takes more time.


Linux NUMA Support


To improve memory-access efficiency, the operating system sets its memory allocation policy according to the hardware's NUMA configuration, and provides NUMA-related APIs for querying the system's NUMA topology and changing the allocation policy.

Linux manages memory in zones. How Linux has arranged memory can be determined by looking at /proc/zoneinfo.
On boot-up, Linux will detect the organization of memory and then create zones that map to the NUMA nodes and DMA areas as needed.

NUMA Memory Allocation Policies


How memory is allocated from the nodes in a system is determined by a memory policy.
The most important memory policies are:
  • interleave
  • During boot up, the system default policy will be set to interleave allocations across all nodes with “sufficient” memory, so as not to overload the initial boot node with boot-time allocations. Allocation occurs round-robin. First a page will be allocated from node 0, then from node 1, then again from node 0, etc. Interleaving is used to distribute memory accesses for structures that may be accessed from multiple processors in the system in order to have an even load on the interconnect and the memory of each node.
  • local allocation
  • When the system is “up and running”, when the first userspace process (init daemon) is started, the system default policy will be changed to “local allocation”. The allocation occurs from the memory node local to where the code is currently executing.

Basic Operations On Process Startup


The main tool used to set up the NUMA execution environment for a process is numactl.
numactl controls NUMA policy for processes or shared memory. It is possible to restrict processes to a set of processors, as well as to a set of memory nodes.

The hardware NUMA configuration of a system can be viewed by using

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 3867 MB
node 0 free: 931 MB
node distances:
node   0 
  0:  10 
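
To apply a policy when launching a program, illustrative numactl invocations (./app is a placeholder) are:

$ numactl --cpunodebind=0 --membind=0 ./app   # run on node 0's CPUs, allocate only from node 0
$ numactl --interleave=all ./app              # interleave allocations across all nodes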


numastat displays per-node NUMA hit and miss system statistics from the kernel memory allocator.

$ numastat
                           node0
numa_hit               103812126
numa_miss                      0
numa_foreign                   0
interleave_hit             33350
local_node             103812126
other_node                     0
  • numa_hit
  • is memory successfully allocated on this node as intended.
  • numa_miss
  • is memory allocated on this node despite the process preferring some different node. Each numa_miss has a numa_foreign on another node.
  • numa_foreign
  • is memory intended for this node, but actually allocated on some different node. Each numa_foreign has a numa_miss on another node.
  • interleave_hit
  • is interleaved memory successfully allocated on this node as intended.
  • local_node
  • is memory allocated on this node while a process was running on it.
  • other_node
  • is memory allocated on this node while a process was running on some other node.

The information about a process's NUMA memory policy and allocation can be displayed via /proc/[pid]/numa_maps. For example:

$ sudo cat /proc/2907/numa_maps 
3473fdedd000 default
3473fdede000 default anon=14265 dirty=14265 active=6585 N0=14265 kernelpagesize_kB=4
...
7fb1ad2ca000 default file=/lib/x86_64-linux-gnu/ld-2.27.so anon=1 dirty=1 active=0 N0=1 kernelpagesize_kB=4
7fb1ad2cb000 default anon=1 dirty=1 active=0 N0=1 kernelpagesize_kB=4
7ffc1dbb3000 default stack anon=29 dirty=29 active=20 N0=29 kernelpagesize_kB=4
7ffc1dbe2000 default
7ffc1dbe5000 default

Each line contains information about a memory range used by the process.
  • The first field of each line shows the starting address of the memory range.
  • This field allows a correlation with the contents of the /proc/[pid]/maps file, which contains the end address of the range and other information, such as the access permissions and sharing.
  • The second field shows the memory policy currently in effect for the memory range.
  • anon=[pages]
  • The number of anonymous pages in the range.
  • stack
  • Memory range is used for the stack.
  • file=[filename]
  • The file backing the memory range.
  • dirty=[pages]
  • Number of dirty pages.
  • N[node]=[nr_pages]
  • The number of pages allocated on [node].

Create a Linux Swap File

When a Linux system runs out of RAM, inactive pages are moved from the RAM to the swap space.
Swap space can take the form of either a dedicated swap partition or a swap file.

How to add Swap File

  • Create a file that will be used for swap
  • To create a 1GB swap file
    
    sudo dd if=/dev/zero of=/swapfile bs=1024 count=1048576
    
    This initializes the swap file with zeros. (A faster allocation alternative appears after this list.)
  • Only the root user should be able to write and read the swap file.
  • 
    sudo chmod 600 /swapfile
    
  • Use the mkswap utility to set up the file as Linux swap area
  • 
    sudo mkswap /swapfile
    
  • Enable the swap with the following command
  • 
    sudo swapon /swapfile
    
  • To make the change permanent, add this entry to /etc/fstab
  • 
    /swapfile swap swap defaults 0 0
    
  • To verify that the swap is active
  • 
    sudo swapon --show
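    
    As noted above, a quicker way to allocate the file, plus a final check (fallocate-created swap files are not supported on every filesystem; e.g. btrfs needs special handling):
    
    sudo fallocate -l 1G /swapfile   # alternative to the dd step
    free -h                          # the Swap line should now include the new space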
    

Cpusets on Linux


在個人使用的環境下,想要調效多核心環境的計算效能,Linux核心提供的CPUSET功能會是一個簡單而方便的操作方式。
Cpusets are logical, hierarchical groupings of CPUs and units of memory.
The cpuset facility is primarily a workload manager tool permitting a system administrator to restrict the number of processor and memory resources that a process or set of processes may use.
  • A cpuset defines a list of CPUs and memory nodes.
  • A process contained in a cpuset may only execute on the CPUs in that cpuset and may only allocate memory on the memory nodes in that cpuset.
  • Essentially, cpusets provide CPU and memory containers or “soft partitions” within which you can run sets of related tasks.

Linux 2.6 Kernel Support for Cpusets


  • Each task has a link to a cpuset structure that specifies the CPUs and memory nodes available for its use.
  • The kernel task scheduler is constrained to only schedule a task on the CPUs in that task's cpuset.
  • The kernel memory allocation mechanism is constrained to only allocate physical memory to a task from the memory nodes in that task's cpuset.

Cpuset Facility Capabilities


The cpuset facility allows you and your system service software to do the following:
  • Create and delete named cpusets.
  • Decide which CPUs and memory nodes are available to a cpuset.
  • Attach a task to a particular cpuset.
  • Identify all tasks sharing the same cpuset.
  • Exclude any other cpuset from overlapping a given cpuset, thereby, giving the tasks running in that cpuset exclusive use of those CPUs and memory nodes.
  • Perform bulk operations on all tasks associated with a cpuset, such as varying the resources available to that cpuset or hibernating those tasks in temporary favor of some other job.
  • Perform sub-partitioning of system resources using hierarchical permissions and resource management.

Initializing Cpusets


The kernel, at system boot time, initializes one cpuset, the root cpuset, containing the entire system's CPUs and memory nodes. Subsequent user space operations can create additional cpusets.
Mounting the cpuset virtual file system (VFS) at /dev/cpuset exposes the kernel mechanism to user space.

How to Determine if Cpusets are Installed


Check whether /proc/filesystems contains cpuset:

$ grep cpuset /proc/filesystems
nodev cpuset

If the /dev/cpuset/tasks file is not present on your system, the cpuset file system is not mounted. You can mount it as follows:

$ sudo mkdir /dev/cpuset
$ sudo mount -t cpuset cpuset /dev/cpuset
$ ls /dev/cpuset
cgroup.clone_children  cpuset.effective_cpus  cpuset.memory_pressure          cpuset.sched_load_balance        tasks
cgroup.procs           cpuset.effective_mems  cpuset.memory_pressure_enabled  cpuset.sched_relax_domain_level
cgroup.sane_behavior   cpuset.mem_exclusive   cpuset.memory_spread_page       machine
cpuset.cpu_exclusive   cpuset.mem_hardwall    cpuset.memory_spread_slab       notify_on_release
cpuset.cpus            cpuset.memory_migrate  cpuset.mems                     release_agent


Cpuset File System Directories


Each cpuset is represented by a directory in the cpuset virtual file system.
The state of each cpuset is represented by small text files in the directory for the cpuset. These files may be read and written using traditional shell utilities or using ordinary file access routines from programming languages.
Descriptions of the files in the cpuset directory (a usage sketch follows this list):
  • tasks
  • List of process IDs (PIDs) of tasks in the cpuset. The list is formatted as a series of ASCII decimal numbers, each followed by a newline. A task may be added to a cpuset (removing it from the cpuset previously containing it) by writing its PID to that cpuset's tasks file (with or without a trailing newline.)
  • notify_on_release
  • A flag (0 or 1). If set, the kernel runs the command named in the release_agent file when the cpuset becomes empty (no tasks and no child cpusets).
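
A minimal sketch of driving these files directly, once the cpuset VFS is mounted at /dev/cpuset (myset is a placeholder name; valid CPU and node numbers depend on your hardware):

# create a child cpuset restricted to CPUs 0-1 and memory node 0
sudo mkdir /dev/cpuset/myset
echo 0-1 | sudo tee /dev/cpuset/myset/cpuset.cpus
echo 0   | sudo tee /dev/cpuset/myset/cpuset.mems
# move the current shell (and its future children) into the cpuset
echo $$  | sudo tee /dev/cpuset/myset/tasks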

WiFi Debug


How to hide MAC address?


You are not going to modify the hardware; you are going to change RAM. When the computer starts, the MAC address is loaded into RAM, and we change that already-loaded MAC address.

  • install macchanger
  • 
    $ sudo apt-get install macchanger
    
  • Turn off your network interface
  • 
    # ifconfig eno1 down
    
  • Use macchanger to randomly generate new MAC address and assign it to eno1
  • 
    $ sudo macchanger -r eno1
    Current MAC:   5c:26:xx:xx:xx:xx (Dell Inc.)
    Permanent MAC: 5c:26:xx:xx:xx:xx (Dell Inc.)
    New MAC:       f2:68:fb:e3:61:54 (unknown)
    
    

Wireshark


Wireshark can be used to capture packet directly:

$ sudo apt-get install wireshark
$ sudo wireshark -i wlan0 -w <capture-file>


Wireless modes


Basically, the default "managed" mode of your networking card allows the networking device to receive packets that are sent to its MAC address.

$ sudo iwconfig wlp2s0
wlp2s0    IEEE 802.11  ESSID:"Jerry_DSL-5G"  
          Mode:Managed  Frequency:5.745 GHz ...


You can let the card enter "monitor" mode by typing in these commands

$ sudo ifconfig wlp2s0 down
$ sudo iwconfig wlp2s0 mode monitor
$ sudo ifconfig wlp2s0 up
$ sudo iwconfig wlp2s0


Aircrack-ng


Aircrack-ng is a whole suite of tools for wireless security auditing. It can be used to monitor, test, crack or attack wireless security protocols like WEP, WPA and WPA2. Aircrack-ng is command-line based and is available for Windows, macOS and other Unix-based operating systems.
We’ll only look at some important tools that are used more often in Wireless Security testing.

airodump-ng is used to capture raw 802.11 frames, with the intent of using them with aircrack-ng.

Installation


sudo apt-get install -y aircrack-ng

Usage


  • Kill all processes that might interfere with the wireless card, using airmon-ng
  • 
    $ sudo airmon-ng check kill
    
  • start Monitor mode on wireless card
  • 
    $ sudo airmon-ng start wlp2s0
    
  • Run ‘iwconfig’ again to list wireless details.
  • 
    $ sudo iwconfig
    
    After airmon-ng has started monitor mode on the wireless card, the interface will appear under a different name (e.g. wlan0mon).
  • search nearby Wireless Access Points and their properties.
  • 
    $ sudo airodump-ng wlan0mon
    
    You can narrow down the search using MAC (--bssid) and channel (-c) filters.
  • Start airodump-ng on the target AP
  • 
    $ sudo airodump-ng --channel [channel] --bssid [bssid] --write [file-name] wlan0mon
    
  • Wait for a client to connect to the access point, or deauthenticate a connected client (if any) so that their system will connect back automatically.
  • 
    $ sudo aireplay-ng --deauth [number of deauth packets] -a [bssid] -c [target_client_mac] wlan0mon
    
  • After the devices are disconnected from that access point, they reconnect, and the running airodump-ng will capture the handshake.
  • If the handshake is captured, it will appear at the top of the running airodump-ng display. The handshake is stored in the ‘/tmp/’ directory and contains the encrypted password, which can be brute-forced offline using a dictionary.
  • hack password
  • The handshake file contains the encrypted password, which can be brute-forced offline using a dictionary wordlist. You can download a wordlist from the following websites:
    • http://www.hackreports.com/2013/05/biggest-password-cracking-wordlist-with.html
    • https://crackstation.net/buy-crackstation-wordlist-password-cracking-dictionary.htm
    aircrack-ng tries to crack the key by combining each password in the wordlist with the access point name (ESSID) to compute a Pairwise Master Key (PMK); each PMK is then compared against the handshake file.
    
    $ sudo aircrack-ng -w [dictionary_wordlist] [handshake_filename]
    
  • stop Monitor mode on wireless card
  • 
    $ sudo airmon-ng stop wlan0mon
    

The Linux-PAM configuration file


/etc/pam.conf is made up of a list of rules:

service type control module-path module-arguments

  • service
  • The service names the application the rule applies to. When rules are kept under the /etc/pam.d/ directory rather than in /etc/pam.conf, the service field is omitted and the filename itself serves as the service name:
    
    /etc/pam.d
    ├── chfn
    .
    ├── cron
    .
    ├── login
    .
    ├── passwd
    .
    └── systemd-user
    
    
  • type
  • The type is the management group that the rule corresponds to. Valid values:
    • account
    • auth
    • password
    • session
  • control
  • module-path
  • module-arguments

Run a shell script as another user without password



sudo -H -u otheruser bash -c 'echo "I am $USER, with uid $UID"' 

  • -H
  • The -H (HOME) option requests that the security policy set the HOME environment variable to the home directory of the target user (root by default) as specified by the password database. Depending on the policy, this may be the default behavior.
  • -u
  • The -u (user) option causes sudo to run the specified command as a user other than root. To specify a uid instead of a user name, use #uid. When running commands as a uid, many shells require that the '#' be escaped with a backslash ('\'). Security policies may restrict uids to those listed in the password database. The sudoers policy allows uids that are not in the password database as long as the targetpw option is not set. Other security policies may not support this.

You can modify the /etc/pam.d/su file to allow su without a password.
With the following modification in /etc/pam.d/su, any user that is a member of group somegroup can su to otheruser without a password:

auth       sufficient pam_rootok.so
auth       [success=ignore default=1] pam_succeed_if.so user = otheruser
auth       sufficient   pam_succeed_if.so use_uid user ingroup somegroup
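
Reading the three rules in order: pam_rootok.so lets root through without a password; the [success=ignore default=1] line continues to the next rule only when the target user is otheruser, and otherwise skips one rule; the last line then succeeds without a password when the invoking user (use_uid) belongs to somegroup. Assuming the names above:

# run by a member of somegroup:
$ su - otheruser    # drops into otheruser's shell with no password prompt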


ip COMMAND CHEAT SHEET

Discussions of Networking Issues

Unable to use ping as regular user


Description

On a Linux system, when a regular user runs the ping command it returns the error "ping: icmp open socket: Operation not permitted", while the root user can use the command normally.

$ ip a
5: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0d:48:5f:45:73 brd ff:ff:ff:ff:ff:ff
    inet 192.168.168.170/24 brd 192.168.168.255 scope global dynamic noprefixroute enp4s0
       valid_lft 85616sec preferred_lft 85616sec
    inet6 2001:b011:5003:19dc:5b27:dfe9:ba8d:e6c/64 scope global dynamic noprefixroute 
       valid_lft 85618sec preferred_lft 85618sec
    inet6 fe80::7623:d501:113b:8e77/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
       
$ ping 192.168.168.1 -c 2 -w 3 -I enp4s0
ping: SO_BINDTODEVICE enp4s0: Operation not permitted

$ sudo ping 192.168.168.1 -c 2 -w 3 -I enp4s0
PING 192.168.168.1 (192.168.168.1) from 192.168.168.170 enp4s0: 56(84) bytes of data.
64 bytes from 192.168.168.1: icmp_seq=1 ttl=64 time=0.464 ms
64 bytes from 192.168.168.1: icmp_seq=2 ttl=64 time=0.499 ms

Solution


$ sudo chmod u+s /usr/bin/ping
$ ping 192.168.168.1 -c 2 -w 3 -I enp4s0
PING 192.168.168.1 (192.168.168.1) from 192.168.168.170 enp4s0: 56(84) bytes of data.
64 bytes from 192.168.168.1: icmp_seq=1 ttl=64 time=0.497 ms

Analysis

ping uses the ICMP protocol and needs to send ICMP packets, but only root may create raw ICMP sockets. Normally the ping binary has permissions -rwsr-xr-x, i.e. a file with the setuid bit set; once that bit has been stripped, regular users can no longer use the command.

An alternative to the setuid bit is the ping_group_range sysctl (see the Fedora change proposal "Changes/Enable SysctlPingGroupRange").
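A sketch of that approach (modern iputils ping can use unprivileged ICMP Echo sockets when the caller's group ID falls within this range):

$ sudo sysctl -w net.ipv4.ping_group_range="0 2147483647"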

Set Up the NFS server

server:

  • Install the NFS server
  • 
    $ sudo apt install nfs-kernel-server
    		
  • NFS server's configuration
  • /etc/exports:
    
    /home/jerrlee/work2 10.19.108.147/8(rw,sync,no_subtree_check,fsid=0,insecure_locks,insecure,no_root_squash)
    		
  • Modify folder's ownership
  • 
    $ sudo chown jerrlee /home/jerrlee/work2
    		
  • start the NFS server
  • 
    $ sudo systemctl restart nfs-kernel-server
    $ sudo service nfs-server status
    		
  • Export server's directories
  • 
    $ sudo exportfs -a
    		
  • Setup firewall
  • 
       sudo ufw allow  nfs
       sudo ufw allow sunrpc
       sudo ufw allow 111
       sudo ufw allow from 10.19.108.147/8  
       sudo ufw status
    		
client:
  • Install the client
  • 
       sudo apt install nfs-common      # Debian/Ubuntu (the equivalent package on RHEL/Fedora is nfs-utils)
    		
  • Check NFS shares exported by the NFS server
  • 
    $ showmount -e build-server
    Export list for build-server:
    /home/jerrlee/work2 10.19.108.147/8
    		
  • setup the firewall
  • 
       sudo ufw allow ssh
       sudo ufw allow nfs
       sudo iptables --flush
       sudo ufw allow from 10.19.108.147/8
    		
  • test the connections with the server
  • 
       nc -v -u 10.19.108.147 111
       nc -v  10.19.108.147 111
       nc -v 10.19.108.147 2049
       nc -v -u 10.19.108.147 2049
       rpcinfo -p 10.19.108.147
    		
  • Mount the NFS share
  • 
       sudo mount -t nfs -o vers=3 10.19.108.147:/home/jerrlee/work2 /home/jerry/work2
    		
  • Or add an entry to /etc/fstab so the share mounts at boot
  • 
       10.19.108.147:/home/jerrlee/work2 /home/jerry/work2 nfs vers=3 0 0
    		
