Linux Server Tips

Linux OS installation, configuration



init, systemd


The init process is the first process run by the kernel at the end of the bootstrap procedure.

In the old days:

init usually reads system-dependent initialization files from /etc/rc/* and brings the system to the state defined in the /etc/inittab file.
You can use the init command to reread this file or change to a new runlevel.
For example, to change from your current runlevel to runlevel 1 (the single-user mode), you can use the init 1 command.


"Runlevels" are an obsolete way to start and stop groups of services used in SysV init.
systemd provides a compatibility layer that maps runlevels to targets, and associated binaries like runlevel.
Mapping between runlevels and systemd targets:

       ┌─────────┬───────────────────┐
       │Runlevel │ Target            │
       ├─────────┼───────────────────┤
       │0        │ poweroff.target   │
       ├─────────┼───────────────────┤
       │1        │ rescue.target     │
       ├─────────┼───────────────────┤
       │2, 3, 4  │ multi-user.target │
       ├─────────┼───────────────────┤
       │5        │ graphical.target  │
       ├─────────┼───────────────────┤
       │6        │ reboot.target     │
       └─────────┴───────────────────┘
The command runlevel prints the previous and current SysV runlevel if they are known.

$ runlevel
N 5

The two runlevel characters are separated by a single space character. If a runlevel cannot be determined, N is printed instead. If neither can be determined, the word "unknown" is printed.


  /sbin/init -> /lib/systemd/systemd
systemd is a system and service manager for Linux operating systems. When run as the first process on boot (as PID 1), it acts as the init system that brings up and maintains userspace services.

When run as a system instance, systemd interprets the configuration file /etc/systemd/system.conf and the files in system.conf.d directories; when run as a user instance, systemd interprets the configuration file /etc/systemd/user.conf and the files in user.conf.d directories.
$ tree /etc/systemd -L 2
/etc/systemd
├── journald.conf
├── logind.conf
├── network
├── resolved.conf
├── system
│   ├── bluetooth.target.wants
│   ├── brltty.service -> /dev/null
│   ├── cloud-final.service.wants
│   ├── dbus-fi.w1.wpa_supplicant1.service -> /lib/systemd/system/wpa_supplicant.service
│   ├── dbus-org.bluez.service -> /lib/systemd/system/bluetooth.service
│   ├── dbus-org.freedesktop.Avahi.service -> /lib/systemd/system/avahi-daemon.service
│   ├── dbus-org.freedesktop.ModemManager1.service -> /lib/systemd/system/ModemManager.service
│   ├── dbus-org.freedesktop.nm-dispatcher.service -> /lib/systemd/system/NetworkManager-dispatcher.service
│   ├── dbus-org.freedesktop.resolve1.service -> /lib/systemd/system/systemd-resolved.service
│   ├── dbus-org.freedesktop.thermald.service -> /lib/systemd/system/thermald.service
│   ├── default.target.wants
│   ├── display-manager.service -> /lib/systemd/system/gdm3.service
│   ├── display-manager.service.wants
│   ├── final.target.wants
│   ├── getty.target.wants
│   ├── graphical.target.wants
│   ├── libvirt-bin.service -> /lib/systemd/system/libvirtd.service
│   ├── multi-user.target.wants
│   ├── network-online.target.wants
│   ├── oem-config.service.wants
│   ├── paths.target.wants
│   ├── printer.target.wants
│   ├── snap-core18-1279.mount
│   ├── snap-core18-1288.mount
│   ├── snap-core-8213.mount
│   ├── snap-core-8268.mount
│   ├── snap-gnome\x2d3\x2d26\x2d1604-97.mount
│   ├── snap-gnome\x2d3\x2d26\x2d1604-98.mount
│   ├── snap-gnome\x2d3\x2d28\x2d1804-110.mount
│   ├── snap-gnome\x2d3\x2d28\x2d1804-91.mount
│   ├── snap-gnome\x2dcalculator-536.mount
│   ├── snap-gnome\x2dcalculator-544.mount
│   ├── snap-gnome\x2dcharacters-367.mount
│   ├── snap-gnome\x2dcharacters-375.mount
│   ├── snap-gnome\x2dlogs-73.mount
│   ├── snap-gnome\x2dlogs-81.mount
│   ├── snap-gnome\x2dsystem\x2dmonitor-111.mount
│   ├── snap-gnome\x2dsystem\x2dmonitor-123.mount
│   ├── snap-gtk\x2dcommon\x2dthemes-1313.mount
│   ├── snap-gtk\x2dcommon\x2dthemes-1353.mount
│   ├── snap-vlc-1049.mount
│   ├── snap-vlc-1397.mount
│   ├── sockets.target.wants
│   ├── spice-vdagentd.target.wants
│   ├── sshd.service -> /lib/systemd/system/ssh.service
│   ├── sysinit.target.wants
│   ├── syslog.service -> /lib/systemd/system/rsyslog.service
│   ├── teamviewerd.service
│   └── timers.target.wants
├── system.conf
├── timesyncd.conf
├── user
│   └── default.target.wants
└── user.conf

  • The old init scripts started tasks one by one, in sequence, so even services with no dependencies had to wait for one another. Since today's hosts and operating systems almost all support multi-core architectures, there is no reason independent services cannot start at the same time. systemd starts services in parallel, so you will notice the system boots faster.
  • systemd is managed with the single systemctl command.
  • systemd can check service dependencies itself: if service B is built on top of service A, systemd automatically starts service A for you.
  • Just as SysV init has its runlevels, systemd collects many functions into so-called targets. A target describes the construction of an operating environment and groups many daemons, so activating a target means starting all of its daemons.
In the past, to create a system service you had to write a corresponding bash shell script under /etc/init.d/.
The main command to inspect and control systemd is systemctl. It can be used to examine the system state and manage the system and its services.
  • Analyzing the system state
    • Show system status
    • 
      $ systemctl status
      
    • List running units
    • 
      $ systemctl
      
      The available unit files can be seen in /usr/lib/systemd/system/ and /etc/systemd/system/ (the latter takes precedence).
    • List failed units
    • 
      $ systemctl --failed
      
    • List installed unit files
    • 
      $ systemctl list-unit-files
      
  • Using units
  • Units can be, for example, services (.service), mount points (.mount), devices (.device) or sockets (.socket). The following list describes the types of units available to systemd:
    • .service
    • A service unit describes how to manage a service or application on the server. This will include how to start or stop the service, under which circumstances it should be automatically started, and the dependency and ordering information for related software.
    • .socket
    • A socket unit file describes a network or IPC socket, or a FIFO buffer that systemd uses for socket-based activation. These always have an associated .service file that will be started when activity is seen on the socket that this unit defines.
    • .device
    • A unit that describes a device that has been designated as needing systemd management by udev or the sysfs filesystem. Not all devices will have .device files. Some scenarios where .device units may be necessary are for ordering, mounting, and accessing the devices.
    • .mount
    • This unit defines a mountpoint on the system to be managed by systemd. These are named after the mount path, with slashes changed to dashes. Entries within /etc/fstab can have units created automatically.
    • .automount
    • An .automount unit configures a mountpoint that will be automatically mounted. These must be named after the mount point they refer to and must have a matching .mount unit to define the specifics of the mount.
    • .swap
    • This unit describes swap space on the system. The name of these units must reflect the device or file path of the space.
    • .target
    • A target unit is used to provide synchronization points for other units when booting up or changing states. They also can be used to bring the system to a new state. Other units specify their relation to targets to become tied to the target’s operations.
    • .path
    • This unit defines a path that can be used for path-based activation. By default, a .service unit of the same base name will be started when the path reaches the specified state. This uses inotify to monitor the path for changes.
    • .timer
    • A .timer unit defines a timer that will be managed by systemd, similar to a cron job for delayed or scheduled activation. A matching unit will be started when the timer is reached.
    • .snapshot
    • A .snapshot unit is created automatically by the systemctl snapshot command. It allows you to reconstruct the current state of the system after making changes. Snapshots do not survive across sessions and are used to roll back temporary states.
    • .slice
    • A .slice unit is associated with Linux Control Group nodes, allowing resources to be restricted or assigned to any processes associated with the slice. The name reflects its hierarchical position within the cgroup tree. Units are placed in certain slices by default depending on their type.
    • .scope
    • Scope units are created automatically by systemd from information received from its bus interfaces. These are used to manage sets of system processes that are created externally.
    When using systemctl, you generally have to specify the complete name of the unit file, including its suffix, for example sshd.socket. If you do not specify the suffix, systemctl will assume .service. For example, netctl and netctl.service are equivalent.
    • Start a unit immediately
    • # systemctl start unit
    • Stop a unit immediately
    • # systemctl stop unit
    • Restart a unit
    • # systemctl restart unit
    • Ask a unit to reload its configuration
    • # systemctl reload unit
    • Show the status of a unit, including whether it is running or not
    • $ systemctl status unit
    • Check whether a unit is already enabled or not
    • $ systemctl is-enabled unit
    • Enable a unit to be started on bootup
    • # systemctl enable unit
    • Enable a unit to be started on bootup and Start immediately
    • # systemctl enable --now unit
    • Disable a unit to not start during bootup
    • # systemctl disable unit
    • Show the manual page associated with a unit (this has to be supported by the unit file)
    • $ systemctl help unit
    • Reload systemd manager configuration, scanning for new or changed units
    • # systemctl daemon-reload

Writing unit files
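
A systemd unit file is INI-style text. A minimal sketch of a custom service unit is shown below; the name myapp.service and the binary /usr/local/bin/myapp are hypothetical:

$ sudo vi /etc/systemd/system/myapp.service
[Unit]
Description=My example application (hypothetical)
After=network.target

[Service]
ExecStart=/usr/local/bin/myapp
Restart=on-failure

[Install]
WantedBy=multi-user.target

After creating or editing a unit file, reload the manager configuration and enable the unit:

# systemctl daemon-reload
# systemctl enable --now myapp.service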


Linux Server Hacks


Hack#2 Console Logging without password


Using GRUB to invoke bash,
  • Select the appropriate boot entry in the GRUB menu and press e to edit the line.
  • Select the kernel line and press e again to edit it.
  • Append init=/bin/bash at the end of line.
  • init=
    
    Run specified binary instead of /sbin/init as init process.
  • Press Ctrl-X to boot (this change is only temporary and will not be saved to your menu.lst). After booting you will be at the bash prompt.
  • Your root file system is mounted as readonly now, so remount it as read/write
  • mount -n -o remount,rw /
  • Use the passwd command to create a new root password.
  • Mount filesystems manually
  • /etc/fstab is not processed because init never ran. Run "mount -a" to mount everything listed in /etc/fstab, or mount the necessary filesystems manually.
  • Reboot by typing reboot -f and do not lose your password again!

Hack#5 n>&m


Each open file in Linux has a corresponding file descriptor associated with it.
The Bourne shell operator n>&m rearranges the file descriptors.
It makes descriptor n refer to the same open file as descriptor m.
Ex.,
  • output including error
  • 
    $ ls test* test.none
    ls: cannot access 'test.none': No such file or directory
    test.c  testc.cpp  test.cpio.gz  test.cpp  testelf.c  test.sh    
        
  • re-direct the standard output
  • 
    $ ls test* test.none 1>'./stdout'
    ls: cannot access 'test.none': No such file or directory
    $ cat ./stdout
    test.c
    testc.cpp
    test.cpio.gz
    test.cpp
    testelf.c
    test.sh
        
  • re-direct the standard error to the standard output
  • 
    $ ls test* test.none 1>'./stdout' 2>&1
    $ cat ./stdout
    ls: cannot access 'test.none': No such file or directory
    test.c
    testc.cpp
    test.cpio.gz
    test.cpp
    testelf.c
    test.sh
        
        
    The standard error is redirected to the standard output, which has already been redirected to a file.
POSIX definition of Redirection Operator:
In the shell command language, a token that performs a redirection function.
Redirections are processed in the order they appear, from left to right.
It is one of the following symbols:
  • command < file.txt
  • Gives input to a command.
  • command <> file.txt
  • Opens the file for both reading and writing on standard input. If the file doesn't exist, it will be created.
  • command > file.txt
  • Directs the output of a command into a file.
  • command >| file.txt
  • Does the same as >, but overwrites the target even if the shell's noclobber option is set.
  • command << WORD
  • A here document.
    
     command << WORD
         Text
     WORD    	
        
    Everything up to the terminating WORD (the Text) becomes the input to the command.
  • command >> file.txt
  • Does the same as >, except that if the target file exists, the new data are appended.
  • command <&num
  • Duplicates file descriptor num as the standard input.
  • command >&num
  • Duplicates file descriptor num as the standard output.
  • command <<- WORD
  • Like <<, but leading tab characters are stripped from the input lines and the delimiter, so the here document can be indented.
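Because redirections are processed left to right, their order matters. Reusing the ls example above: writing 2>&1 before the file redirection duplicates the terminal (the current standard output), so the error message still appears on screen:

$ ls test* test.none 2>&1 1>'./stdout'
ls: cannot access 'test.none': No such file or directory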

Linux disk and filesystem management


/etc/fstab


The file fstab contains descriptive information about the filesystems the system can mount.
Each filesystem is described on a separate line. Fields on each line are separated by tabs or spaces.

(file system) (mount point)   (type)  (options)       (dump)  (pass)

  • file system
  • the block device (/dev/xxx) or remote filesystem (host:dir) to be mounted.
    LABEL=label or UUID=uuid may be given instead of a device name.
  • mount point
  • directory
  • type
  • filesystem type
  • options
  • It is formatted as a comma-separated list of options.
  • dump
  • Defaults to zero (don't dump) if not present. Used by dump to determine which filesystems need to be dumped; this field is rarely used today.
  • pass
  • This field is used by fsck to determine the order in which filesystem checks are done at boot time. The root filesystem should be specified with 1. Other filesystems should have 2. Defaults to zero (don't fsck) if not present.

/etc/fstab is a list of filesystems to be mounted at boot time.
/etc/mtab is a list of currently mounted filesystems.

How to determine/find UUID of a partition?


In Linux, a UUID (Universally Unique Identifier) identifies media more accurately and reliably. Identifying media via /dev/hdXY or /dev/sdXY is not a good method, because the device order may differ between boots, so it is no longer preferred, especially in fstab or the GRUB config.

libuuid has been part of the util-linux-ng package since version 2.15.1 and is installed by default on Linux systems.
The UUIDs generated by this library can be reasonably expected to be unique within a system, and unique across all systems.
UUIDs are represented as 32 hexadecimal (base 16) digits, displayed in five groups separated by hyphens, in the form 8-4-4-4-12 for a total of 36 characters (32 alphanumeric characters and four hyphens).
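
The uuidgen utility, also shipped with util-linux, prints a fresh UUID in exactly this format (the value below is illustrative):

$ uuidgen
e40486c6-84d5-4f2f-b99c-032281799c9d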


How to find UUIDs of my hard disk partitions?


~$ tree /dev/disk/
/dev/disk/
├── by-id
│   ├── ata-TEAC_DVD-ROM_DV18SA_10091725083237 -> ../../sr0
│   ├── ata-WDC_WD2500BEKT-75A25T0_WD-WXQ1A80V7620 -> ../../sda
│   ├── ata-WDC_WD2500BEKT-75A25T0_WD-WXQ1A80V7620-part1 -> ../../sda1
│   ├── wwn-0x50014ee655d0b10e -> ../../sda
│   └── wwn-0x50014ee655d0b10e-part1 -> ../../sda1
├── by-partuuid
│   └── abfa7e81-01 -> ../../sda1
├── by-path
│   ├── pci-0000:00:1f.2-ata-1 -> ../../sda
│   ├── pci-0000:00:1f.2-ata-1-part1 -> ../../sda1
│   └── pci-0000:00:1f.2-ata-2 -> ../../sr0
└── by-uuid
    └── 3db7ffaf-51bc-4f72-a09d-5ec2f3904c08 -> ../../sda1

$ sudo blkid
[sudo] password for jerry: 
/dev/loop0: TYPE="squashfs"
/dev/loop1: TYPE="squashfs"
/dev/loop2: TYPE="squashfs"
/dev/loop3: TYPE="squashfs"
/dev/loop4: TYPE="squashfs"
/dev/loop5: TYPE="squashfs"
/dev/loop6: TYPE="squashfs"
/dev/loop7: TYPE="squashfs"
/dev/sda1: UUID="3db7ffaf-51bc-4f72-a09d-5ec2f3904c08" TYPE="ext4" PARTUUID="abfa7e81-01"
/dev/loop8: TYPE="squashfs"
/dev/loop9: TYPE="squashfs"
/dev/loop10: TYPE="squashfs"
/dev/loop11: TYPE="squashfs"
/dev/loop12: TYPE="squashfs"

$ cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
# / was on /dev/sda1 during installation
UUID=3db7ffaf-51bc-4f72-a09d-5ec2f3904c08 /               ext4    errors=remount-ro 0       1
/swapfile                                 none            swap    sw              0       0


How to generate a new UUID for a partition?


tune2fs allows the system administrator to adjust various tunable filesystem parameters on Linux ext2, ext3, or ext4 filesystems.

tune2fs [ -l ] [ -c max-mount-counts ] [ -e errors-behavior ] [ -f ] [ -i interval-between-checks ]
       [ -I new_inode_size ] [ -j ] [ -J journal-options ] [ -m reserved-blocks-percentage ]
       [ -o [^]mount-options[,...] ] [ -r reserved-blocks-count ] [ -u user ] [ -g group ]
       [ -C mount-count ] [ -E extended-options ] [ -L volume-label ] [ -M last-mounted-directory ]
       [ -O [^]feature[,...] ] [ -Q quota-options ] [ -T time-last-checked ] [ -U UUID ]
       [ -z undo_file ] device

The device specifier can either be a filename (i.e., /dev/sda1), or a LABEL or UUID specifier: "LABEL=volume-label" or "UUID=uuid". (i.e., LABEL=home or UUID=e40486c6-84d5-4f2f-b99c-032281799c9d).


~$ sudo tune2fs -U random /dev/sda1

tune2fs 1.44.1 (24-Mar-2018)
The UUID may only be changed when the filesystem is unmounted.

After changing the UUID, we should update /etc/fstab to use the new value.
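
A sketch of the full procedure, assuming the target filesystem is /dev/sdb1 and that it is currently unmounted (for the root filesystem, boot from a live system first):

$ sudo umount /dev/sdb1
$ sudo tune2fs -U random /dev/sdb1
$ sudo blkid /dev/sdb1    # confirm the new UUID before editing /etc/fstab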

LinuxServer hacks

Hack#8 Immutable Files in ext2/ext3


Linux provides us the access control by file and directory permissions on three levels: user, group and other. These file permissions provide the basic level of security and access control.

The umask utility is used to control the file-creation mode mask, which determines the initial value of file permission bits for newly created files. Because umask affects the current shell execution environment, it is usually implemented as built-in command of a shell.

$ umask
0022
$ umask -S
u=rwx,g=rx,o=rx

Linux also has advanced access control features like ACLs (Access Control Lists) and attributes. Attributes define properties of files.
a: append only
c: compressed
d: no dump
e: extent format
i: immutable
j: data journalling
s: secure deletion
t: no tail-merging
u: undeletable
A: no atime updates
C: no copy on write
D: synchronous directory updates
S: synchronous updates
T: top of directory hierarchy
For example, a file with the i attribute cannot be modified:
it cannot be deleted or renamed, no link can be created to it, and no data can be written to it.
When set, this attribute prevents even the superuser from erasing or changing the contents of the file.

Some Linux-native filesystems support several attributes that you can adjust with the chattr command; use the lsattr command to display the attributes of a file.
  • chattr
  • 
     chattr [-RVf] [-+=AacDdijsTtSu] [-v version] files...
    
  • lsattr
  • 
     lsattr [ -RVadv ] [ files...  ]
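A quick demonstration of the immutable attribute, using a hypothetical file important.conf (the exact lsattr output varies by filesystem):

$ sudo chattr +i important.conf
$ lsattr important.conf
----i---------e---- important.conf
$ sudo rm important.conf
rm: cannot remove 'important.conf': Operation not permitted
$ sudo chattr -i important.conf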
    

Hack#11 Finding and Eliminating setuid/setgid Binaries


Linux uses a combination of bits to store the permissions of a file. We can change the permissions using the chmod command, which essentially changes the ‘r’, ‘w’ and ‘x’ characters associated with the file.

$ chmod u=rwx filename
$ chmod go=rx filename
$ chmod g+w foobar
$ chmod a-w foobar
Note, a: all; use this instead of typing "ugo".

Further, the ownership of files depends on the uid (user ID) and the gid (group ID) of the creator. When we launch a process, it runs with the uid (effective user ID) and gid (effective group ID) of the user who launched it.
  • The setuid bit
  • When the setuid bit is set on an executable, it runs with the permissions of the file's owner instead of those of the user who launched it. To locate the setuid bit, look for an 's' instead of an 'x' in the owner's executable bit of the file permissions. An example of an executable with setuid permission is passwd,
    
    $ ls -l /usr/bin/passwd
    -rwsr-xr-x 1 root root 59640  三  23  2019 /usr/bin/passwd
    
    
    This means that passwd executes with the permissions of its owner, root. If a vulnerable program runs with root privileges, an attacker could gain root access to the system through it. To find all setuid files:
    
    find /usr/bin -perm -u+s -type f -print | xargs ls -ld
    
    
  • The setgid bit
  • The setgid bit affects both files and directories. When set on a file, it executes with the privileges of the group that owns the file. When the bit is set on a directory, files created in that directory receive the same group as the parent directory. This is used for directory sharing. To locate the setgid bit, look for an 's' in the group section of the file permissions. To find all setgid files:
    
    find /usr/bin -perm -g+s -type f -print | xargs ls -ld
    
    
  • The sticky bit
  • The sticky bit has no effect on files. When used on a directory, all the files inside the directory will be modifiable only by their owners, the directory owner, and the root user. Without the sticky bit set, any user with write and execute permissions on the directory can rename or delete contained files, regardless of the files' owners. Typically, it is set on the /tmp directory to prevent ordinary users from deleting or moving other users' files. The sticky bit is identifiable by a t reported where normally the executable x bit is shown, in the "other" section. A lowercase t implies that the executable bit is also present; otherwise you would see a capital T.
    drwxrwxrwt  22 root root       4096  一   1 12:35 tmp
    
    

Therefore,
  • setuid ==> personal file access permission forwarding
  • setgid ==> group file access permission forwarding sharing + parent group directory sharing
  • sticky bit ==> directory sharing without file sharing

File Permissions: ACLs


ACLs are a second level of discretionary permissions that may override the standard ugo/rwx ones. ACLs are used to define more fine-grained discretionary access control for files and directories.
An ACL consists of entries specifying access permissions on an associated object. ACLs can be configured per user, per group or via the effective rights mask.
  • Enabling ACLs in the Filesystem
  • The file system must be mounted with ACLs turned on. To check if it is on,
    
    $ sudo tune2fs -l /dev/sda1 | grep acl
    Default mount options:    user_xattr acl
    
    
    the partition(s) on which you want to enable ACL can be set in /etc/fstab.
    
     ...
     UUID=07aebd28-24e3-cf19-e37d-1af9a23a45d4    /home    ext4   defaults,acl   0   2
     ...
    
  • Set ACL by user
  • List the ACLs for a given file or directory.
    
    $ sudo touch testAcl.txt
    $ ll testAcl.txt
    -rw-r--r-- 1 root root 0  一  12 14:33 testAcl.txt
    
    # confirm settings
    $ getfacl testAcl.txt
    # file: testAcl.txt
    # owner: root
    # group: root
    user::rw-
    group::r--
    other::r--
    
    # try to write with "jerry"
    $ echo "acl" >  testAcl.txt
    bash: testAcl.txt: Permission denied
    
    # try to read with "jerry"
    $ cat  testAcl.txt
    
    # set write for "jerry"
    $ sudo setfacl -m u:jerry:w  testAcl.txt
    $ ll testAcl.txt
    -rw-rw-r--+ 1 root root 0  一  12 14:33 testAcl.txt
    $ getfacl testAcl.txt
    # file: testAcl.txt
    # owner: root
    # group: root
    user::rw-
    user:jerry:-w-
    group::r--
    mask::rw-
    other::r--
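
    To undo such entries later, setfacl removes a single entry with -x, or strips all extended ACL entries with -b (continuing with the same testAcl.txt):

    # remove jerry's entry only
    $ sudo setfacl -x u:jerry testAcl.txt
    # remove all extended ACL entries and the mask
    $ sudo setfacl -b testAcl.txt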
    
    

Hack#12 Make sudo Work hard


sudo allows a permitted user to execute a command as the superuser or another user.
sudo is a setuid program.

$ ls -l /usr/bin/sudo
-rwsr-xr-x 1 root root 149080  十  11 02:32 /usr/bin/sudo

The sudo command works in conjunction with security policies; the default security policy is sudoers, which is configured via the /etc/sudoers file.
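
The sudoers file should always be edited with visudo, which validates the syntax before saving. A sketch of a rule granting a hypothetical user jerry the right to run one command as root:

$ sudo visudo

# a hypothetical rule inside /etc/sudoers:
jerry   ALL=(ALL) /usr/bin/apt-get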

Hack#16 Fun with /proc


The /proc filesystem contains a representation of the kernel's live process table.
The directories with numeric names contain information about each process running on the system; the number corresponds to the PID.
Take a look at the structure for each process, some are useful:
  • cwd
  • exe
  • cmdline
  • environ
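
For example, the current shell's own entry can be inspected via its PID ($$); cmdline and environ are NUL-separated, so tr makes them readable:

$ ls -l /proc/$$/cwd             # symlink to the current working directory
$ tr '\0' ' ' < /proc/$$/cmdline; echo
$ tr '\0' '\n' < /proc/$$/environ | head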

Hack#18 Manage System resources Per Process


Bash provides a utility "ulimit".
“ulimit” isn’t a separate binary. It’s embedded into the shell itself.

$ ulimit
unlimited
With no options, ulimit reports the soft limit on file size (the -f resource), which is "unlimited" here.
To get the report in details,

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 15274
max locked memory       (kbytes, -l) 16384
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 15274
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

All the limits that’ll be applicable to the users are defined in:

/etc/security/limits.conf
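
Each line in limits.conf has the form <domain> <type> <item> <value>. A sketch that caps the number of open files for a hypothetical user jerry:

# /etc/security/limits.conf
jerry    soft    nofile    4096
jerry    hard    nofile    8192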

Partitioning


Partitioning a hard drive divides the available space into sections that can be accessed independently. An entire drive may be allocated to a single partition, or multiple ones for cases such as dual-booting, maintaining a swap partition, or to logically separate data such as audio and video files.

The required information is stored in a partition table scheme such as MBR or GPT.


MBR


The Master Boot Record (MBR) is the first 512 bytes of a storage device. It contains an operating system bootloader and the storage device's partition table. It plays an important role in the boot process under BIOS systems.

Note: The MBR is not located in a partition; it is located at the first sector(usually 512 bytes) of the device (physical offset 0), preceding the first partition.

  • bootstrap code
  • The first 440 bytes of MBR are the bootstrap code area. The bootstrap code can be backed up, restored from backup or erased using dd.
  • partition table
  • There are 3 types of partitions:
    • Primary
    • Primary partitions can be bootable and are limited to four partitions per disk
    • Extended
    • A hard disk can contain no more than one extended partition. The extended partition also counts as a primary partition, so a disk with an extended partition can hold at most three other primary partitions.
      • Logical
      • An extended partition can be subdivided into multiple logical partitions.
Each partition table entry describes its partition in two ways: the cylinder/head/sector (CHS) notation and the logical block addressing (LBA) notation. CHS notation is no longer used, because its 24-bit limit only covers 8 GB of disk space. A 32-bit LBA value covers up to 2 TB on drives with 512-byte sectors; for large-capacity drives with 4K sectors, the limit rises to 16 TB. Another drawback of the MBR is that the disk can hold at most four partition entries; if more than four partitions are needed, a special partition called an extended partition must be created. The extended partition structure has no form of error detection, so corruption is hard to locate.

GPT


The GUID partition table (GPT) partitioning scheme was introduced by Intel as part of an effort to introduce more modern firmware to generic PC hardware.
GPT is part of the Unified Extensible Firmware Interface (UEFI) specification; it uses globally unique identifiers (GUIDs), or UUIDs in the Linux world, to define partitions and partition types.
Compared with MBR, the GPT (Globally Unique Identifier Partition Table, or GUID Partition Table) partitioning style supports up to 128 partitions, with each partition as large as 18 EB (exabytes).
The high-level summary of the block layout used by GPT:

       ┌─────────────────────┬───────────────────────────────────┐
       │Block                │ Description                       │
       ├─────────────────────┼───────────────────────────────────┤
       │0                    │ Protective MBR                    │
       ├─────────────────────┼───────────────────────────────────┤
       │1                    │ Partition Table Header (primary)  │
       ├─────────────────────┼───────────────────────────────────┤
       │2 through 2+b-1      │ Partition Entry Array (primary)   │
       ├─────────────────────┼───────────────────────────────────┤
       │2+b through n-2-b    │ partition data                    │
       ├─────────────────────┼───────────────────────────────────┤
       │n-2-b+1 through n-2  │ Partition Entry Array (backup)    │
       ├─────────────────────┼───────────────────────────────────┤
       │n-1                  │ Partition Table Header (backup)   │
       └─────────────────────┴───────────────────────────────────┘
(Here n is the total number of blocks on the device and b is the size of the partition entry array in blocks.)


  • Protective MBR
  • At the start of a GPT disk there is a protective Master Boot Record (PMBR) to protect against GPT-unaware software. This protective MBR just like an ordinary MBR has a bootstrap code area which can be used for BIOS/GPT booting with boot loaders that support it. A GPT-unaware program sees the GPT disk as an MBR disk with a single, unknown partition.
  • Partition Table Header (primary)
  • A structure that defines various aspects of the disk:
    • a GUID to uniquely identify the disk
    • the starting block of the partition entry array
    • the size of each partition entry in that array
  • Partition Entry Array (primary)
  • An array of partition entries, each of which defines a partition (or is all zero, indicating that the entry is not in use). This array is treated as an array of bytes. The first partition entry starts at the first byte of the array, the next partition entry follows immediately after it, and so on. The size of these entries is given by a field in the partition table header. Each partition entry contains:
    • a GUID to uniquely identify the partition itself
    • a GUID to identify the partition type
    • the start and end block of the partition
    • the partition name
  • partition data
  • Partition Entry Array (backup)
  • Partition Table Header (backup)

Tools and Usages


  • Check for an existing partition
  • 
    $ sudo fdisk -l /dev/sdb
    Disk /dev/sdb: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: dos
    Disk identifier: 0x00027921
    
    Disklabel type indicates what partition table is applied: dos(for MBR) or gpt.
  • Create new partition table
  • 
    $ sudo parted /dev/sdb
    GNU Parted 3.2
    Using /dev/sdb
    Welcome to GNU Parted! Type 'help' to view a list of commands.
    
    • create a new GPT partition table
    • 
      (parted) mklabel gpt  
      
    • create a new MBR partition table (parted calls this label type msdos)
    • 
      (parted) mklabel msdos
      
  • Create new partitions
  • 
     mkpart part-type fs-type start end
    
    where
    • part-type
    • This is meaningful only for MBR partition tables:
      • primary
      • extended
      • logical
    • fs-type
    • This can be listed by entering help mkpart.
    • start
    • the beginning of the partition from the start of the device
    • end
    • the end of the partition from the start of the device
    Both start and end consist of a number followed by a unit:
    • s
    • sector (n bytes depending on the sector size, often 512)
    • MB
    • megabyte (1000000 bytes)
    • GB
    • gigabyte (1000000000 bytes)
    • %
    • percentage of the device (between 0 and 100)
    If you do not want to leave unused space in the device, make sure that each partition starts where the previous one ends.
    
    (parted) mkpart fat32 0GB 30GB
    (parted) print                                                            
    Model: Seagate Portable (scsi)
    Disk /dev/sdb: 1000GB
    Sector size (logical/physical): 512B/512B
    Partition Table: gpt
    Disk Flags: 
    
    Number  Start   End     Size    File system  Name   Flags
     1      1049kB  30.0GB  30.0GB               fat32
    
    (parted) mkpart ext4 30GB 250GB                                           
    (parted) print                                                           
    Model: Seagate Portable (scsi)
    Disk /dev/sdb: 1000GB
    Sector size (logical/physical): 512B/512B
    Partition Table: gpt
    Disk Flags: 
    
    Number  Start   End     Size    File system  Name   Flags
     1      1049kB  30.0GB  30.0GB               fat32
     2      30.0GB  250GB   220GB                ext4
    
    
    
  • delete a partition
  • 
    rm PartitionNumber
    
To get the partition information,

$ sudo blkid
/dev/sdb1: UUID="EC2E-4699" TYPE="vfat" PARTLABEL="fat32" PARTUUID="a884291d-6558-4acb-9c2e-8d56ba0cbf21"
/dev/sdb2: UUID="99349696-e669-4ec9-8f7c-475ea4e97102" TYPE="ext4" PARTLABEL="ext4" PARTUUID="ec6f13f8-45a5-4e74-9f91-fdd1741fc3b6"
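
The TYPE fields appear because filesystems were created on the new partitions after partitioning; a sketch of that step, assuming the two partitions created above:

$ sudo mkfs.vfat -F 32 /dev/sdb1
$ sudo mkfs.ext4 /dev/sdb2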


Booting Process


  • BIOS Booting
  • The BIOS firmware will be told which disk to boot the system from. It executes the bootloader it finds in the MBR of the specified disk, and that’s it. The firmware is no longer involved in booting. The BIOS firmware layer doesn’t really know what a bootloader is, or what an operating system is. All it can do is run the boot loader from a disk’s MBR.
    • bootloader is a Linux loader
    • GRUB does not fit in 440 bytes, the size of the Master Boot Record. Therefore, the bootstrap code that is loaded actually just parses the partition table, finds the /boot partition, and parses the filesystem information, it then loads Stage 2 GRUB. Stage 2 GRUB loads everything it needs, including the GRUB configuration, then presents a menu (or not, depending on user configuration). After a boot sequence is chosen, the Linux loader knows where the kernel file is and will load Linux kernel in RAM and execute it.
    • bootloader is not a Linux Loader
    • The bootstrap code loads the MBR of the active partition where the Linux loader is installed in.
    Both of the above methods load the kernel file through calls to the BIOS INT 13h service, which requires the Linux kernel file to be installed within the first 1024 cylinders. BIOS systems that are partitioned with GPT require a BIOS boot partition if GRUB is used as the bootloader.
  • UEFI Booting
  • UEFI stands for Unified Extensible Firmware Interface. It's a standard specification for the firmware interface on a computer. UEFI systems require an EFI system partition. The EFI system partition or ESP is a partition on a data storage device that is used by computers adhering to the UEFI. When a computer is booted, UEFI firmware loads files stored on the ESP to start installed operating systems and various utilities. The EFI system partition is formatted with a file system whose specification is based on the FAT file system and maintained as part of the UEFI specification. Both GPT- and MBR-partitioned disks can contain an EFI system partition, as UEFI firmware is required to support both partitioning schemes. UEFI provides backward compatibility with legacy systems by reserving the first block (sector) of the partition for compatibility code, the first sector of a partition is loaded into memory and execution is transferred to this code on legacy BIOS-based system. Many UEFI firmwares can boot a system just like a BIOS firmware would – they can look for an MBR on a disk, and execute the boot loader from that MBR, and leave everything subsequently up to that bootloader. Regular UEFI boot has several lists of possible boot entries, stored in UEFI config variables (normally in NVRAM), and boot order config variables stored alongside them. It allows for many different boot options, and a properly-defined fallback order. In many cases, you can even list and choose which OS / boot loader to use from the system boot menu (similar to the boot device menu implemented in many BIOSes). The boot sequence for UEFI consists of the following:
    • The boot order list is read from a globally defined NVRAM variable. Modifications to this variable are only guaranteed to take effect after the next platform reset. The boot order list defines a list of NVRAM variables that contain information about what is to be booted. Each NVRAM variable defines a name for the boot option that can be displayed to a user.
    • The variable also contains a pointer to the hardware device and to a file on that hardware device that contains the UEFI image to be loaded.
    • The variable might also contain paths to the OS partition and directory along with other configuration specific directories

Drive and partition backups with dd


dd is a command whose name is popularly said to stand for "data duplicator".
Note that dd copies "empty" space too, so if the partition is 200MB in size, even if it only contains 100MB of data, the output file will be 200MB in size.
  • One of the most typical use cases for the utility is the backup of the MBR.
  • To backup the MBR of /dev/sda disk,
    $ sudo dd if=/dev/sda bs=512 count=1 of=mbr.img
    
  • make an exact clone of the /boot partition to a backup file
  • 
    # dd if=/dev/sda1 of=/srv/boot.img
    
  • partitions can be restored from these backup copies
  • 
    # dd if=/srv/boot.img of=/dev/sda1
    
  • duplicate an existing drive to another
    • use fdisk to recreate the appropriately-sized partitions
    • 
      sfdisk -d /dev/sda | sfdisk /dev/sdb
      
    • make sure the partitions on /dev/sdb match those on /dev/sda
    • 
      # fdisk -l /dev/sda; fdisk -l /dev/sdb
      
    • copy the MBR from the first disk to the second
    • 
      # dd if=/dev/sda of=/dev/sdb bs=446 count=1
      
    • copy each partition
    • 
      # dd if=/dev/sda1 of=/dev/sdb1
      # dd if=/dev/sda2 of=/dev/sdb2
      

Understanding File System Superblock in Linux


Blocks in File System


The hard disk sector is the basic storage unit of the drive.
When a partition or disk is formatted, the sectors on the hard disk are first divided into small groups. These groups of sectors are called blocks. The block size of a file system is a software construct.
The Linux kernel performs all its operations on a file system using the file system's block size. The block size can never be smaller than the hard disk's sector size and will always be a multiple of the sector size. The Linux kernel also requires the file system block size to be smaller than or equal to the system page size. Use getconf to query system configuration variables:

$ getconf PAGE_SIZE
4096

The block size is something that can be specified when a user formats a partition using the command line parameters available.

mkfs -t ext3 -b 4096 /dev/sda1

The block size you select will impact the following things:
  • Maximum File Size
  • Maximum File System Size
  • Performance
The ext file system divides the entire space of the partition into equal-sized block groups. Block groups exist for ease of access during reads and writes. The number of blocks per group is fixed and cannot be changed.
The layout of a standard block group is approximately as follows: superblock, group descriptors, reserved GDT blocks, data block bitmap, inode bitmap, inode table, and then the data blocks.

For the special case of block group 0, the first 1024 bytes are unused, to allow for the installation of x86 boot sectors and other oddities. The superblock will start at offset 1024 bytes.

Superblock


The superblock records various information about the enclosing filesystem, such as block counts, inode counts, supported features, maintenance information, and more.
Similar to how inodes store the metadata of files, the superblock stores the metadata of the filesystem.
The superblock information of an existing file system can be viewed by using dumpe2fs,

$ sudo dumpe2fs -h /dev/sda1
[sudo] password for jerry: 
dumpe2fs 1.44.1 (24-Mar-2018)
Filesystem volume name:   
Last mounted on:          /
Filesystem UUID:          3db7ffaf-51bc-4f72-a09d-5ec2f3904c08
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              15269888
Block count:              61049344
Reserved block count:     3052467
Free blocks:              47122648
Free inodes:              14341481
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      1024
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Mon Jun 24 16:11:33 2019
Last mount time:          Tue Dec 31 16:02:30 2019
Last write time:          Tue Dec 31 16:02:23 2019
Mount count:              102
Maximum mount count:      -1
Last checked:             Mon Jun 24 16:11:33 2019
Check interval:           0 ()
Lifetime writes:          550 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:           256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
First orphan inode:       7837802
Default directory hash:   half_md4
Directory Hash Seed:      d9d5e4cd-c2d0-491c-88aa-1762b2295bb1
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0xcb19d304
Journal features:         journal_incompat_revoke journal_64bit journal_checksum_v3
Journal size:             1024M
Journal length:           262144
Journal sequence:         0x0019c036
Journal start:            131623
Journal checksum type:    crc32c
Journal checksum:         0x0f28cdd7


STORAGE ADMINISTRATION GUIDE

2.3. THE /PROC VIRTUAL FILE SYSTEM

The following /proc files are relevant in managing and monitoring system storage:
  • /proc/devices
  • Displays various character and block devices that are currently configured.
  • /proc/filesystems
  • Lists all file system types currently supported by the kernel.
  • /proc/mdstat
  • Contains current information on multiple-disk or RAID configurations on the system, if they exist.
  • /proc/mounts
  • Lists all mounts currently used by the system.
  • /proc/partitions
  • Contains partition block allocation information.


CHAPTER 18. USING THE MOUNT COMMAND

  • Listing Currently Mounted File Systems
  • 
    mount        
            
  • To display only the devices with a certain file system type
  • 
    mount -t ext4        
            
  • MOUNTING A FILE SYSTEM
  • 
    mount [option…] device directory        
            
    The device can be identified by:
    • a full path to a block device
    • /dev/sda3
    • a universally unique identifier
    • UUID=34795a28-ca6d-4fd8-a347-73671d0c19cb
    • a volume label
    • LABEL=home
  • SHARING MOUNTS
  • The mount command implements the --bind option that provides a means for duplicating certain mounts.
    
    mount --bind old_directory new_directory        
            
    This allows the file system under old_directory to be accessed via new_directory.
    To also duplicate the mounts nested under old_directory (its submounts), use
    
    mount --rbind old_directory new_directory        
            



6.4. BACKUP EXT2/3/4 FILE SYSTEMS

If the partition being backed up is an operating system partition, bootup your system into Single User Mode. Use dump to backup the contents of the partitions:

# dump -0uf /backup-files/sda1.dump /dev/sda1
Note:
  • If the system has been running for a long time, it is advisable to run e2fsck on the partitions before backup.
  • dump should not be used on a heavily loaded, mounted filesystem, as it could back up corrupted versions of files.
-level#
       The dump level (any integer). A level 0, full backup, specified by -0, guarantees that the entire file system is copied.

-f file
       Write the backup to file.

-u     Update the file /var/lib/dumpdates after a successful dump. The format of /var/lib/dumpdates is human-readable.

6.5. RESTORE AN EXT2/3/4 FILE SYSTEM

If you are restoring an operating system partition, bootup your system into Rescue Mode.
  • Format the destination partitions by using the mkfs command
  • Prepare the working directories.
  • 
    # mkdir /mnt/sda1
    # mount -t ext3 /dev/sda1 /mnt/sda1
    
  • Restore the data.
  • restore restores files or file systems from backups made with dump.
    
    # cd /mnt/sda1
    # restore -rf /backup-files/sda1.dump
    

12.2. FILESYSTEM-SPECIFIC INFORMATION FOR FSCK


The generic 'fsck' command will attempt to detect the filesystem type, or it will accept parameters specifying the type. 'e2fsck' is essentially a shortcut saying it's an ext2 filesystem. They all behave the same way and check the filesystem for errors.
fsck is simply a front end that calls the appropriate tool for the filesystem in question,

lrwxrwxrwx 1 root root      8  六  24  2019 /sbin/dosfsck -> fsck.fat
-rwxr-xr-x 1 root root 314080  九  27 02:01 /sbin/e2fsck
-rwxr-xr-x 1 root root  47232  八  23 07:47 /sbin/fsck
-rwxr-xr-x 1 root root  34928  八  23 07:47 /sbin/fsck.cramfs
lrwxrwxrwx 1 root root      6  九  27 02:01 /sbin/fsck.ext2 -> e2fsck
lrwxrwxrwx 1 root root      6  九  27 02:01 /sbin/fsck.ext3 -> e2fsck
lrwxrwxrwx 1 root root      6  九  27 02:01 /sbin/fsck.ext4 -> e2fsck
-rwxr-xr-x 1 root root  59472  一  25  2017 /sbin/fsck.fat
-rwxr-xr-x 1 root root  92264  八  23 07:47 /sbin/fsck.minix
lrwxrwxrwx 1 root root      8  六  24  2019 /sbin/fsck.msdos -> fsck.fat
lrwxrwxrwx 1 root root      8  六  24  2019 /sbin/fsck.vfat -> fsck.fat


If these filesystems encounter metadata inconsistencies while mounted, they record this fact in the filesystem superblock. If e2fsck finds that a filesystem is marked with such an error, it will perform a full check.

CHAPTER 13. PARTITIONS

parted is a program to manipulate disk partitions.

$ sudo parted -l 
[sudo] password for jerry: 
Model: ATA WDC WD2500BEKT-7 (scsi)
Disk /dev/sda: 250GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags: 

Number  Start   End    Size   Type     File system  Flags
 1      1049kB  250GB  250GB  primary  ext4         boot

CHAPTER 14. LVM (LOGICAL VOLUME MANAGER)

LVM2 can be used to gather existing storage devices into groups and allocate logical units from the combined space as needed. Physical volumes are regular storage devices; LVM writes a header to the device to allocate it for management. LVM combines physical volumes into storage pools known as volume groups.

A volume group can be sliced up into any number of logical volumes. Logical volumes are functionally equivalent to partitions on a physical disk, but with much more flexibility. In summary, LVM combines physical volumes into volume groups to unify the storage space available on a system; administrators can then segment the volume group into arbitrary logical volumes, which act as flexible partitions.

Each volume within a volume group is segmented into small, fixed-size chunks called extents. The extents on a physical volume are called physical extents, while the extents of a logical volume are called logical extents. A logical volume is simply a mapping that LVM maintains between logical and physical extents.

To use LVM, the first step is to scan the system for block devices that LVM can see and manage.

$ sudo lvmdiskscan | grep sd
  /dev/sda1   [     232.88 GiB] 
  /dev/sdb1   [     <27.94 GiB] 
  /dev/sdb2   [     204.89 GiB] 
  /dev/sdb3   [     <46.57 GiB] 

Warning: Make sure that you double-check that the devices you intend to use with LVM do not have any important data already written to them. Using these devices within LVM will overwrite the current contents.

We can mark these 2 physical partitions as physical volumes within LVM using the pvcreate command:

$ sudo pvcreate /dev/sdb2 /dev/sdb3
WARNING: ext4 signature detected on /dev/sdb2 at offset 1080. Wipe it? [y/n]: y
  Wiping ext4 signature on /dev/sdb2.
WARNING: ext4 signature detected on /dev/sdb3 at offset 1080. Wipe it? [y/n]: y
  Wiping ext4 signature on /dev/sdb3.
  Physical volume "/dev/sdb2" successfully created.
  Physical volume "/dev/sdb3" successfully created.
pvcreate initializes a physical volume (PV) so that it is recognized as belonging to LVM, and allows the physical volume to be used in a volume group (VG). A PV can be a disk partition, a whole disk, a meta device, or a loopback file. We can use pvs to display information about physical volumes,

$ sudo pvs
  PV         VG Fmt  Attr PSize   PFree  
  /dev/sdb2     lvm2 ---  204.89g 204.89g
  /dev/sdb3     lvm2 ---  <46.57g <46.57g

PVs can be removed:

$ sudo pvremove  /dev/sdb2 /dev/sdb3
  Labels on physical volume "/dev/sdb2" successfully wiped.
  Labels on physical volume "/dev/sdb3" successfully wiped.

Creating Volume Groups:

$ sudo vgcreate vg1 /dev/sdb2 /dev/sdb3
  Physical volume "/dev/sdb2" successfully created.
  Physical volume "/dev/sdb3" successfully created.
  Volume group "vg1" successfully created

We can see a brief summary of the volume group,

$sudo vgs
  VG  #PV #LV #SN Attr   VSize   VFree  
  vg1   2   0   0 wz--n- 251.45g 251.45g

Currently, the volume group has two physical volumes, zero logical volumes, and the combined capacity of the underlying devices. We can use the VG as a pool from which to allocate logical volumes. Use vgcreate to create a new VG on a PV, or vgextend to add a PV to an existing VG. To create logical volumes, we use the lvcreate command:
  • pass in the volume group to pull from
  • name the logical volume with the -n option
  • specify the size with the -L option
We can create the first three logical volumes like this:

$ sudo lvcreate -L 10G -n projects vg1
  Logical volume "projects" created.
$ sudo lvcreate -L 5G -n www vg1
  Logical volume "www" created.
$ sudo lvcreate -L 20G -n db vg1
  Logical volume "db" created.

We can see the logical volumes and their relationship to the volume group,

$ sudo vgs -o +lv_size,lv_name
  VG  #PV #LV #SN Attr   VSize   VFree   LSize  LV      
  vg1   2   3   0 wz--n- 251.45g 216.45g 10.00g projects
  vg1   2   3   0 wz--n- 251.45g 216.45g  5.00g www     
  vg1   2   3   0 wz--n- 251.45g 216.45g 20.00g db   

Now, we can allocate the rest of the space in the volume group to the “workspace” volume using the -l flag,

$ sudo lvcreate -l 100%FREE -n workspace vg1
  Logical volume "workspace" created.

$ sudo vgs -o +lv_size,lv_name
  VG  #PV #LV #SN Attr   VSize   VFree LSize   LV       
  vg1   2   4   0 wz--n- 251.45g    0   10.00g projects 
  vg1   2   4   0 wz--n- 251.45g    0    5.00g www      
  vg1   2   4   0 wz--n- 251.45g    0   20.00g db       
  vg1   2   4   0 wz--n- 251.45g    0  216.45g workspace

As you can see, the “vg1” volume group is completely allocated. The logical volume devices are available within the /dev directory just like other storage devices. You can access them in two places:

/dev/volume_group_name/logical_volume_name
/dev/mapper/volume_group_name-logical_volume_name
and format logical volumes with the Ext4 filesystem,

$ sudo mkfs.ext4 /dev/vg1/projects
$ sudo mkfs.ext4 /dev/vg1/www
$ sudo mkfs.ext4 /dev/vg1/db
$ sudo mkfs.ext4 /dev/vg1/workspace
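
Once formatted, a logical volume mounts like any other block device; a sketch using a hypothetical mount point:

$ sudo mkdir -p /mnt/projects
$ sudo mount /dev/vg1/projects /mnt/projects
$ df -h /mnt/projects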

Linux Server Hacks, Volume Two: Storage Management and Backups

#46 Create Flexible Storage with LVM

Logical volumes are filesystems that appear to be a single volume but are actually assembled from space allocated on multiple physical partitions. The size of a logical volume can exceed the size of any single physical storage device on your system, but it cannot exceed the sum of all their sizes.

Linux process management, user management and package management

Process Management


Every process has 6 or more IDs associated with it:
  • real user ID and real group ID
  • The real IDs of the executor. Only the superuser can change the real IDs.
  • effective user ID, effective group ID, supplementary group IDs
  • These determine file access permissions. If the set-user-ID or set-group-ID bit of a file is set, the effective user ID or effective group ID is set to that of the file's owner or group.
  • saved set-user-ID and saved set-group-ID
  • These are copied from the effective IDs by exec.

A process refers to a program in execution; it’s a running instance of a program.
The only way a new process is created by the kernel is when an existing process calls the fork() function.
The new process created by fork is called the child process.
Both the child and parent continue executing with the instruction that follows the call to fork. The child is a copy of its parent:
  • data
  • heap
  • stack
The parent and child do not share these portions of memory.
A fork is often followed by an exec.
The child process has the same environment as its parent; only the process ID differs. An executing program is identified by its process ID (PID) as well as its parent process's ID (PPID). You can use the pidof command to find the ID of a process:

$ pidof init
1

To find the process ID and parent process ID of the current shell, run:

$ echo $$
$ echo $PPID


An application that can only do one thing at a time is severely limited. This is where threads step in.

A process can have multiple threads.
Meaning threads will be part of a process (all threads of the same process will share same PID).

In Linux, processes and threads are almost the same. The major difference is that threads share the same virtual memory address space(not a copy). Processes run in separate virtual memory spaces.
A thread is a path of execution within a process. Threads share with other threads their code section, data section, and OS resources (like open files and signals). But, like process, a thread has its own program counter (PC), register set, and stack space.

The low level interface to create threads is the clone() system call. The higher level interface is pthread_create().

#include <pthread.h>

       int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                          void *(*start_routine) (void *), void *arg);

       Compile and link with -pthread.
The pthread_create() function starts a new thread in the calling process.
The new thread starts execution by invoking start_routine(); arg is passed as the sole argument of start_routine().

Why threads? Because communication between processes is not simple:
  • Inter-process communication involves overhead and is comparatively slow.
  • Context switching between threads is faster than switching between processes.
Although you can use the clone() system call to create a thread, it is recommended to use pthread_create() for portability reasons.
In Linux, thread IDs are reported as LWPs (Light Weight Processes); the corresponding ps column is also named LWP:

$ ps -efL
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
root         1     0     1  0    1 十二23 ?     00:00:10 /sbin/init splash
root         2     0     2  0    1 十二23 ?     00:00:00 [kthreadd]
...
jerry     2464     1  2464  0   33 十二23 tty2  00:11:23 /opt/google/chrome/chrome
jerry     2464     1  2472  0   33 十二23 tty2  00:00:00 /opt/google/chrome/chrome
jerry     2464     1  2479  0   33 十二23 tty2  00:00:00 /opt/google/chrome/chrome
jerry     2464     1  2480  0   33 十二23 tty2  00:00:00 /opt/google/chrome/chrome
jerry     2464     1  2483  0   33 十二23 tty2  00:00:00 /opt/google/chrome/chrome
jerry     2464     1  2484  0   33 十二23 tty2  00:04:37 /opt/google/chrome/chrome
jerry     2464     1  2485  0   33 十二23 tty2  00:00:00 /opt/google/chrome/chrome


A process group is a collection of one or more processes. Each process group can have a process group leader; the leader is identified by having a process ID equal to its process group ID.

A session is a collection of one or more process groups. A process establishes a new session by calling the setsid() function.

A controlling terminal is the terminal device(tty/pts):
  • A session can only have one controlling terminal.
  • The session leader that establishes the connection to the controlling terminal is called the controlling process.
  • The process groups within a session can be divided into a single foreground process group, and, one or more background process groups.

Foreground processes (also referred to as interactive processes) are initialized and controlled through a terminal session. Background processes (also referred to as non-interactive/automatic processes) – are processes not connected to a terminal; they don’t expect any user input. A new process is normally created when an existing process makes an exact copy of itself in memory by fork().

Job control allows us to start multiple jobs from a single terminal and controls which jobs can access the terminal and which jobs are to run in the background.
Job control requires:
  • A shell that supports job control
  • The terminal driver supports job control
  • Support for job control signals
To start a process in the background (non-interactive), use the & symbol. bash then assigns the command a job number. To see which jobs are currently in the background, use the jobs command; adding the -l option also lists each job's PID. The fg command brings a background job back to the foreground.
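A short illustration (the job number and PID shown are hypothetical):

$ sleep 300 &
[1] 12345
$ jobs -l
[1]+ 12345 Running                 sleep 300 &
$ fg %1
sleep 300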
pgrep looks through the currently running processes and lists the process IDs which match the selection criteria to stdout. All the criteria have to match. For example,

              $ pgrep -u root sshd
will only list the processes named sshd AND owned by root. On the other hand,

              $ pgrep -u root,daemon
will list the processes owned by root OR daemon. pkill will send the specified signal (by default SIGTERM) to each process matching a pattern.

              $ pkill chrom
The above kills all processes whose names contain "chrom", such as chrome. By contrast, "killall" needs an exact process name.

PROCESS STATE CODES:
Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state of a process:
D    uninterruptible sleep (usually IO)
R    running or runnable (on run queue)
S    interruptible sleep (waiting for an event to complete)
T    stopped by job control signal
t    stopped by debugger during tracing
W    paging (not valid since the 2.6.xx kernel)
X    dead (should never be seen)
Z    defunct ("zombie") process, terminated but not reaped by its parent
For BSD formats and when the stat keyword is used, additional characters may be displayed:
<    high-priority (not nice to other users)
N    low-priority (nice to other users)
L    has pages locked into memory (for real-time and custom IO)
s    is a session leader
l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
+    is in the foreground process group

Linux Server Hacks

Remove Unnecessary Services

Use "ps ax" to check if unnecessary services exists:
  • NFS
  • portmap, rpc.mountd, rpc.nfsd
  • Samba
  • smbd, nmbd
  • automount
  • All statically mount is setup via /etc/fstab.
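A quick way to look for these daemons (the pattern list is just an illustration):

$ ps ax | egrep 'rpc|smbd|nmbd|automount'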

User Management

  • Creating a User
  • adduser
  • Deleting, disabling account
    • passwd -l 'username'
    • Lock the password of the named account. This option disables a password by changing it to a value which matches no possible encrypted value (it adds a '!' at the beginning of the password).
    • userdel -r 'username'
    • With this option, files in the user's home directory will be removed along with the home directory itself and the user's mail spool.
  • Modify groups
  • The groupmod command modifies the definition of the specified GROUP by modifying the appropriate entry in the group database
  • Modify an account
    • add a user to a group
    • usermod -a -G GROUPNAME USERNAME
    • Gives information on all users
    • finger
deluser and delgroup remove users and groups from the system according to command-line options and configuration information in /etc/deluser.conf and /etc/adduser.conf. They are friendlier front ends to the userdel and groupdel programs.

Package Management

Debian/Ubuntu

dpkg

dpkg is a tool to install, build, remove and manage Debian packages. The primary and more user-friendly front-end for dpkg is aptitude.

aptitude

aptitude is a text-based interface to the Debian GNU/Linux package system. It allows the user to view the list of packages and to perform package management tasks such as installing, upgrading, and removing packages. Actions may be performed from a visual interface or from the command-line.

apt-get, apt

apt-get is the command-line tool for handling packages, and may be considered the user's "back-end" to other tools (aptitude, synaptic and wajig) using the APT library. apt provides a high-level command-line interface for the package management system. It is intended as an end-user interface and enables some options better suited for interactive usage by default. apt is less commonly used and differs from apt-get mostly in terms of output formatting.
  • update
  • update is used to resynchronize the package index files from their sources. The indexes of available packages are fetched from the location(s) specified in /etc/apt/sources.list.
  • upgrade
  • upgrade is used to install the newest versions of all packages currently installed on the system from the sources enumerated in /etc/apt/sources.list.
  • install
  • install is followed by one or more packages desired for installation or upgrading.
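A typical refresh-then-install session looks like this (the package name is only an example):

$ sudo apt update          # resynchronize the package index files
$ sudo apt upgrade         # install the newest versions of all installed packages
$ sudo apt install nginx   # install or upgrade a specific package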

Network skill

IP

Ip header

  • big endian
  • Packets are transmitted in the order: bits 0-7, 8-15, 16-23, 24-31. This is called big-endian byte ordering or network byte order. Endianness determines the order in which the bytes of a multi-byte value are placed in memory. For a 4-byte (32-bit) value at address 0x100, the bytes of "0x01234567" are stored as follows (a byte-order probe in C follows this list):
    • big-endian
    • The LSB is stored at the highest address.
      
      0x100: 01 23 45 67
      0x104: 
      
    • little-endian
    • The LSB is stored at the lowest address.
      
      0x100: 67 45 23 01
      0x104: 
      
  • fragmentation
  • The maximum length of an IP datagram is 65535 bytes (the total-length field is 16 bits); link layers with a smaller MTU force fragmentation, and some link layers pad short frames when necessary. The identification field is incremented by 1 each time a datagram is sent, which is useful for the fragmentation and reassembly process. The flags and fragment-offset fields are used for fragmentation.
  • TTL(time-to-live)
  • TTL sets an upper limit on the number of routers through which a datagram can pass. It is initialized by the sender to some value (commonly 32 or 64) and decremented by 1 by every router that handles the datagram. When this field reaches 0, the datagram is thrown away and the sender is notified with an ICMP message.
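As mentioned under the byte-ordering bullet above, a minimal probe in C shows the host's endianness (htonl() converts host byte order to network byte order):

    #include <stdio.h>
    #include <arpa/inet.h>

    int main(void)
    {
        unsigned int x = 0x01234567;
        unsigned char *p = (unsigned char *)&x;
        /* A little-endian host prints "67 45 23 01";
           a big-endian host prints "01 23 45 67". */
        printf("%02x %02x %02x %02x\n", p[0], p[1], p[2], p[3]);
        /* After htonl() the bytes are in network (big-endian) order, so on a
           little-endian host the value prints as 0x67452301. */
        printf("0x%08x\n", htonl(x));
        return 0;
    }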

IP Routing

The IP layer has a routing table in memory that it searches each time it receives a datagram. Each entry in the routing table contains the following information:
  • destination IP
  • IP of the next-hop router
  • flags
  • Specify if the destination IP is for a network or a host.
  • interface
  • Which network interface should be used for transmission
There is an entry labeled "default". If the destination IP doesn't match any other entry, packets are sent to the next-hop router of the default entry.

Subnet


       ┌──────┬───────────────────────────┬─────────────────┬───────────────────┐
       │Class │ Address range             │ Usable networks │ Hosts per network │
       ├──────┼───────────────────────────┼─────────────────┼───────────────────┤
       │A     │ 0.0.0.0~127.0.0.0         │ 126             │ 16,777,214        │
       ├──────┼───────────────────────────┼─────────────────┼───────────────────┤
       │B     │ 128.0.0.0~191.255.0.0     │ 16,383          │ 65,534            │
       ├──────┼───────────────────────────┼─────────────────┼───────────────────┤
       │C     │ 192.0.0.0~223.255.255.0   │ 2,097,152       │ 254               │
       ├──────┼───────────────────────────┼─────────────────┼───────────────────┤
       │D     │ 224.0.0.0~239.255.255.255 │                 │                   │
       ├──────┼───────────────────────────┼─────────────────┼───────────────────┤
       │E     │ 240.0.0.0~255.255.255.255 │                 │                   │
       └──────┴───────────────────────────┴─────────────────┴───────────────────┘
All IPv4 IP addresses can be divided into two major groups: public or private.
Private internal addresses are not routed on the Internet, and no traffic can be sent to them from the Internet; they are only supposed to work within the local network. Private addresses include IP addresses from the following classes:
  • A
  • Range from 10.0.0.0 to 10.255.255.255, 10.0.0.0/255.0.0.0
  • B
  • Range from 172.16.0.0 to 172.31.255.255, 172.16.0.0/255.240.0.0
  • C
  • Range from 192.168.0.0 to 192.168.255.255, 192.168.0.0/255.255.0.0

The host ID portion can be divided into a subnet ID and a host ID. The subnet mask is a 32-bit value containing "1" bits for the network ID and subnet ID, and "0" bits for the host ID. Therefore,
  • the IP address tells you which network class you are using
  • the subnet mask tells you the boundary between subnets and hosts
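Extracting the network part is a bitwise AND of the address and the mask; a minimal sketch (the address and /26 mask are just an illustration):

    #include <stdio.h>
    #include <arpa/inet.h>

    int main(void)
    {
        struct in_addr ip, mask, net;
        inet_aton("192.168.1.130", &ip);         /* example host address */
        inet_aton("255.255.255.192", &mask);     /* a /26 subnet mask */
        net.s_addr = ip.s_addr & mask.s_addr;    /* network = address AND mask */
        printf("network: %s\n", inet_ntoa(net)); /* prints 192.168.1.128 */
        return 0;
    }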

TCP



TCP services

TCP provides reliability:
  • the chunk of bytes that TCP passes to IP is called a segment.
  • TCP uses a timer to wait for an ACK to the segment it sent.
  • If an ACK is not received in time, the segment is re-transmitted.
  • TCP sends an ACK for the data it receives
  • When a new connection is being established, the SYN flag is turned on. The sequence number field contains the initial sequence number chosen by the host initiating the connection. The acknowledgment number contains the next sequence number that the sender of the ACK expects to receive (acknowledgment number = received sequence number + 1). TCP provides a full-duplex service, so each end of a connection must maintain a sequence number in each direction.
  • TCP maintains a checksum on its header and data
  • If a segment arrives with an invalid checksum, TCP discards it and does not ACK it.
  • TCP resequences the received data when necessary
  • IP datagrams can arrive out of order.
  • TCP must discard duplicate received data
  • TCP provides flow control
  • A receiving TCP only allows the other end to send as much data as its receive buffer can hold. Every time TCP receives a packet it needs to send an ACK, and this ACK message includes the 16-bit current receive window size field,
    
       rwnd_size = ReceiveBuffer - (LastByteReceived – LastByteReadByApplication)
    
    so the sender knows whether it can keep sending data. TCP uses a sliding window protocol to control the number of bytes it can send, which makes sure it never has more bytes in flight than the window advertised by the receiver. The sender always keeps this invariant:
    
      LastByteSent - LastByteAcked <= ReceiveWindowAdvertised
    
    In today's networks, this 16-bit window size (max. 65,535) is not enough to provide optimal traffic flow, so TCP options were introduced in RFC 1323 that let the TCP receive window be multiplied by a power-of-two scale factor. The specific function is called TCP window scaling, and it is advertised during the handshake. If one side or the other cannot support scaling, neither will use this function. The scale factor, or multiplier, is only sent in the SYN packets during the handshake and is used for the life of the connection. When the TCP sender receives an ACK with a zero-window message it starts the persist timer: TCP stops transmitting data and periodically sends a small packet to the receiver (usually called a ZeroWindowProbe in the RFCs), so the receiver has a chance to advertise a nonzero window size. The transmitting host SHOULD send the first zero-window probe when a zero window has existed for the retransmission timeout period, and SHOULD increase exponentially the timeout interval between successive probes.
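A quick worked example of the window arithmetic (all numbers are illustrative): with a 65,535-byte receive buffer, 20,000 bytes received so far, and 5,000 bytes already read by the application,

    rwnd_size = 65535 - (20000 - 5000) = 50535

so the receiver advertises a 50,535-byte window and the sender may have at most that many unacknowledged bytes in flight.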

TCP Connection Establishment and Termination


  • establishment: 3-way handshake
    1. the client sends SYN + ISN
    2. the server responds with its own SYN + server's ISN + ACK(client's ISN+1)
    3. the client ACK(server's ISN )
  • termination
  • Either end can send a FIN when it is done sending data. When a TCP receives a FIN, it sends back an ACK (received sequence number + 1). Therefore, termination needs 4 steps.

TCP Timers


TCP maintains the connection state internally and requires timers to keep track of events.
TCP requires several timers to maintain state on the transmit side of the protocol.
In Linux, when timers are initialized, they are given an associated function that is called when the timer goes off. Each timer function for TCP is passed a pointer to the sock structure. The timer uses the sock to know which connection it is dealing with. The timer functions can be found in the file linux/net/tcp_timer.c.
  • tcp_retransmit_timer()
  • TCP uses a time out timer for retransmission of lost segments. This is called when the retransmit timer expires, indicating that an expected acknowledgment was not received.
    • The sender starts a timeout timer after transmitting a TCP segment to the receiver.
      • If the sender receives an ACK before the timer goes off, it stops the timer.
      • If the sender does not receive any acknowledgment and the timer goes off, TCP retransmission occurs:
      • the sender retransmits the same segment and resets the timer.
    • The value of the timeout timer is dynamic and changes with the amount of traffic in the network.
  • tcp_probe_timer()
  • The zero window timer is set when this side of the connection sends out a zero window probe in response to a zero window advertisement from the peer. We arrive at this function because the timer expired before a response was received to the zero window probe.
  • tcp_delack_timer()
  • This is to minimize the number of separate ACKs that are sent. The receiver does not send an ACK as soon as it can. The delayed acknowledgment timer is set to the amount of time to hold the ACK waiting for outgoing data to be ready.
  • tcp_keepalive_timer()
  • TCP uses a keepalive timer to prevent long-idle TCP connections. If a client opens a TCP connection to a server, transfers some data, goes silent, and then crashes, the connection would otherwise remain open forever. TCP normally does not perform any keepalive function; keepalive polling is not part of the TCP specification. It was added outside the specification for the use of some TCP application-layer servers whose protocols don't do any connection polling themselves. For example, the telnet daemon sets the keepalive mode. (A per-socket sketch follows this list.)
    • Each time the server hears from the client, it resets the keepalive timer to 2 hours.
    • If the server does not hear from the client for 2 hours, it sends probe segments to the client.
    • These probe segments are sent at intervals of 75 seconds.
    • If the server receives no response after sending 10 probe segments, it assumes that the client is down.
    • Then the server terminates the connection automatically.
    In addition to providing the keepalive timeout function, it is also used as a SYN acknowledge timer by a socket in a listen state.
  • Time Wait Timer
  • TCP uses a time wait timer during connection termination.
    • Sender starts the time wait timer after sending the ACK for the second FIN segment.
    • It allows the final acknowledgment to be resent if it gets lost.
    • It prevents the just closed port from reopening again quickly to some other application.
    • It ensures that all the segments heading towards the just closed port are discarded.
    • The value of time wait timer is usually set to twice the lifetime of a TCP segment.
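Keepalive is enabled per socket, as noted in the keepalive bullet above. A minimal sketch, assuming Linux (the TCP_KEEP* socket options are Linux-specific), mirroring the 2-hour/75-second/10-probe behavior:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    int enable_keepalive(int fd)
    {
        int on = 1, idle = 7200, intvl = 75, cnt = 10;
        /* Turn on keepalive probing for this socket. */
        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
            return -1;
        /* Idle seconds before the first probe, seconds between probes, and how
           many unanswered probes are needed before the peer is declared dead. */
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
        return 0;
    }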

Firewall

A Packet-Filtering Firewall

A packet-filtering firewall consists of a list of acceptance and denial rules. The lists of rules defining what can come in and what can go out are called chains. A packet is matched against each rule in the list, one by one, until a match is found or the list is exhausted.

Choosing a Default Packet-filtering Policy

If the packet does not match any rule, the default policy for a chain is applied to the packet,
  • ACCEPT
  • means to let the packet through.
  • DROP
  • means to drop the packet on the floor.
  • RETURN
  • means stop traversing this chain and resume at the next rule in the previous (calling) chain.
In a security-conscious system, this policy usually tells the kernel to DROP the packet.

How Packets Traverse The Filters

The kernel starts with three lists of rules (chains) in the `filter' table: INPUT, OUTPUT and FORWARD.

Incoming                                          Outgoing
                                                     
   |                                                  ^
   |                                                  |
   v                                                  |
 ------------                                  --------------
|Pre-Routing |                                 |Post-routing |
 ------------                                  --------------
   |                                                  |
   |                        _____                     |
   |                       /     \                    |
   +---- -->[Routing ]--->|FORWARD|-------+--->-------+
            [Decision]     \_____/        ^
                 |                        |
                 v                       ____
                ___                     /    \
               /   \                   |OUTPUT|
              |INPUT|                   \____/
               \___/                      ^
                 |                        |
                 +----> Local Process ->--+

  • When a packet comes in (say, through the Ethernet card) the kernel first looks at the destination of the packet: this is called `routing'.
  • If it's destined for this box, the packet passes downwards in the diagram, to the INPUT chain. If it passes this, any processes waiting for that packet will receive it.
  • If the kernel does not have forwarding enabled, or it doesn't know how to forward the packet, the packet is dropped.
  • If forwarding is enabled, and the packet is destined for another network interface (if you have another one), then the packet goes rightwards on our diagram to the FORWARD chain. If it is ACCEPTed, it will be sent out.
  • A program running on the box can send network packets. These packets pass through the OUTPUT chain immediately: if it says ACCEPT, then the packet continues out to whatever interface it is destined for.
There are five netfilter hooks that programs can register with. As packets traverse the stack, they trigger the kernel modules registered with these hooks. The hooks represent various well-defined points in the networking stack:
  • NF_IP_PRE_ROUTING
  • This hook will be triggered by any incoming traffic very soon after entering the network stack. This hook is processed before any routing decisions have been made regarding where to send the packet.
  • NF_IP_LOCAL_IN
  • This hook is triggered after an incoming packet has been routed if the packet is destined for the local system.
  • NF_IP_FORWARD
  • This hook is triggered after an incoming packet has been routed if the packet is to be forwarded to another host.
  • NF_IP_LOCAL_OUT
  • This hook is triggered by any locally created outbound traffic as soon as it hits the network stack.
  • NF_IP_POST_ROUTING
  • This hook is triggered by any outgoing or forwarded traffic after routing has taken place and just before being put out on the wire.

Using iptables

iptables and ip6tables are used to set up, maintain, and inspect the tables of IPv4 and IPv6 packet filter rules in the Linux kernel. iptables uses tables to organize its rules; these tables classify rules according to the type of decision they are used to make. The names of the built-in chains mirror the names of the netfilter hooks they are associated with:
  • PREROUTING
  • Triggered by the NF_IP_PRE_ROUTING hook.
  • INPUT
  • Triggered by the NF_IP_LOCAL_IN hook.
  • FORWARD
  • Triggered by the NF_IP_FORWARD hook.
  • OUTPUT
  • Triggered by the NF_IP_LOCAL_OUT hook.
  • POSTROUTING
  • Triggered by the NF_IP_POST_ROUTING hook.
Each table has its chains. The 3 primary tables:
  • filter
  • The default table. The built-in chains:
    • INPUT
    • OUTPUT
    • FORWARD
    This contains the actual filtering rules.
  • nat
    • PREROUTING
    • OUTPUT
    • POSTROUTING
    This contains rules for address/port translation.
  • mangle
    • PREROUTING
    • OUTPUT
    This contains rules for changing packet-routing flags.
A firewall rule specifies criteria for a packet and a target. If the packet does not match, the next rule in the chain is examined; if it does match, then the next rule is specified by the value of the target, which can be
  • the name of a user-defined chain
  • one of the targets described in iptables-extensions(8)
  • one of the special values ACCEPT, DROP or RETURN.
SYNOPSIS:
  • Create a new user-defined chain by the given name.
  • iptables [-t table] -N chain
  • Flush the selected chain (all the chains in the table if none is given).
  • iptables [-t table] -F [chain [rulenum]] [options...]
  • Append one or more rules to the end of the selected chain (-A), check whether a rule exists (-C), or delete it (-D).
  • iptables [-t table] {-A|-C|-D} chain rule-specification
  • Set the default policy for the built-in (non-user-defined) chain to the given target (ACCEPT or DROP).
  • iptables [-t table] -P chain target
  • List all rules in the selected chain.
  • iptables [-t table] -L [chain [rulenum]] [options...] The list command can take additional options:
    • -n
    • List IP addresses and port numbers numerically rather than by name.
    • -v
    • List additional information such as counters.
    • --line-numbers
    • List the rule's position within the chain
    • -x
    • List exact values of counters
  • Insert one or more rules in the selected chain as the given rule number.
  • iptables [-t table] -I chain [rulenum] rule-specification
  • Replace a rule in the selected chain.
  • iptables [-t table] -R chain rulenum rule-specification
  • Delete one or more rules from the selected chain.
  • iptables [-t table] -D chain rulenum
  • Print all rules in the selected chain.
  • iptables [-t table] -S [chain [rulenum]]
  • Delete the optional user-defined chain specified.
  • iptables [-t table] -X [chain]
  • Rename the user-specified chain to the user-supplied name.
  • iptables [-t table] -E old-chain-name new-chain-name
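Putting a few of these together, a minimal stateful-firewall sketch (run as root; allowing only SSH is just an illustration, not a recommended policy; the state match is described below):

iptables -P INPUT DROP                                           # default policy: drop
iptables -A INPUT -i lo -j ACCEPT                                # allow loopback
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT # allow replies
iptables -A INPUT -p tcp --dport 22 --syn -j ACCEPT              # allow new SSH connections
iptables -L -n -v --line-numbers                                 # review the result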

rule-specification = parameter-1 option-1
                     ...
                     parameter-n option-n

A rule specification is composed of pairs of parameters and options that define what happens when a packet matches the rule.
Basic parameters:
  • -i [!] name
  • Name of an interface via which a packet was received (only for packets entering the INPUT, FORWARD and PREROUTING chains).
  • -o [!] name
  • Name of an interface via which a packet is going to be sent (for packets entering the FORWARD, OUTPUT and POSTROUTING chains).
  • -p [!] protocol
  • The protocol of the rule or of the packet to check. The specified protocol can be one of tcp, udp, icmp, or all, or it can be a numeric value, representing one of these protocols or a different one. A protocol name from /etc/protocols is also allowed.
  • -s [!] address[/mask]
  • Source specification. Address can be either a network name, a hostname, a network IP address (with /mask), or a plain IP address. The mask can be either a network mask or a plain number, specifying the number of 1's at the left side of the network mask. Thus, a mask of 24 is equivalent to 255.255.255.0.
  • -d [!] address[/mask]
  • Destination specification.
  • -j target
  • This specifies the target of the rule; i.e., what to do if the packet matches it. The target can be a user-defined chain, one of the special builtin targets, or an extension. If this option is omitted in a rule, then matching the rule will have no effect on the packet's fate, but the counters on the rule will be incremented.
Different network protocols provide specialized matching options which can be configured to match a particular packet using that protocol.
  • options are available for the TCP protocol (-p tcp)
    • --dport [!] port[:port]
    • Matches the destination port of the packet. Use either a network service name (such as www or smtp), a port number, or a range of port numbers.
    • --sport [!] port[:port]
    • Matches the source port of the packet.
    • [!] --syn
    • Only match TCP packets with the SYN bit set and the ACK,RST and FIN bits cleared. Such packets are used to request TCP connection initiation.
    • --tcp-flags [!] mask comp
    • The first argument is the flags which we should examine, written as a comma-separated list, and the second argument is a comma-separated list of flags which must be set. For example, "-p tcp --tcp-flags ACK,FIN,SYN SYN" is equivalent to "--syn".
    • --tcp-option [!] number
    • Matches if the given TCP option is set.
  • options are available for the UDP protocol (-p udp)
    • --dport [!] port[:port]
    • --sport [!] port[:port]
  • ICMP Protocol (-p icmp)
    • --icmp-type [!] typename
    • Matches the given ICMP type, by name or number.
Additional match options are also available through modules loaded by the iptables command.
To use a match option module, load the module by name using the -m option.
  • -m limit
  • Places limits on how many packets are matched to a particular rule. A rule using this extension will match until this limit is reached. It can be used in combination with the LOG target to give limited logging.
    • --limit rate
    • Maximum average matching rate: specified as a number, with an optional '/second', '/minute', '/hour', or '/day' suffix; the default is 3/hour.
    • --limit-burst number
    • Maximum initial number of packets to match: this number gets recharged by one every time the limit specified above is not reached, up to this number; the default is 5.
  • -m state
  • Enables access to the connection tracking state for this packet.
    
     --state state
    
    Where state is a comma separated list of the connection states to match:
    • ESTABLISHED
    • The matching packet is associated with other packets in an established connection.
    • INVALID
    • The matching packet cannot be tied to a known connection.
    • NEW
    • The matching packet is either creating a new connection or is part of a two-way connection not previously seen.
    • RELATED
    • The matching packet is starting a new connection related in some way to an existing connection.
  • -m mac
  • Enables hardware MAC address matching.
    • --mac-source [!] address
    • It must be of the form XX:XX:XX:XX:XX:XX. Note that this only makes sense for packets coming from an Ethernet device and entering the PREROUTING, FORWARD or INPUT chains.
  • -m mark
  • Enables matching on the netfilter mark value associated with a packet.
    • --mark value[/mask]
    • Matches packets with the given unsigned mark value which was set at some earlier point.
  • -m tos
  • This module matches the 8 bits of the Type of Service field in the IP header (i.e., including the precedence bits).
    • --tos value
    • The value can be a string or a numeric value.
Once a packet has matched a particular rule, the rule can direct the packet to a number of different targets that decide its fate and, possibly, take additional actions. Each chain has a default target, which is used if none of the rules on that chain match a packet or if none of the rules which match the packet specify a target.
The following are the standard targets:
  • -j user-defined-chain
  • This target passes the packet to the target user-defined-chain.
  • -j ACCEPT
  • Allows the packet to successfully move on to its destination or another chain.
  • -j DROP
  • Drops the packet without responding to the requester.
  • -j QUEUE
  • The packet is queued for handling by a user-space application.
  • -j RETURN
  • Stops checking the packet against rules in the current chain; the packet is returned to the calling chain to resume rule checking where it left off.
iptables can use extended target modules: the following are included in the standard distribution.
  • -j LOG
  • Logs all packets that match this rule. Since the packets are logged by the kernel, they can be read with dmesg or via syslogd; the /etc/syslog.conf file determines where these log entries are written. By default, they are placed in the /var/log/messages file. This is a "non-terminating target" (the packet is not dropped), i.e. rule traversal continues at the next rule. To specify the way in which logging occurs:
    • --log-level
    • Sets the priority level of a logging event. A list of priority levels can be found within the syslog.conf man page.
    • --log-ip-options
    • Logs any options set in the header of a IP packet.
    • --log-prefix
    • Places a string of up to 29 characters before the log line when it is written. This is useful for writing syslog filters for use in conjunction with packet logging.
    • --log-tcp-options
    • Logs any options set in the header of a TCP packet.
    • --log-tcp-sequence
    • Writes the TCP sequence number for the packet in the log.
  • -j REJECT
  • This is used to send back an error packet in response to the matched packet. The following option controls the nature of the error packet returned:
    
    --reject-with type
    
    where type:
    • icmp-net-unreachable
    • icmp-host-unreachable
    • icmp-port-unreachable
    • icmp-proto-unreachable
    • icmp-net-prohibited
    • icmp-host-prohibited
  • -j SNAT
  • This target is only valid in the nat table, in the POSTROUTING chain. It specifies that the source address of the packet should be modified (and all future packets in this connection will also be mangled), and rules should cease being examined. It takes one type of option:
    • --to-source ipaddr[-ipaddr][:port-port]
    • which can specify a single new source IP address, an inclusive range of IP addresses, and optionally, a port range. The source port is mapped to a free port if not assigned.
  • -j MASQUERADE
  • This target is only valid in the nat table, in the POSTROUTING chain. It should only be used with dynamically assigned IP (dialup) connections. ( if you have a static IP address, you should use the SNAT target.)
  • -j DNAT
  • This target is only valid in the nat table, in the PREROUTING and OUTPUT chains. It specifies that the destination address of the packet should be modified. It takes one type of option:
    • --to-destination ipaddr[-ipaddr][:port-port]
    • which can specify a single new destination IP address, an inclusive range of IP addresses, and optionally, a port range. If no port range is specified, then the destination port will never be modified. This feature is useful when you want to forward connections to internal servers that are not publicly visible.
  • -j REDIRECT
  • This target is only valid in the nat table, in the PREROUTING and OUTPUT chains. It redirects the packet to the machine itself by changing the destination IP to the primary address of the incoming interface. It takes one option:
    • --to-ports port[-port]
    • This specifies a destination port or range of ports to use: without this, the destination port is never altered.
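For instance, two common NAT setups, sketched with assumed names and addresses (eth0 facing the Internet, 192.168.1.10 an internal web server):

# masquerade outbound LAN traffic behind a dynamically assigned public IP
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# forward incoming web traffic to the internal server
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j DNAT --to-destination 192.168.1.10:8080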

DNS: The Domain Name System

DNS is a distributed database that is used by TCP/IP applications to map between hostnames and IP addresses. DNS provides the protocol for clients and servers to communicate with each other. From the application's point of view, access to the DNS is through a name resolver, which contacts one or more name servers to do the mapping. The resolver can be accessed through two library functions:
  • getaddrinfo()
  • gethostbyname() is deprecated.
  • getnameinfo()
  • gethostbyaddr() is deprecated.
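A minimal resolver sketch using getaddrinfo() (the hostname and port are just examples):

    #include <stdio.h>
    #include <string.h>
    #include <netdb.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct addrinfo hints, *res, *p;
        char buf[INET6_ADDRSTRLEN];

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;      /* both IPv4 and IPv6 */
        hints.ai_socktype = SOCK_STREAM;

        if (getaddrinfo("example.org", "80", &hints, &res) != 0)
            return 1;
        for (p = res; p != NULL; p = p->ai_next) {
            void *addr = (p->ai_family == AF_INET)
                ? (void *)&((struct sockaddr_in *)p->ai_addr)->sin_addr
                : (void *)&((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;
            printf("%s\n", inet_ntop(p->ai_family, addr, buf, sizeof(buf)));
        }
        freeaddrinfo(res);
        return 0;
    }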

DNS Basics

The DNS name space is a hierarchical tree, similar to a file system.
  • node
  • Every node has a label (max 63 characters)
  • root
  • The root is a node with a null label.
  • domain name
  • The domain name of any node is the list of labels: starting at the node, walking up to the root, using "." to separate labels.
  • FQDN(fully qualified domain name)
  • A domain name that ends with a "." is called an absolute domain name or FQDN.
For 3-character generic domains, .gov and .mil are used only in the U.S. Many countries form second-level domains beneath their 2-character country code: .com.tw and .edu.tw are used for commercial and educational organizations respectively in Taiwan. One entity maintains the top-level domains and delegates responsibility for specific zones to others. A zone is a subtree of the DNS tree. Once the authority for a zone is delegated, the person responsible for the zone must provide a primary name server for that zone and one or more secondary name servers. The main difference between a primary and a secondary is that the primary loads all the information for the zone from disk files, while a secondary obtains all of its information from the primary at regular intervals. When a name server does not contain the requested information, it must contact another name server. At a minimum, every name server must know how to contact the root name servers, and the root name servers know the name servers for all second-level domains.

DNS message format

DNS configuration

The file /etc/resolv.conf is now used only indirectly; the network manager handles the name-server configuration. To see the DNS servers for an interface:

nmcli device show interfacename | grep IP4.DNS

Linux Server Hacks, Volume Two: System Services

#20 Quick and Easy DHCP Setup

  • Installing a DHCP Server
  • Debian:
    
    apt-get install isc-dhcp-server
    
    Fedora:
    
    yum install dhcp
    
  • Configuring Simple DHCP Services
  • The configuration file for this service is /etc/dhcp/dhcpd.conf:
    
    # option definitions common to all supported networks...
    option domain-name "example.org";
    option domain-name-servers ns1.example.org, ns2.example.org;
    
    default-lease-time 600;
    max-lease-time 7200;
    option domain-name "isc.org";
    option domain-name-servers ns1.isc.org, ns2.isc.org;
    
    # The ddns-updates-style parameter controls whether or not the server will
    # attempt to do a DNS update when a lease is confirmed. We default to the
    # behavior of the version 2 packages ('none', since DHCP v2 didn't
    # have support for DDNS.)
    ddns-update-style none;
    
    # If this DHCP server is the official DHCP server for the local
    # network, the authoritative directive should be uncommented.
    #authoritative;
    
    # Use this to send dhcp log messages to a different log file (you also
    # have to hack syslog.conf to complete the redirection).
    #log-facility local7;
    
    # No service will be given on this subnet, but declaring it helps the 
    # DHCP server to understand the network topology.
    #subnet 10.152.187.0 netmask 255.255.255.0 {
    #}
    
    # This is a very basic subnet declaration.
    
    #subnet 10.254.239.0 netmask 255.255.255.224 {
    #  range 10.254.239.10 10.254.239.20;
    #  option routers rtr-239-0-1.example.org, rtr-239-0-2.example.org;
    #}
    
    # This declaration allows BOOTP clients to get dynamic addresses,
    # which we don't really recommend.
    
    #subnet 10.254.239.32 netmask 255.255.255.224 {
    #  range dynamic-bootp 10.254.239.40 10.254.239.60;
    #  option broadcast-address 10.254.239.31;
    #  option routers rtr-239-32-1.example.org;
    #}
    
    # A slightly different configuration for an internal subnet.
    #subnet 10.5.5.0 netmask 255.255.255.224 {
    #  range 10.5.5.26 10.5.5.30;
    #  option domain-name-servers ns1.internal.example.org;
    #  option domain-name "internal.example.org";
    #  option subnet-mask 255.255.255.224;
    #  option routers 10.5.5.1;
    #  option broadcast-address 10.5.5.31;
    #  default-lease-time 600;
    #  max-lease-time 7200;
    #}
    # Hosts which require special configuration options can be listed in
    # host statements.   If no address is specified, the address will be
    # allocated dynamically (if possible), but the host-specific information
    # will still come from the host declaration.
    
    #host passacaglia {
    #  hardware ethernet 0:0:c0:5d:bd:95;
    #  filename "vmunix.passacaglia";
    #  server-name "toccata.example.com";
    #}
    
    # Fixed IP addresses can also be specified for hosts.   These addresses
    # should not also be listed as being available for dynamic assignment.
    # Hosts for which fixed IP addresses have been specified can boot using
    # BOOTP or DHCP.   Hosts for which no fixed address is specified can only
    # be booted with DHCP, unless there is an address range on the subnet
    # to which a BOOTP client is connected which has the dynamic-bootp flag
    # set.
    #host fantasia {
    #  hardware ethernet 08:00:07:26:c0:a5;
    #  fixed-address fantasia.example.com;
    #}
    
    # You can declare a class of clients and then do address allocation
    # based on that.   The example below shows a case where all clients
    # in a certain class get addresses on the 10.17.224/24 subnet, and all
    # other clients get addresses on the 10.0.29/24 subnet.
    
    #class "foo" {
    #  match if substring (option vendor-class-identifier, 0, 4) = "SUNW";
    #}
    
    #shared-network 224-29 {
    #  subnet 10.17.224.0 netmask 255.255.255.0 {
    #    option routers rtr-224.example.org;
    #  }
    #  subnet 10.0.29.0 netmask 255.255.255.0 {
    #    option routers rtr-29.example.org;
    #  }
    #  pool {
    #    allow members of "foo";
    #    range 10.17.224.10 10.17.224.250;
    #  }
    #  pool {
    #    deny members of "foo";
    #    range 10.0.29.10 10.0.29.230;
    #  }
    #}
    
    
    man dhcpd.conf:
    • DHCP Global configuration
    • The basic configuration that we need in order to run a DHCP server:
      • default-lease-time
      • max-lease-time
      • INTERFACESv4="eth0"
      • Defines which interface the DHCP server should use to serve DHCP requests.
      • authoritative
    • Defining the Subnet
    • Each subnet may have its own router.
      
             subnet 204.254.239.0 netmask 255.255.255.224 {
               ( subnet-specific parameters... )
               range 204.254.239.10 204.254.239.30;
             }
      
      
      Note: do not assign fixed addresses that overlap with the pool you've configured in your subnet statement.
    • Defining a group of hosts
    • 
             group {
               ( group-specific parameters... )
               host zappo.test.isc.org {
                 ( host-specific parameters... )
               }
               host beppo.test.isc.org {
                 ( host-specific parameters... )
               }
               host harpo.test.isc.org {
                 ( host-specific parameters... )
               }
             }
      
  • Manage DHCP services
  • To verify that the service is running:
    
    sudo systemctl status isc-dhcp-server.service
    
    To start the DHCP service,
    
    sudo systemctl start isc-dhcp-server.service
    
    To stop the DHCP service,
    
    sudo systemctl stop isc-dhcp-server.service
    
    To restart the DHCP service,
    
    sudo systemctl restart isc-dhcp-server.service
    

#21 Integrate DHCP and DNS with Dynamic DNS Updates

If the DNS and DHCP servers are not in sync, a DHCP lease that hands out a new IP may cause name resolution problems. There are two solutions: statically assign addresses to your hosts, or use a tool (or script one yourself) to perform DNS updates. In more recent versions of DHCP and BIND, both services support a mechanism for performing dynamic DNS updates (defined in RFC 2136).
  • Generating session key
  • The two services will use a key to communicate with each other. The DHCP server uses this key to sign update requests sent to the DNS server, and the DNS server uses it to verify the signed requests from the DHCP server. BIND 9 comes with a utility to generate this key, called dnssec-keygen. (A combined sketch follows this list.)
  • Configuring the BIND Name Server
  • The next step is to configure BIND to allow updates from the DHCP server, using the key you just generated.
  • Configuring the ISC DHCP Server
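A hedged sketch of how the pieces fit together (the key name, secret, zone, and file names are all placeholders, not a tested configuration). Generate the key:

    dnssec-keygen -a HMAC-MD5 -b 128 -n USER DHCP_UPDATER

named.conf (BIND side):

    key DHCP_UPDATER {
      algorithm hmac-md5;
      secret "base64-secret-from-the-generated-key-file";
    };
    zone "example.org" {
      type master;
      file "example.org.db";
      allow-update { key DHCP_UPDATER; };
    };

dhcpd.conf (DHCP side; note that ddns-update-style changes from the earlier "none"):

    key DHCP_UPDATER {
      algorithm hmac-md5;
      secret "base64-secret-from-the-generated-key-file";
    }
    ddns-update-style interim;
    zone example.org. {
      primary 127.0.0.1;
      key DHCP_UPDATER;
    }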

Linux network troubleshooting tools

Cheat sheet

ip

ip - show / manipulate routing, network devices, interfaces and tunnels.

link layer

Network devices. Show their status:

$ ip link show
1: lo: <LOOPBACK> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <NO-CARRIER> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether 5c:26:0a:13:82:cf brd ff:ff:ff:ff:ff:ff
3: wlp2s0: <BROADCAST> mtu 1500 qdisc mq state UP mode DORMANT group default qlen 1000
    link/ether a4:4e:31:a6:78:64 brd ff:ff:ff:ff:ff:ff
		
Bring the interface up,

$ sudo ip link set eno1 up
		
We can use the -s flag with the ip command to print additional statistics about an interface,

$ ip -s link show
1: lo: <LOOPBACK> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    RX: bytes  packets  errors  dropped overrun mcast   
    6259323    66161    0       0       0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    6259323    66161    0       0       0       0       
2: eno1: <NO-CARRIER> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether 5c:26:0a:13:82:cf brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast   
    0          0        0       0       0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    0          0        0       0       0       0       
3: wlp2s0: <BROADCAST> mtu 1500 qdisc mq state UP mode DORMANT group default qlen 1000
    link/ether a4:4e:31:a6:78:64 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast   
    1373619334 1354333  0       0       0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    229249749  830716   0       0       0       0
		
For more advanced Layer 1 troubleshooting, the ethtool and wireshark utilities are excellent options.

data link layer

If your localhost can’t successfully resolve its gateway’s Layer 2 MAC address, then it won’t be able to send any traffic to remote networks.
We can check the entries in our ARP table with the ip neighbor command:

        $ ip neighbor show
192.168.0.1 dev wlp2s0 lladdr c0:a0:bb:ef:3d:d7 REACHABLE
fe80::c2a0:bbff:feef:3dd7 dev wlp2s0 lladdr c0:a0:bb:ef:3d:d7 router REACHABLE
		
If there were a problem with ARP, we would see a resolution failure here.
Linux caches ARP entries for a period of time; you can manually delete an entry, which forces a new ARP discovery process:

$ ip neighbor delete 192.168.0.1 dev wlp2s0
		

network/internet layer


$ ip -br address show
lo               UNKNOWN        127.0.0.1/8 ::1/128 
eno1             DOWN           
wlp2s0           UP             192.168.0.105/24 2001:b011:5003:14ec:2976:d29d:44a9:5cd6/64 2001:b011:5003:14ec:e136:5b5f:ebb4:4e4e/64 fe80::8205:ef53:5b3b:7756/64 
-br prints only basic information in a tabular format for better readability.
This option is currently only supported by ip addr show and ip link show commands.
The lack of an IP address can be caused by a local misconfiguration, such as an incorrect network interface config file, or it can be caused by problems with DHCP.
ping can be an easy way to tell if a host is alive and responding.
The next tool in the Layer 3 troubleshooting tool belt is the traceroute command.
Traceroute will send out one packet at a time, beginning with a TTL of one. Since the packet expires in transit, the upstream router sends back an ICMP Time-to-Live Exceeded packet.
Traceroute then increments the TTL to determine the next hop.
The list of gateways for different routes is stored in a routing table, which can be inspected and manipulated using ip route commands.
We can print the routing table:

$ ip route show
default via 192.168.0.1 dev wlp2s0 proto dhcp metric 600 
169.254.0.0/16 dev wlp2s0 scope link metric 1000 
192.168.0.0/24 dev wlp2s0 proto kernel scope link src 192.168.0.105 metric 600 
We can check the route for a specific prefix:

$ ip route get 10.0.0.0/8
10.0.0.0 via 192.168.0.1 dev wlp2s0 src 192.168.0.105 uid 1000 
    cache 
To delete a default route:

$ sudo route delete default gw 192.168.1.250 eth0
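The same deletion with the ip command itself (same assumed gateway and interface):

$ sudo ip route del default via 192.168.1.250 dev eth0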
The Domain Name System (DNS) translates human-readable names into IP addresses (and back).
A classic sign of DNS trouble is being able to connect to a remote host by IP address but not by its hostname.
Performing a quick nslookup on the hostname can tell us what happened.

transport layer

To find out which process is listening on a port:
  • netstat
  • 
    $ netstat -tulpn 
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name   
    		
  • ss
  • ss is used to dump socket statistics.
    It allows showing information similar to netstat.
    It can display more TCP and state information than other tools.
    When no option is used, ss displays a list of open non-listening sockets (e.g. TCP/UNIX/UDP) that have established connections.
    
    $ ss -tunlp4
    Netid State  Recv-Q  Send-Q    Local Address:Port    Peer Address:Port
    udp  UNCONN 0       0           224.0.0.251:5353         0.0.0.0:*     users:(("chrome",pid=2410,fd=49))  
    udp  UNCONN 0       0           224.0.0.251:5353         0.0.0.0:*     users:(("chrome",pid=2372,fd=322)) 
    udp  UNCONN 0       0           224.0.0.251:5353         0.0.0.0:*     users:(("chrome",pid=2410,fd=120)) 
    udp  UNCONN 0       0               0.0.0.0:5353         0.0.0.0:*                                        
    udp  UNCONN 0       0         127.0.0.53%lo:53           0.0.0.0:*                                        
    udp  UNCONN 0       0               0.0.0.0:68           0.0.0.0:*                                        
    udp  UNCONN 0       0               0.0.0.0:43241        0.0.0.0:*                                        
    udp  UNCONN 0       0               0.0.0.0:631          0.0.0.0:*                                        
    tcp  LISTEN 0       128           127.0.0.1:5939         0.0.0.0:*                                        
    tcp  LISTEN 0       128       127.0.0.53%lo:53           0.0.0.0:*                                        
    tcp  LISTEN 0       5             127.0.0.1:631          0.0.0.0:*                                        
    tcp  LISTEN 0       100           127.0.0.1:25           0.0.0.0:*                                        
    tcp  LISTEN 0       80            127.0.0.1:3306         0.0.0.0:* 
    		
    flags:
    • -t - Show TCP ports.
    • -u - Show UDP ports.
    • -n - Do not try to resolve hostnames.
    • -l - Show only listening ports.
    • -p - Show the processes that are using a particular socket.
    • -4 - Show only IPv4 sockets.
To test whether a remote server is listening,
  • TCP
  • 
    telnet ip port
    		
  • UDP
  • Install netcat:
    
    sudo apt-get install netcat
    		
    Usage:
    
    nc -u ip port
    		
Use nc to test connections:
  • Test on a TCP port
  • Server:
    
    nc -lvnp 1234
            
    Client:
    
    nc -vn 192.168.0.112 1234
            
  • Test on a UDP port
  • Server:
    
    nc -lvnup 1234
            
    Client:
    
    nc -vnu 192.168.0.112 1234
            
  • Perform TCP port scanning against a remote host
  • 
    nc -vnz -w 1 192.168.0.101 20-25
    		
  • Perform UDP port scanning against a remote host
  • 
    nc -vnzu 192.168.40.146 1-65535
    		
  • Stream a video file from a server so a client can watch it with a video player (e.g., mplayer)
  • Server:
    
    cat sample_video.avi | nc -l 1234
    		
    Client:
    
    nc 192.168.0.101 1234 | mplayer -vo x11 -cache 3000 -
    		
However, a much more powerful tool is nmap, which can scan the TCP and UDP ports a remote host is listening on.

$ nmap -v 192.168.122.1

PORT     STATE SERVICE
22/tcp   open  ssh
53/tcp   open  domain
80/tcp   open  http
443/tcp  open  https
8081/tcp open  blackice-icecap

A tcpdump Tutorial with Examples — 50 Ways to Isolate Traffic

Traffic can be dumped in different ways:
  • on an interface
  • 
    tcpdump -i eth0        
            
  • by ip
  • 
    tcpdump host 1.1.1.1        
            
  • by source and/or destination
  • 
    tcpdump src 1.1.1.1
    tcpdump dst 1.0.0.1
            
  • by network
  • 
    tcpdump net 1.2.3.0/24        
            
  • hex output
  • 
    tcpdump -X icmp        
            
  • a specific port
  • 
    tcpdump port 3389
    tcpdump src port 1025        
            
  • protocol
  • 
    tcpdump icmp        
            
  • ip6 traffic
  • 
    tcpdump ip6        
            
  • port ranges
  • 
    tcpdump portrange 21-23        
            
  • packet size
  • 
    tcpdump less 32
    tcpdump greater 64
    tcpdump <= 128        
            
  • reading / writing captures to a file
  •         
    tcpdump -w capture_file  
    tcpdump -r capture_file      
            
  • More options
    • -t
    • Don't print a timestamp on each dump line (repeat it, e.g. -tttt, for other timestamp formats).
    • -v
    • Verbose output (more v’s gives more output).
  • combine options
  • and, or , not
    
    tcpdump -nnvvS src 10.5.2.3 and dst port 3389
        	

Troubleshooting and Performance Tuning

System hangs, Linux kernel panic analysis, filesystem failures.

Linux Server Hacks, Volume Two: Troubleshooting and Performance

#69 Find Resource Hogs with Standard Commands

The first thing to do when debugging a resource problem is to log into the machine and run the top command:

Tasks: 368 total,   1 running, 240 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.7 us,  1.8 sy,  0.0 ni, 96.0 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem :  3960656 total,   490284 free,  2295076 used,  1175296 buff/cache
KiB Swap:  2097148 total,  1409676 free,   687472 used.  1323712 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                               
20446 root      20   0       0      0      0 I   6.0  0.0   0:03.42 kworker/u16:23-                       
 1780 jerry     20   0  567372  45480  32776 S   3.0  1.1  55:06.71 Xorg                                  
 1894 jerry     20   0 3970804 154548  34504 S   2.3  3.9  32:30.89 gnome-shell                           
 8825 jerry     20   0  753108  29704  18288 S   2.3  0.7   0:04.34 gnome-terminal-                       
 1211 mysql     20   0 1489996   4068      0 S   0.3  0.1   0:44.82 mysqld                                
 1932 jerry     20   0  362680   4988   2868 S   0.3  0.1   0:49.15 ibus-daemon                           
 2372 jerry     20   0 1352908 320352 116632 S   0.3  8.1  47:27.69 chrome                                
 4847 jerry     20   0  806508 128588  73200 S   0.3  3.2   0:29.62 chrome                                
12658 jerry     20   0 1272228 364528 122448 S   0.3  9.2   2:53.93 chrome                                
20092 root      20   0       0      0      0 I   0.3  0.0   0:01.78 kworker/0:0-eve                       
20743 jerry     20   0   51764   4412   3528 R   0.3  0.1   0:00.33 top                                   
    1 root      20   0  225852   6188   3684 S   0.0  0.2   0:36.19 systemd 
          
  • Tasks:
  • This shows the total number of tasks or threads, further classified as running, sleeping, stopped, or zombie. Only one process can run at a time on a single CPU; a "runnable" process is either currently running or on a run queue waiting to run. Most processes are in one of the following two states:
    • A process that is on the CPU (a running process with state R)
    • A process that is off the CPU (a not-running process)
      • A process that is not running appears in one of the following states:
      • Runnable state (R)
      • The scheduler keeps the process in the run queue (the list of ready-to-run processes maintained by the kernel). When the CPU is available, the process enters the Running state.
      • Sleeping state
      • A process enters a sleeping state when it needs a resource that is not currently available. When the resource the process is waiting on becomes available, the process is woken up, and the next time the scheduler gets a chance to schedule it, the scheduler puts the process in either the Running or the Runnable state. There are two types of sleep states:
        • Interruptible sleep state (S)
        • An interruptible sleep means the process is waiting either for a particular time slot or for a particular event to occur.
        • Uninterruptible sleep state (D)
        • An uninterruptible sleep is one that won't handle a signal right away; it wakes only as a result of a waited-upon resource becoming available or after a time-out occurs during that wait. The uninterruptible state is mostly used by device drivers waiting for disk or network I/O.
      • Defunct or Zombie state (Z)
      • Between the time when the process terminates and the time the parent reaps it, the child is in what is referred to as the Zombie state. The reason you cannot kill a Zombie process is that you cannot send it a signal: the process no longer exists.
    A common life cycle of a process is as follows:
    1. Born or forked
    2. Ready to run or runnable
    3. Running in user space or running in kernel space
    4. Blocked, Waiting, Sleeping, in an Interruptible sleep, or in an Uninterruptible sleep
    5. The process is sleeping, but it is present in main memory
    6. The process is sleeping, but it is present in secondary memory storage (swap space on disk)
    7. Terminated or stopped
    8. T = stopped by job control signal
      t = stopped by debugger during trace
      
  • %Cpu(s)
  • When a user initiates a process, the process starts working in user mode. When the kernel starts serving requests from a user-level process, the process enters kernel space. The top command's Cpu line shows the overall percentage of CPU time spent in user mode (us) and in system mode (sy); it shows CPU state percentages based on the interval since the last refresh. The labels are listed below (those for recent kernel versions first):
    • us, user : time running un-niced user processes
    • sy, system : time running kernel processes
    • ni, nice : time running niced user processes
    • id, idle : time spent in the kernel idle handler
    • If id is low, the CPU is working hard and doesn't have much excess capacity.
    • wa, IO-wait : time waiting for I/O completion
    • If wa is high, the CPU is ready to run, but is waiting on I/O access to complete (like fetching rows from a database table stored on the disk).
    • hi : time spent servicing hardware interrupts
    • si : time spent servicing software interrupts
    • st : time stolen from this vm by the hypervisor
    • Steal time is the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor; it is the amount of CPU time the guest virtual machine needs but is not given by the host. The VM kernel gets the steal metric from the hypervisor, but the hypervisor doesn't say which processes it is running; it just says: "I'm busy, and can't allocate any time to you." Large amounts of steal time indicate CPU contention, which can reduce a guest VM's performance. To relieve CPU contention, increase the guest VM's CPU priority or CPU quota, or run fewer guest VMs on the host. This data is supported by Xen and KVM virtual environments. A general rule of thumb: if steal time is greater than 10% for 20 minutes, the VM is likely running slower than it should. When this happens:
      1. Shut down the instance and move it to another physical server
      2. If steal time remains high, increase the CPU resources
      3. If steal time remains high, contact your hosting provider. Your host may be overselling physical servers.

    Test to watch the steal value change:

    1. start KVM
    2. 
      $ virsh start ubuntu18.04
      
    3. Try logging into the VM from the host
    4. 
      $ ssh jerry@192.168.122.145
      
      Run top then monitor the steal value:
      
      %Cpu(s):  0.3 us,  3.0 sy, 96.4 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.3 st
      
        PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                     
       1934 root      39  19  226624 115428  71280 R 99.7  5.7   8:02.91 unattended-upgr  
      
    5. Host status
      %Cpu(s): 26.2 us,  0.8 sy,  0.0 ni, 71.9 id,  0.7 wa,  0.0 hi,  0.3 si,  0.0 st
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                     
       3522 libvirt+  20   0 4802444 1.899g   4200 S 100.3 50.3  12:05.82 qemu-system-x86 
      
  • KiB Mem :
  • Status of physical memory: total, free, used and buff/cache. "used" includes all memory allocated by system processes plus other categories of memory: "buffers" and "cache". The Linux kernel attempts to use otherwise-unused memory to improve performance: disk data is cached in the "page cache", and buff/cache is the size of that cache. Your real used-memory fraction is therefore:
    
           ( used - buff/cache ) / total
    
    
  • KiB Swap:
  • Linux divides its physical RAM (random access memory) into chunks of memory called pages. Swapping is the process whereby a page of memory is copied to a preconfigured space on the hard disk, called swap space, to free up that page of memory. This line is mostly about virtual memory: total, free, used and avail (which is physical memory). The avail number is an estimate of the physical memory available for starting new applications without swapping.
Listed below are top's fields (columns) for available processes.
  • %CPU -- CPU Usage
  • The task's share of the elapsed CPU time since the last screen update. Press Shift+P to sort processes by CPU utilization.
  • %MEM
  • Memory Usage, simply RES divided by total physical memory
  • RES
  • Resident Memory Size (KiB). A subset of the virtual address space (VIRT) representing the non-swapped physical memory a task is currently using. It is also the sum of the RSan, RSfd and RSsh fields.
    • RSlk
    • Resident Locked Memory Size (KiB). A subset of resident memory (RES) which cannot be swapped out.
    • RSan
    • Resident Anonymous Memory Size (KiB). A subset of resident memory (RES) representing private pages not mapped to a file.
    • RSfd
    • Resident File-Backed Memory Size (KiB). A subset of resident memory (RES) representing the implicitly shared pages supporting program images and shared libraries. It also includes explicit file mappings, both private and shared.
    • RSsh
    • Resident Shared Memory Size (KiB). A subset of resident memory (RES) representing the explicitly shared anonymous shm*/mmap pages.
    • SHR
    • Shared Memory Size (KiB). A subset of resident memory (RES) that may be used by other processes. It will include shared anonymous pages and shared file-backed pages. It also includes private pages mapped to files representing program images and shared libraries.
  • SWAP
  • Swapped Size (KiB). The formerly resident portion of a task's address space written to the swap file when physical memory becomes over committed.
  • USED
  • Memory in Use (KiB). This field represents the non-swapped physical memory a task is using (RES) plus the swapped out portion of its address space (SWAP).
  • VIRT
  • Virtual Memory Size (KiB). The total amount of virtual memory used by the task. It includes everything in-use and/or reserved: all code, data and shared libraries plus pages that have been swapped out and pages that have been mapped but not used.
  • COMMAND -- Command Name or Command Line
  • Press 'c' in a running top to toggle between the program name and the full command line (absolute path) of each process.
  • PR
  • The kernel schedules tasks based on their priority, shown in the PR field. In the Linux kernel, priorities range from 0 to 139: 0 to 99 are real-time priorities and 100 to 139 are for user tasks.
  • NI
  • The nice value is a user-space priority hint, while PR is the actual priority used by the Linux kernel. The nice range is -20 to +19: -20 is the highest priority, 0 the default and +19 the lowest. For normal tasks, the relation between the nice value and the displayed priority is:
    PR = 20 + NI
    
    so PR values [0 - 39] map onto kernel priorities 100 to 139. To re-nice a process, press 'r'; top will then ask for the input:
    
    PID to renice [default pid = 20998]
    
    To change the priority of everything owned by user jerry (+19 is the lowest priority renice will set):
    
    renice 19 -u jerry
    
  • Display Specific User Process
  • 
    $ top -u jerry
    
    
  • Highlight Running Process in Top
  • Press 'z' in a running top to toggle color output; running processes are then highlighted (red by default).
  • Set ‘Screen Refresh Interval’ in Top
  • By default the screen refresh interval is 3.0 seconds; press 'd' in a running top to change it.
  • Kill running process in Top
  • You can kill a process by pressing 'k' in a running top and entering the PID of the process.
  • SORTING of task window
  • Shift + command:
    
                    command   sorted-field                  supported
                    A         start time (non-display)      No
                    M         %MEM                          Yes
                    N         PID                           Yes
                    P         %CPU                          Yes
                    T         TIME+                         Yes
    
  • Get a snapshot of Top (a combined batch-mode example follows this list)
  • 
    top -n 1 -b
    
  • Monitor only processes with specified process IDs
  • 
    $ top -p2100
    
    
  • Monitor memory in different mode
  • Press 'm' in a running top to cycle the memory summary lines through their special 4-way toggle (different display modes).
  • Add more fields
  • Press 'f' to open the list of fields, navigate to the field you would like, and press the space bar to select it.
  • Display threads
  • Press 'H' to show all threads:
    
    Threads: 1061 total,   1 running, 1007 sleeping,   0 stopped,   0 zombie
    
  • View the per-core load on a multi-core CPU
  • Press '1' to display the load of each individual core:
    
    %Cpu0  : 16.8 us,  2.3 sy,  0.0 ni, 79.2 id,  0.7 wa,  0.0 hi,  1.0 si,  0.0 st
    %Cpu1  : 12.8 us,  4.0 sy,  0.0 ni, 80.5 id,  1.3 wa,  0.0 hi,  1.3 si,  0.0 st
    %Cpu2  : 16.9 us,  4.7 sy,  0.0 ni, 76.7 id,  1.7 wa,  0.0 hi,  0.0 si,  0.0 st
    %Cpu3  : 18.9 us,  3.0 sy,  0.0 ni, 78.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem :  3960624 total,   447004 free,  2951536 used,   562084 buff/cache
    KiB Swap:  2097148 total,   882012 free,  1215136 used.   672356 avail Mem 
    
    
  • Show which core a process is using
  • In the top screen, press 'f' to enter "Fields Management for window", move the cursor to "P = Last Used Cpu (SMP)" and press the space bar to select it; you can then see which core each thread was last executed on:
    
      PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                P 
    32027 jerry     20   0 1262872 284452  76560 S 11.4  7.2   0:02.23 ThreadPoolForeg        2
    32146 jerry     20   0 1262872 284452  76560 S 10.4  7.2   0:00.86 ThreadPoolForeg        2
    32096 jerry     20   0 1262872 284452  76560 S 10.1  7.2   0:01.06 ThreadPoolForeg        1     
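
Several of the toggles and switches above combine well for scripting. A minimal sketch (assuming procps-ng top and an existing user jerry) that takes three batch-mode samples five seconds apart, sorted by CPU usage, keeping the first screenful:

$ top -b -n 3 -d 5 -u jerry -o %CPU | head -20

The -o switch sets the sort field up front, since interactive keys such as Shift+P are unavailable in batch mode.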
    
    

The next tool is vmstat. vmstat reports information about processes, memory, paging, block IO, traps, disks and cpu activity.


$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 915712 1730676  58020 577300    9   71   217   129 2001 1152 11  3 76 10  0

By default, vmstat produces a single report. You can supply a delay (in seconds) after which the output is updated, and an optional count of reports (see the example below).
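For example, to print five reports two seconds apart (the first line always shows averages since boot):

$ vmstat 2 5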
Analysis:
  • Swap
    • si: Amount of memory swapped in from disk (/s).
    • so: Amount of memory swapped to disk (/s).
  • IO
    • bi: Blocks received from a block device (blocks/s).
    • bo: Blocks sent to a block device (blocks/s).
  • System
    • in: The number of interrupts per second, including the clock.
    • cs: The number of context switches per second.
How do you identify which process is doing heavy IO in Linux?
Virtual memory is a memory management technique used by Linux that combines active RAM and inactive memory on the disk drive (hard disk/SSD) to form a large range of contiguous addresses.
A page fault occurs when a process accesses a page that is mapped into its virtual address space but not loaded in physical memory. The CPU's memory management unit raises an interrupt, and the routine that handles it is normally part of the operating system. If the OS decides the access is valid, it tries to page the relevant data in from the virtual-memory file on disk; if the access is not permitted, the OS usually terminates the process.
A major fault requires disk access; a minor fault does not (the page is already in memory, it just isn't yet mapped for that process).
You can get per-process page-fault statistics from ps or top.
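For example, a short sketch (1234 is a placeholder PID; pidstat comes from the sysstat package):

$ ps -o pid,min_flt,maj_flt,comm -p 1234   # minor/major page-fault counters
$ pidstat -d 2                             # per-process disk I/O, sampled every 2 s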

Disks

You can use pam_limits or the ulimit utility to keep users from going overboard after they log in to the system.
The df -h command shows disk usage/free statistics for all mounted filesystems. To find the identity of the disk hog under a directory (the largest consumers are listed last):

  du -s /home/* | sort -n
iostat - Report CPU statistics and IO statistics for devices and partitions.
When the command "iostat" is run without arguments, it generates a single report covering the time since the system booted; with an interval, each subsequent report covers the period since the previous report.
You can provide two optional parameters to change this (an example follows the parameter list):

  iostat [option] [interval] [count]

  • interval
  • parameter specifies the duration of time in seconds between each report
  • count
  • parameter allows you to specify the number of reports that are generated before iostat exits.
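For example, to print three reports two seconds apart (the first covers the period since boot):

  iostat 2 3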
The iostat command generates two types of reports,
  • the CPU Utilization report
  • This is similar to top.
  • the Device Utilization report
  • The device report provides statistics on a per physical device or partition basis.
    
    $ iostat -d
    Linux 5.0.0-37-generic (jerry-Latitude-E6410)  廿廿年一月四日  _x86_64_ (4 CPU)
    
    Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              12.65       208.85       180.56    3623705    3132796
    
    
    • Device
    • This column gives the device (or partition) name as listed in the /dev directory.
    • tps
    • Indicate the number of transfers per second that were issued to the device.
    • kB_read/s
    • the amount of data read from the device
    • kB_wrtn/s
    • the amount of data written to the device
    • kB_read
    • The total number of kilobytes read.
    • kB_wrtn
    • The total number of kilobytes written.

Bandwidth

The command lsof lists open files, including socket files; in the listing below, -P keeps port numbers numeric and the grep keeps only the IPv4/IPv6 entries.
$ lsof -U -Pi | grep IP
COMMAND     PID  USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME

chrome     2372 jerry   90u  IPv4            6179770      0t0     UDP jerry-Latitude-E6410:59666 
chrome     2372 jerry  205u  IPv4            5864118      0t0     UDP 224.0.0.251:5353 
chrome     2410 jerry   30u  IPv6            5861252      0t0     TCP 2001-b011-5003-14ec-2976-d29d-44a9-5cd6.dynamic-ip6.hinet.net:32836->th-in-xbc.1e100.net:5228 (ESTABLISHED)
chrome     2410 jerry   33u  IPv6            6180782      0t0     UDP 2001-b011-5003-14ec-2976-d29d-44a9-5cd6.dynamic-ip6.hinet.net:37080->tsa03s06-in-x0e.1e100.net:443 
chrome     2410 jerry   42u  IPv4            5865030      0t0     UDP 224.0.0.251:5353 
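
lsof can also filter directly with -i, which helps track bandwidth consumers down to a port or a peer; a couple of illustrative invocations:

$ sudo lsof -i TCP:443            # TCP connections involving port 443
$ sudo lsof -i @192.168.122.145   # connections to or from a given host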

#70 Reduce Restart Times with Journaling Filesystems

Computer systems can only successfully mount and use filesystems if they can be sure that all of the data structures in each filesystem are consistent. "consistency" means that:
  • all of the disk blocks that are actually used in some file or directory are marked as being in use
  • all deleted blocks aren’t linked to anything other than the list of free blocks
  • all directories in the filesystem actually have parent directories
  • ...
Each filesystem has its own version of fsck that understands and “does the right thing” for that particular filesystem. When filesystems are mounted, they are marked as being in use (“dirty”). When a system is shut down normally, all of its on-disk filesystems are marked as consistent (“clean”) when they are unmounted. When the system reboots, filesystems marked clean do not have to be checked before they are mounted, which saves a lot of time in the boot process. A journaling filesystem maintains a special file called a journal that is used to repair any inconsistencies caused by an improper shutdown of a computer. Journaling filesystems write metadata (i.e., data about files and directories) into the journal, which is flushed to disk before each command returns. If a given set of updates has not been fully committed to the filesystem, the system replays the journal to bring the filesystem back to its most recent consistent state. The most commonly used journaling filesystem for Linux has been the third extended filesystem (ext3); the newest is ext4.
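To check whether an existing ext filesystem carries a journal, one option (with /dev/sda1 as a placeholder partition) is:

$ sudo tune2fs -l /dev/sda1 | grep -o has_journal   # prints "has_journal" if the journal feature is enabled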

#71 Optimize Your System with sysctl

The files under /proc/sys/ are often collectively referred to as the sysctl interface, because they can be written to, and changes made to the files are picked up by the running kernel without rebooting. sysctl is also a command that allows administrators to easily configure kernel parameters at runtime. To display all currently available values:
$ sysctl -a
This returns many “key=value”-formatted records. The keys on the left are dotted representations of file paths under /proc/sys. For example, the setting for net.ipv4.ip_forward can be found in /proc/sys/net/ipv4/ip_forward. You can specify the key you want as an argument to sysctl:
  • read
  • $ sysctl net.ipv4.ip_forward
    net.ipv4.ip_forward = 0
    
  • write
  • $ sysctl -w net.ipv4.ip_forward=1
    net.ipv4.ip_forward = 1
    
If you want to make a change permanent, put your custom settings into the /etc/sysctl.conf file, as sketched below.
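For example (a sketch reusing the ip_forward key from above):

$ echo "net.ipv4.ip_forward = 1" | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p    # reload the settings from /etc/sysctl.conf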

#74 Profile Your Systems Using /proc

#!/bin/bash
echo ""
echo "#########BASIC SYSTEM INFORMATION########"
echo HOSTNAME: `cat /proc/sys/kernel/hostname`
echo DOMAIN: `cat /proc/sys/kernel/domainname`
echo KERNEL: `uname -r`
top -b | head -8
  • "top -b" starts top in Batch mode, which could be useful for sending output from top to other programs or to a file.
  • "head -8" prints the first 8 lines of each FILE to standard output.
echo "######## FILESYSTEM INFORMATION #########"
echo ""
echo "SUPPORTED FILESYSTEM TYPES:"
echo ----------------------
echo `cat /proc/filesystems | awk -F'\t' '{print $2}'`
echo ""
echo "MOUNTED FILESYSTEMS:"
echo ----------------------
cat /proc/mounts

#75 Kill Processes the Right Way

First use the ps -ef command to determine the process ID, then simply type:
$ kill <pid>
A "zombie process" is a child process that has "died" but has not yet been "reaped" by its parent (the parent must call wait() to read the child's exit status). Unlike normal processes, the kill command has no effect on a zombie process. When a child process terminates, the kernel keeps some information about it in the process table (including its exit status); the parent needs to read that exit status before the kernel removes the child's entry from the table. When a process is fully dead, all resources associated with it are deallocated so that they can be reused by other processes. The problem caused by zombie processes is that there is only one process table per system, and it has a limited number of unique process identifiers (PIDs); if too many entries accumulate in this table, the system cannot create new processes. To clear zombies, get the parent to reap them, or terminate the parent process so that init adopts and reaps the zombies. By looking at the column labeled PPID (parent process ID) in the output of "ps -ef", you can find the children of a given PID. Alternatively, you can attempt to signal all the processes within the same process group using killall.
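A short sketch for spotting zombies and their parents (<ppid> is a placeholder):

$ ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'   # list zombies; their stat field starts with Z
$ kill <ppid>   # if the parent never calls wait(), terminating it lets init adopt and reap the zombies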

Logfiles and Monitoring

#78 Avoid Catastrophic Disk Failure

ATA and SCSI drives have supported a standard mechanism for disk diagnostics called “Self Monitoring, Analysis, and Reporting Technology” (SMART), aimed at predicting hard drive failures. The smartmontools project (http://smartmontools.sourceforge.net) produces a SMART monitoring daemon called smartd and a command-line utility called smartctl, which can do most things on demand that the daemon does in the background periodically. To find the information of a hard drive:

$ sudo smartctl -i /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.0.0-37-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Black
Device Model:     WDC WD2500BEKT-75A25T0
Serial Number:    WD-WXQ1A80V7620
LU WWN Device Id: 5 0014ee 655d0b10e
Firmware Version: 01.01A01
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Dec  9 18:43:42 2019 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

To ask the drive about its overall health:

$ sudo smartctl -H /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.0.0-37-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

To get all information,

$ sudo smartctl -a /dev/sda
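
smartctl can also run the drive's built-in self-tests on demand; a typical short-test cycle:

$ sudo smartctl -t short /dev/sda     # start a short (~2 minute) self-test in the background
$ sudo smartctl -l selftest /dev/sda  # afterwards, display the self-test log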

#79 Monitor Network Traffic with MRTG

The Multi-Router Traffic Grapher provides a quick visual snapshot of network traffic, making it easy to find and resolve congestion. Each time you run MRTG, you’ll need to specify the location from which you want it to read the config file.

#86 Fine-Tune the syslog Daemon


dmesg is used to examine or control the kernel ring buffer. The default action is to display all messages from the kernel ring buffer.
klogd reads kernel messages either from /proc/kmsg or via the sys_syslog system call, and provides the kernel log data stream to another daemon called syslogd.

A system daemon, syslogd, listens for messages on a Unix domain socket named /dev/log. Based on classification information in the messages and its configuration file (usually /etc/syslog.conf), syslogd routes them in various ways. Some of the popular routings are:
  • Write to the system console
  • Mail to a specific user
  • Write to a log file (e.g. /var/log/messages)
  • Pass to another daemon
  • Discard
syslogd can handle messages from the kernel itself. But the kernel doesn't write to /dev/log; rather, another daemon (sometimes called "klogd") extracts messages from the kernel and passes them on to syslogd as any other process would.

Therefore, both streams of log information are directed to the same log file, /var/log/messages.
dmesg is most useful for capturing boot-time messages emitted before syslogd has started.

rsyslogd is derived from the sysklogd package. Support of both internet and unix domain sockets enables this utility to support both local and remote logging.


USING Security Enhanced Linux (SELinux)


Chapter 1. Getting started with SELinux


Discretionary access control (DAC) is a means of restricting access to objects based on the identity of subjects and/or groups to which they belong.
Mandatory access control (MAC) constrains the ability of a subject (or initiator) to access or generally perform some sort of operation on an object (or target).
In practice,
  • a subject is usually a process or thread;
  • objects are constructs such as files, directories, TCP/UDP ports, shared memory segments, IO devices, etc.
Any operation by any subject on any object is tested by the kernel against the set of authorization rules (aka policy) to determine if the operation is allowed.
For example: May a web server access files in users' home directories?
This enables system administrators to create comprehensive and fine-grained security policies, such as restricting specific applications to only viewing log files.
The main difference between MAC and DAC:
  • With mandatory access control (MAC)
  • The security policy is centrally controlled by a security policy administrator; users cannot override the policy, for example by granting access to files that would otherwise be restricted. These are organization-wide security policies: security administrators define a central policy that is guaranteed (in principle) to be enforced for all users.
  • with discretionary access control (DAC)
  • This allows users to make policy decisions and/or assign security attributes.

SELinux implements MAC.
Every process and system resource has a special security label called an SELinux label/context.

NOTE: Remember that SELinux policy rules are checked after DAC rules. SELinux policy rules are not used if DAC rules deny access first.

1.2. Benefits of running SELinux


SELinux provides the following benefits:
  • All processes and files are labeled.
  • Fine-grained access control.
  • SELinux policy is administratively-defined and enforced system-wide.

1.4. SELinux architecture and packages


Linux Security Modules (LSM) is a framework(interface) that allows the Linux kernel to support a variety of computer security models. LSM is built into the Linux kernel. SELinux and AppArmor are implementations of LSM. Both SELinux and AppArmor provide a set of tools to isolate applications from each other to protect the host system from being compromised.

Only a single LSM is allowed to be operational at a time.

The SELinux subsystem in the kernel is driven by a security policy which is controlled by the administrator and loaded at boot. All security-relevant, kernel-level access operations on the system are intercepted by SELinux and examined in the context of the loaded security policy.

By default, Ubuntu uses AppArmor rather than SELinux; the two are similar in performance, but AppArmor is popular for its simplicity.

AppArmor has to be disabled prior to installing SELinux to avoid any conflicts. Use the following instructions to disable AppArmor:

$ sudo systemctl status apparmor
● apparmor.service - AppArmor initialization
   Loaded: loaded (/lib/systemd/system/apparmor.service; enabled; vendor preset: enabled)
   Active: active (exited) since Sun 2020-01-12 10:17:46 CST; 29min ago
     Docs: man:apparmor(7)
           http://wiki.apparmor.net/
  Process: 477 ExecStart=/etc/init.d/apparmor start (code=exited, status=0/SUCCESS)
 Main PID: 477 (code=exited, status=0/SUCCESS)

 一  12 10:17:54 jerry-Latitude-E6410 apparmor[477]:  * Starting AppArmor profiles
 一  12 10:17:54 jerry-Latitude-E6410 apparmor[477]: Skipping profile in /etc/apparmor.d/disable: usr
 一  12 10:17:54 jerry-Latitude-E6410 apparmor[477]: Skipping profile in /etc/apparmor.d/disable: usr
 一  12 10:17:54 jerry-Latitude-E6410 apparmor[477]:    ...done.
 一  12 10:17:34 jerry-Latitude-E6410 systemd[1]: Starting AppArmor initialization...
 一  12 10:17:46 jerry-Latitude-E6410 systemd[1]: Started AppArmor initialization.

$ sudo systemctl stop apparmor

$ sudo systemctl disable apparmor


Note: SELinux is not well tested on Ubuntu and is not recommended there. Try a Red Hat-based distribution instead.

Because SELinux is implemented in the kernel, applications do not need to be specially written or rewritten to take advantage of it. Of course, if a program specifically checks for the SELinux error codes mentioned later, it may behave more gracefully. If SELinux blocks an action, it reports a standard (or at least conventional) "access denied"-type error to the application. However, many applications do not test the error codes returned by system calls, so they may not print any message explaining the problem, or may print a misleading one.

1.5. SELinux states and modes


SELinux can run in one of three modes:
  • Enforcing mode
  • It enforces the loaded security policy on the entire system.
  • Permissive mode
  • The security policy is loaded and access checks are performed, but operations are not actually denied; SELinux only emits access-denial entries in the logs.
  • Disabled mode
  • The policy is not enforced and persistent objects (such as files) are not labeled. This makes it difficult to enable SELinux in the future.
Use the setenforce utility to change between enforcing and permissive mode and getenforce to get the setting.

# getenforce
Enforcing

# setenforce 0
# getenforce
Permissive

# setenforce 1
# getenforce
Enforcing

To get the status of a system running SELinux:

$ sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   permissive
Mode from config file:          permissive
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Memory protection checking:     actual (secure)
Max kernel policy version:      31


Chapter 2. Changing SELinux states and modes


When enabling SELinux on systems that previously had it disabled, follow this procedure to avoid problems:
  • Enable SELinux in permissive mode.
  • Reboot your system.
  • Check for SELinux denial messages.
  • If there are no denials, switch to enforcing mode. For more information, see Changing to enforcing mode.

2.2.1. Changing to permissive mode


  • Modify /etc/selinux/config
  • SELINUX=permissive
  • Reboot the system

2.2.2. Changing to enforcing mode


  • Modify /etc/selinux/config
  • SELINUX=enforcing
  • Reboot the system

2.3. Disabling SELinux


  • Modify /etc/selinux/config
  • SELINUX=disabled
  • Reboot the system

2.4. Changing SELinux modes at boot time


On boot, you can set several kernel parameters to change the way SELinux runs:
  • enforcing=0
  • Cause the machine to boot in permissive mode. Using permissive mode might be the only option to detect a problem if your file system is too corrupted.
  • selinux=0
  • Cause the kernel to not load any part of the SELinux infrastructure.
  • autorelabel=1
  • Force the system to relabel.

Chapter 3. Configuring SELinux for applications and services with non-standard configurations


3.1 Customizing the SELinux policy for the Apache HTTP server in a non-standard configuration


Prerequisites: the Apache HTTP server is configured to listen on TCP port 3131. Modify /etc/httpd/conf/httpd.conf:

Listen 3131


Procedure:
  • Start the httpd service and check the status
  • 
    $ systemctl start httpd
    $ systemctl status httpd
    ...
       Status: "Running, listening on: port 3131"
    ...
    
    By default, SELinux writes its log to /var/log/audit/audit.log via the Linux Auditing System daemon 'auditd'. SELinux log entries are tagged with the keyword AVC, so programs such as grep can easily filter them from other messages. Check the log:
    
    $ journalctl | grep avc
     一  14 16:02:32 localhost.localdomain audit[2398]: AVC avc:  denied  { name_bind } for  pid=2398 comm="httpd" src=3131 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:unreserved_port_t:s0 tclass=tcp_socket permissive=1
    
    
  • Re-label
  • When enabling SELinux after it has been disabled, the whole filesystem needs to be relabeled. To relabel the entire filesystem automatically after the next reboot, run:
    
    # touch /.autorelabel
    # reboot 
    

semanage — SELinux Policy Management tool
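
For the port-3131 AVC denial shown above, the usual semanage fix is to add the port to the http_port_t type so httpd may bind to it (a sketch; semanage ships with the policycoreutils tools):

$ sudo semanage port -a -t http_port_t -p tcp 3131
$ sudo semanage port -l | grep http_port_t    # verify the new mapping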



Non-uniform memory access (NUMA)


Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing.
Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors).

Introduction


Hyper-Threading


A single physical CPU core with hyper-threading appears as two logical CPUs to an operating system.
Hyper-threading allows the two logical CPU cores to share physical execution resources. This can speed things up somewhat — if one virtual CPU is stalled and waiting, the other virtual CPU can borrow its execution resources.

Multiple Cores


A dual-core CPU has two central processing units, so it appears to the operating system as two CPUs. A CPU with two cores can run two different processes at the same time. Thus a single CPU socket holding one quad-core physical CPU presents the operating system with four CPUs.

Multiple CPUs


Adding additional CPUs to a computer requires a motherboard with multiple CPU sockets and additional hardware to connect those CPU sockets to the RAM and other resources.
Most multi-processor computers are considered Symmetric Multi-Processors(SMP) as each processor is equal and has equal access to all system resources (e.g., memory and I/O busses).
As SMP systems have increased their processor count, the system bus has increasingly become a bottleneck.

CPU Affinity


The ability in Linux to bind one or more processes to one or more processors is called CPU affinity.
The idea is to say “always run this process on processor one” or “run these processes on all processors but processor zero”.
CPU affinity is a scheduler property that "bonds" a process to a given set of CPUs on the system. The Linux scheduler will honor the given CPU affinity and the process will not run on any other CPUs.
taskset is used to set or retrieve the CPU affinity of a running process given its pid, or to launch a new command with a given CPU affinity.
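
A few illustrative taskset invocations (./worker and 1234 are placeholders):

$ taskset -c 0,2 ./worker    # launch ./worker pinned to CPUs 0 and 2
$ taskset -cp 1234           # show the CPU affinity of PID 1234
$ sudo taskset -cp 0-1 1234  # restrict PID 1234 to CPUs 0 and 1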

NUMA


Historically, all memory on AMD64 and Intel 64 systems was equally accessible by all CPUs. Known as Uniform Memory Access (UMA), access times were the same no matter which CPU performed the operation.
Non-Uniform Memory Access (NUMA) refers to multiprocessor systems whose memory is divided into multiple memory nodes. The access time of a memory node depends on the relative locations of the accessing CPU and the accessed node.
Thus, there are multiple physical regions of memory, but all memory is tied together into a single cache-coherent physical address space. The resulting system has the property such that for any given region of physical memory, some processors are closer to it than other processors. Conversely, for any processor, some memory is considered local (i.e., it is close to the processor) and other memory is remote.

To maximize performance on a NUMA platform, Linux must take into account the way the system resources are physically laid out. This includes information such as which CPUs are on which node, which range of physical memory is on each node, and what node an I/O bus is connected to. This type of information describes the topology of the system.


NUMA divides the system into several nodes; each processor and memory bank belongs to some node. When a processor accesses memory on its own node, access is fast; accessing memory on another node requires data transfer between nodes and takes more time.


Linux NUMA Support


To improve memory-access efficiency, the operating system sets its memory allocation policy according to the hardware's NUMA configuration, and provides NUMA-related APIs for querying the system's NUMA topology and changing the allocation policy.

Linux manages memory in zones. How Linux has arranged memory can be determined by looking at /proc/zoneinfo.
On boot-up, Linux will detect the organization of memory and then create zones that map to the NUMA nodes and DMA areas as needed.

NUMA Memory Allocation Policies


How memory is allocated from the nodes in a system is determined by a memory policy.
The most important memory policies are:
  • interleave
  • During boot up, the system default policy will be set to interleave allocations across all nodes with “sufficient” memory, so as not to overload the initial boot node with boot-time allocations. Allocation occurs round-robin. First a page will be allocated from node 0, then from node 1, then again from node 0, etc. Interleaving is used to distribute memory accesses for structures that may be accessed from multiple processors in the system in order to have an even load on the interconnect and the memory of each node.
  • local allocation
  • When the system is “up and running”, when the first userspace process (init daemon) is started, the system default policy will be changed to “local allocation”. The allocation occurs from the memory node local to where the code is currently executing.

Basic Operations On Process Startup


The main tool used to set up the NUMA execution environment for a process is numactl.
numactl controls NUMA policy for processes or shared memory. It is possible to restrict processes to a set of processors, as well as to a set of memory nodes.

The hardware NUMA configuration of a system can be viewed by using

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 3867 MB
node 0 free: 931 MB
node distances:
node   0 
  0:  10 
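
To apply a policy when launching a program, illustrative numactl invocations (./app is a placeholder) are:

$ numactl --cpunodebind=0 --membind=0 ./app   # run on node 0's CPUs, allocate only from node 0
$ numactl --interleave=all ./app              # interleave allocations across all nodes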


numastat displays per-node NUMA hit and miss system statistics from the kernel memory allocator.

$ numastat
                           node0
numa_hit               103812126
numa_miss                      0
numa_foreign                   0
interleave_hit             33350
local_node             103812126
other_node                     0
  • numa_hit
  • is memory successfully allocated on this node as intended.
  • numa_miss
  • is memory allocated on this node despite the process preferring some different node. Each numa_miss has a numa_foreign on another node.
  • numa_foreign
  • is memory intended for this node, but actually allocated on some different node. Each numa_foreign has a numa_miss on another node.
  • interleave_hit
  • is interleaved memory successfully allocated on this node as intended.
  • local_node
  • is memory allocated on this node while a process was running on it.
  • other_node
  • is memory allocated on this node while a process was running on some other node.

The information about a process's NUMA memory policy and allocation can be displayed via /proc/[pid]/numa_maps. For example:

$ sudo cat /proc/2907/numa_maps 
3473fdedd000 default
3473fdede000 default anon=14265 dirty=14265 active=6585 N0=14265 kernelpagesize_kB=4
...
7fb1ad2ca000 default file=/lib/x86_64-linux-gnu/ld-2.27.so anon=1 dirty=1 active=0 N0=1 kernelpagesize_kB=4
7fb1ad2cb000 default anon=1 dirty=1 active=0 N0=1 kernelpagesize_kB=4
7ffc1dbb3000 default stack anon=29 dirty=29 active=20 N0=29 kernelpagesize_kB=4
7ffc1dbe2000 default
7ffc1dbe5000 default

Each line contains information about a memory range used by the process.
  • The first field of each line shows the starting address of the memory range.
  • This field allows a correlation with the contents of the /proc/[pid]/maps file, which contains the end address of the range and other information, such as the access permissions and sharing.
  • The second field shows the memory policy currently in effect for the memory range.
  • anon=[pages]
  • The number of anonymous pages in the range.
  • stack
  • Memory range is used for the stack.
  • file=[filename]
  • The file backing the memory range.
  • dirty=[pages]
  • Number of dirty pages.
  • N[node]=[nr_pages]
  • The number of pages allocated on [node].

Create a Linux Swap File

When a Linux system runs out of RAM, inactive pages are moved from the RAM to the swap space.
Swap space can take the form of either a dedicated swap partition or a swap file.

How to add Swap File

  • Create a file that will be used for swap
  • To create a 1GB swap file
    
    sudo dd if=/dev/zero of=/swapfile bs=1024 count=1048576
    
    This initializes the swap file with zeros. (A faster allocation alternative appears after this list.)
  • Only the root user should be able to write and read the swap file.
  • 
    sudo chmod 600 /swapfile
    
  • Use the mkswap utility to set up the file as Linux swap area
  • 
    sudo mkswap /swapfile
    
  • Enable the swap with the following command
  • 
    sudo swapon /swapfile
    
  • To make the change permanent, add this entry to /etc/fstab
  • 
    /swapfile swap swap defaults 0 0
    
  • To verify that the swap is active
  • 
    sudo swapon --show
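    
    As noted above, a quicker way to allocate the file, plus a final check (fallocate-created swap files are not supported on every filesystem; e.g. btrfs needs special handling):
    
    sudo fallocate -l 1G /swapfile   # alternative to the dd step
    free -h                          # the Swap line should now include the new space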
    

Cpusets on Linux


在個人使用的環境下,想要調效多核心環境的計算效能,Linux核心提供的CPUSET功能會是一個簡單而方便的操作方式。
Cpusets are logical, hierarchical groupings of CPUs and units of memory.
The cpuset facility is primarily a workload manager tool permitting a system administrator to restrict the number of processor and memory resources that a process or set of processes may use.
  • A cpuset defines a list of CPUs and memory nodes.
  • A process contained in a cpuset may only execute on the CPUs in that cpuset and may only allocate memory on the memory nodes in that cpuset.
  • Essentially, cpusets provide CPU and memory containers or “soft partitions” within which you can run sets of related tasks.

Linux 2.6 Kernel Support for Cpusets


  • Each task has a link to a cpuset structure that specifies the CPUs and memory nodes available for its use.
  • The kernel task scheduler is constrained to only schedule a task on the CPUs in that task's cpuset.
  • The kernel memory allocation mechanism is constrained to only allocate physical memory to a task from the memory nodes in that task's cpuset.

Cpuset Facility Capabilities


The cpuset facility allows you and your system service software to do the following:
  • Create and delete named cpusets.
  • Decide which CPUs and memory nodes are available to a cpuset.
  • Attach a task to a particular cpuset.
  • Identify all tasks sharing the same cpuset.
  • Exclude any other cpuset from overlapping a given cpuset, thereby, giving the tasks running in that cpuset exclusive use of those CPUs and memory nodes.
  • Perform bulk operations on all tasks associated with a cpuset, such as varying the resources available to that cpuset or hibernating those tasks in temporary favor of some other job.
  • Perform sub-partitioning of system resources using hierarchical permissions and resource management.

Initializing Cpusets


The kernel, at system boot time, initializes one cpuset, the root cpuset, containing the entire system's CPUs and memory nodes. Subsequent user space operations can create additional cpusets.
Mounting the cpuset virtual file system (VFS) at /dev/cpuset exposes the kernel mechanism to user space.

How to Determine if Cpusets are Installed


Check whether /proc/filesystems contains cpuset:

$ grep cpuset /proc/filesystems
nodev cpuset

If the /dev/cpuset/tasks file is not present on your system, the cpuset file system is not mounted. You can mount it as follows:

$ sudo mkdir /dev/cpuset
$ sudo mount -t cpuset cpuset /dev/cpuset
$ ls /dev/cpuset
cgroup.clone_children  cpuset.effective_cpus  cpuset.memory_pressure          cpuset.sched_load_balance        tasks
cgroup.procs           cpuset.effective_mems  cpuset.memory_pressure_enabled  cpuset.sched_relax_domain_level
cgroup.sane_behavior   cpuset.mem_exclusive   cpuset.memory_spread_page       machine
cpuset.cpu_exclusive   cpuset.mem_hardwall    cpuset.memory_spread_slab       notify_on_release
cpuset.cpus            cpuset.memory_migrate  cpuset.mems                     release_agent


Cpuset File System Directories


Each cpuset is represented by a directory in the cpuset virtual file system.
The state of each cpuset is represented by small text files in the directory for the cpuset. These files may be read and written using traditional shell utilities or using ordinary file access routines from programming languages.
Descriptions of the files in the cpuset directory (a usage sketch follows this list):
  • tasks
  • List of process IDs (PIDs) of tasks in the cpuset. The list is formatted as a series of ASCII decimal numbers, each followed by a newline. A task may be added to a cpuset (removing it from the cpuset previously containing it) by writing its PID to that cpuset's tasks file (with or without a trailing newline.)
  • notify_on_release
  • A flag (0 or 1). If set, the kernel runs the command named in the release_agent file when the cpuset becomes empty (no tasks and no child cpusets).
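
A minimal sketch of driving these files directly, once the cpuset VFS is mounted at /dev/cpuset (myset is a placeholder name; valid CPU and node numbers depend on your hardware):

# create a child cpuset restricted to CPUs 0-1 and memory node 0
sudo mkdir /dev/cpuset/myset
echo 0-1 | sudo tee /dev/cpuset/myset/cpuset.cpus
echo 0   | sudo tee /dev/cpuset/myset/cpuset.mems
# move the current shell (and its future children) into the cpuset
echo $$  | sudo tee /dev/cpuset/myset/tasks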

WiFi Debug


How to hide MAC address?


You are not going to modify the hardware; you are going to change RAM. When the computer starts, the MAC address is loaded into RAM, and we change that already-loaded MAC address.

  • install macchanger
  • 
    $ sudo apt-get install macchanger
    
  • Turn off your network interface
  • 
    # ifconfig eno1 down
    
  • Use macchanger to randomly generate new MAC address and assign it to eno1
  • 
    $ sudo macchanger -r eno1
    Current MAC:   5c:26:xx:xx:xx:xx (Dell Inc.)
    Permanent MAC: 5c:26:xx:xx:xx:xx (Dell Inc.)
    New MAC:       f2:68:fb:e3:61:54 (unknown)
    
    

Wireshark


Wireshark can be used to capture packet directly:

$ sudo apt-get install wireshark
$ sudo wireshark -i wlan0 -w <capture-file>


Wireless modes


Basically, the default "managed" mode of your networking card allows the networking device to receive packets that are sent to its MAC address.

$ sudo iwconfig wlp2s0
wlp2s0    IEEE 802.11  ESSID:"Jerry_DSL-5G"  
          Mode:Managed  Frequency:5.745 GHz ...


You can let the card enter "monitor" mode by typing in these commands

$ sudo ifconfig wlp2s0 down
$ sudo iwconfig wlp2s0 mode monitor
$ sudo ifconfig wlp2s0 up
$ sudo iwconfig wlp2s0


Aircrack-ng


Aircrack-ng is a whole suite of tools for wireless security auditing. It can be used to monitor, test, crack or attack wireless security protocols like WEP, WPA and WPA2. Aircrack-ng is command-line based and is available for Windows, macOS and other Unix-based operating systems.
We’ll only look at some important tools that are used more often in Wireless Security testing.

airodump-ng is used to capture raw 802.11 frames, with the intent of using them with aircrack-ng.

Installation


sudo apt-get install -y aircrack-ng

Usage


  • Kill all processes that might interfere with the wireless card, using airmon-ng
  • 
    $ sudo airmon-ng check kill
    
  • start Monitor mode on wireless card
  • 
    $ sudo airmon-ng start wlp2s0
    
  • Run ‘iwconfig’ again to list wireless details.
  • 
    $ sudo iwconfig
    
    After airmon-ng has started monitor mode on the wireless card, the interface will appear under a different name (e.g. wlan0mon).
  • search nearby Wireless Access Points and their properties.
  • 
    $ sudo airodump-ng wlan0mon
    
    You can narrow down the search using MAC (--bssid) and channel (-c) filters.
  • Start airodump-ng on the target AP
  • 
    $ sudo airodump-ng --channel [channel] --bssid [bssid] --write [file-name] wlan0mon
    
  • Wait for a client to connect to the access point, or deauthenticate a connected client (if any) so that their system will connect back automatically.
  • 
    $ sudo aireplay-ng --deauth [number of deauth packets] -a [bssid] -c [target_client_mac] wlan0mon
    
  • After the devices are disconnected from that access point, they reconnect, and the running airodump-ng will capture the handshake.
  • If the handshake is captured, it will appear at the top of the running airodump-ng display. The handshake is stored in the ‘/tmp/’ directory and contains the encrypted password, which can be brute-forced offline using a dictionary.
  • hack password
  • The handshake file contains the encrypted password, which can be brute-forced offline using a dictionary wordlist. You can download a wordlist from the following websites:
    • http://www.hackreports.com/2013/05/biggest-password-cracking-wordlist-with.html
    • https://crackstation.net/buy-crackstation-wordlist-password-cracking-dictionary.htm
    aircrack-ng tries to crack the key by combining each password in the wordlist with the access point name (ESSID) to compute a Pairwise Master Key (PMK); each PMK is then compared against the handshake file.
    
    $ sudo aircrack-ng -w [dictionary_wordlist] [handshake_filename]
    
  • stop Monitor mode on wireless card
  • 
    $ sudo airmon-ng stop wlan0mon
    

The Linux-PAM configuration file


/etc/pam.conf is made up of a list of rules:

service type control module-path module-arguments

  • service
  • The service names the application the rule applies to. When rules are kept under the /etc/pam.d/ directory rather than in /etc/pam.conf, the service field is omitted and the filename itself serves as the service name:
    
    /etc/pam.d
    ├── chfn
    .
    ├── cron
    .
    ├── login
    .
    ├── passwd
    .
    └── systemd-user
    
    
  • type
  • The type is the management group that the rule corresponds to. Valid values:
    • account
    • auth
    • password
    • session
  • control
  • module-path
  • module-arguments

Run a shell script as another user without password



sudo -H -u otheruser bash -c 'echo "I am $USER, with uid $UID"' 

  • -H
  • The -H (HOME) option requests that the security policy set the HOME environment variable to the home directory of the target user (root by default) as specified by the password database. Depending on the policy, this may be the default behavior.
  • -u
  • The -u (user) option causes sudo to run the specified command as a user other than root. To specify a uid instead of a user name, use #uid. When running commands as a uid, many shells require that the '#' be escaped with a backslash ('\'). Security policies may restrict uids to those listed in the password database. The sudoers policy allows uids that are not in the password database as long as the targetpw option is not set. Other security policies may not support this.

You can modify the /etc/pam.d/su file to allow su without a password.
With the following modification in /etc/pam.d/su, any user that is a member of group somegroup can su to otheruser without a password:

auth       sufficient pam_rootok.so
auth       [success=ignore default=1] pam_succeed_if.so user = otheruser
auth       sufficient   pam_succeed_if.so use_uid user ingroup somegroup
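
Reading the three rules in order: pam_rootok.so lets root through without a password; the [success=ignore default=1] line continues to the next rule only when the target user is otheruser, and otherwise skips one rule; the last line then succeeds without a password when the invoking user (use_uid) belongs to somegroup. Assuming the names above:

# run by a member of somegroup:
$ su - otheruser    # drops into otheruser's shell with no password prompt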


ip COMMAND CHEAT SHEET

Discussions of Networking Issues

Unable to use ping as regular user


Description

On a Linux system, when a regular user runs the ping command it returns the error "ping: icmp open socket: Operation not permitted", while the root user can use the command normally.

$ ip a
5: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0d:48:5f:45:73 brd ff:ff:ff:ff:ff:ff
    inet 192.168.168.170/24 brd 192.168.168.255 scope global dynamic noprefixroute enp4s0
       valid_lft 85616sec preferred_lft 85616sec
    inet6 2001:b011:5003:19dc:5b27:dfe9:ba8d:e6c/64 scope global dynamic noprefixroute 
       valid_lft 85618sec preferred_lft 85618sec
    inet6 fe80::7623:d501:113b:8e77/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
       
$ ping 192.168.168.1 -c 2 -w 3 -I enp4s0
ping: SO_BINDTODEVICE enp4s0: Operation not permitted

$ sudo ping 192.168.168.1 -c 2 -w 3 -I enp4s0
PING 192.168.168.1 (192.168.168.1) from 192.168.168.170 enp4s0: 56(84) bytes of data.
64 bytes from 192.168.168.1: icmp_seq=1 ttl=64 time=0.464 ms
64 bytes from 192.168.168.1: icmp_seq=2 ttl=64 time=0.499 ms

Solution


$ sudo chmod u+s /usr/bin/ping
$ ping 192.168.168.1 -c 2 -w 3 -I enp4s0
PING 192.168.168.1 (192.168.168.1) from 192.168.168.170 enp4s0: 56(84) bytes of data.
64 bytes from 192.168.168.1: icmp_seq=1 ttl=64 time=0.497 ms

Analysis

ping uses the ICMP protocol and needs to send ICMP packets, but only root may create raw ICMP sockets. Normally the ping binary has permissions -rwsr-xr-x, i.e. a file with the setuid bit set; once that bit has been stripped, regular users can no longer use the command.

An alternative to the setuid bit is the ping_group_range sysctl (see the Fedora change proposal "Changes/Enable SysctlPingGroupRange").
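A sketch of that approach (modern iputils ping can use unprivileged ICMP Echo sockets when the caller's group ID falls within this range):

$ sudo sysctl -w net.ipv4.ping_group_range="0 2147483647"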

Set Up the NFS server

server:

  • Install the NFS server
  • 
    $ sudo apt install nfs-kernel-server
    		
  • NFS server's configuration
  • /etc/exports:
    
    /home/jerrlee/work2 10.19.108.147/8(rw,sync,no_subtree_check,fsid=0,insecure_locks,insecure,no_root_squash)
    		
  • Modify folder's ownership
  • 
    $ sudo chown jerrlee /home/jerrlee/work2
    		
  • start the NFS server
  • 
    $ sudo systemctl restart nfs-kernel-server
    $ sudo service nfs-server status
    		
  • Export server's directories
  • 
    $ sudo exportfs -a
    		
  • Setup firewall
  • 
       sudo ufw allow  nfs
       sudo ufw allow sunrpc
       sudo ufw allow 111
       sudo ufw allow from 10.19.108.147/8  
       sudo ufw status
    		
client:
  • Install the client
  • 
       sudo apt install nfs-common      # Debian/Ubuntu (the equivalent package on RHEL/Fedora is nfs-utils)
    		
  • Check NFS shares exported by the NFS server
  • 
    $ showmount -e build-server
    Export list for build-server:
    /home/jerrlee/work2 10.19.108.147/8
    		
  • setup the firewall
  • 
       sudo ufw allow ssh
       sudo ufw allow nfs
       sudo iptables --flush
       sudo ufw allow from 10.19.108.147/8
    		
  • test the connections with the server
  • 
       nc -v -u 10.19.108.147 111
       nc -v  10.19.108.147 111
       nc -v 10.19.108.147 2049
       nc -v -u 10.19.108.147 2049
       rpcinfo -p 10.19.108.147
    		
  • Mount the NFS share
  • 
       sudo mount -t nfs -o vers=3 10.19.108.147:/home/jerrlee/work2 /home/jerry/work2
    		
  • Or add an entry to /etc/fstab so the share mounts at boot
  • 
       10.19.108.147:/home/jerrlee/work2 /home/jerry/work2 nfs vers=3 0 0
    		
