May 12, 2022

ARM Embedded System

Embedded Systems with ARM Cortex-M Microcontrollers in Assembly Language and C (Third Edition)

Preface

The book introduces basic programming of ARM Cortex-M cores in assembly and C at the register level, and the fundamentals of embedded system design.
It presents basic concepts such as data representations (integer, fixed-point, floating-point), assembly instructions, stack, and implementing basic controls and functions of C language at the assembly level.
It covers advanced topics such as interrupts, mixing C and assembly, direct memory access (DMA), system timer (SysTick), multi-tasking, SIMD instructions for digital signal processing (DSP), and instruction encoding/decoding.
The book also gives detailed examples of interfacing peripherals, such as general purpose I/O (GPIO), LCD driver, keypad interaction, stepper motor control, PWM output, timer input capture, DAC, ADC, real-time clock (RTC), and serial communication (USART, I2C, SPI, and USB).

1. See a Program Running

This chapter shows how a program is generated and executed.

1.1 Translate a C Program into a Machine Program

Compilers first perform some analysis on the source program, and then create an intermediate representation (IR).
For C programs, the intermediate program is similar to an assembly program.
Finally, the compilers translate the assembly program into a machine program (binary executable).

The binary machine program follows a standard called executable and linkable format (ELF), which most ARM-based systems support.
ELF defines 2 interfaces:

a linkable interface
an executable interface

The executable interface provides 2 separate logical views:

load view
execution view

a text segment
a read-only data segment
a read-write data segment
a zero-initialized data segment

1.2 Load a Machine Program into Memory

1.2.1 Harvard Architecture and Von Neumann Architecture

There are 2 types of architecture for memory access:

Because the data and instruction memory are small enough to fit in the same 32-bit memory address space, they often share the same memory address bus.
For ex., 256 KB data memory and 4 KB instruction memory can share the address bus.

1.2.2 Creating Runtime Memory Image

ARM Cortex-M3/M4/M7 processors use the Harvard architecture; the instruction memory (flash) and data memory (SRAM) are built into the processor chip.

A simple example shows how the Harvard architecture loads a program to start the execution.

When the processor boots successfully, the 1st instruction of the program is loaded from the instruction memory into the processor, and the program starts to run.

The memory map is pre-defined by the chip manufacturer and is usually not programmable.
For ex., an example memory map of the 4 GB memory space:

The processor allocates memory addresses for each internal or external peripheral.
A peripheral has a set of registers and may contain a small memory. The processor maps the registers and memory of all peripherals to the same memory address space.
To interface with a peripheral, the processor uses regular memory access instructions to read or write the pre-defined addresses of this peripheral.
This method is called memory-mapped IO.

1.3 Registers

All registers are of the same size and typically hold 16, 32, or 64 bits.
A processor core has 2 types of registers: general purpose and special purpose registers.

1.3.1 Reusing Registers to Improve Performance

Some data items are accessed more frequently.
Therefore, most compilers try to place the values of frequently or recently accessed data variables and memory addresses in registers for performance optimization.
Processor architecture design may use caching and prefetching to speed up the performance.

The number of registers on a processor is often small:

registers always exhibit the highest temperature
instruction's length to encode registers

2. Data Representation

3. ARM Instruction Set Architecture

4. Arithmetic and Logic

5. Load and Store

6. Branch and Conditional Execution

7. Structured Programming

8. Subroutines

9. 64-bit Data Processing

10. Mixing C and Assembly

11. Interrupt

12. Fixed-point and Floating-point Arithmetic

13. Instruction Encoding and Decoding

14. General-purpose I/O

15. General-purpose Timers

16. Stepper Motor Control

17. Liquid-crystal Display (LCD)

18. Real-time Clock (RTC)

19. Direct Memory Access (DMA)

20. Analog-to-Digital Converter (ADC)

21. Digital-to-Analog Converter (DAC)

22. Serial Communication Protocols

23. Multitasking

24. Digital Signal Processing

Appendix A: GNU Compiler

Short Lectures

1. Why use Two's Complement?

2. Carry flag for unsigned addition and subtraction

3. Overflow flag for signed addition and subtraction

4. C Pointer

5. Memory-mapped I/O

This short video explains what memory-mapped I/O is.
Usually, each on-chip peripheral device has a few registers, such as control registers, status registers, data input registers, and data output registers.
In general, there are 2 approaches to exchange data between the processor core and a peripheral device:

Port-mapped I/O

out

Memory-mapped I/O

I/O devices

a memory address may refer to either a portion of physical RAM, or instead to memory and registers of the I/O device


    LDR/STR Reg, [Reg, #imm]

Therefore, memory-mapped I/O is a more convenient way to interface I/O devices.

Here is an example of memory mapped I/O.

Suppose we want to set the output of a GPIO pin to high. Software can use the store instruction STR to set the corresponding bit in the GPIO data output register to 1.
When you write to this special memory location 0x48000014, the data you write is sent to the corresponding I/O device.

The memory address of ARM Cortex-M has a total of 32 bits, supporting 4GB of memory space.
The memory space is divided into six different pre-defined regions.

Each region has a recommended usage.

The 1st region is code region

program

on-chip

The 2nd region is SRAM

heaps and stacks

The 3rd region is peripheral

Advanced High Performance Bus

Advanced Peripheral Bus

on-chip peripherals

The 4th region is External RAM
The 5th region is for External Device

The 6th region is system region

We will use GPIO on STM32L4 as an example to illustrate the concept of memory-mapped I/O.
For ex., on STM32L4, the registers of GPIO Port A are mapped to a small memory region starting at 0x48000000.
Let's take a closer look at the memory map for GPIO Port A.

Each port has 12 registers, and each register has 4 bytes.
While a total of 1 KB of space is reserved for Port A, only 48 bytes are used.
Within this 48-byte memory region, the GPIO mode register (MODER) is mapped to the lowest memory address, and the GPIO analog switch control register (ASCR) is mapped to the highest memory address.
If we want to set the output of pin#14 of the GPIO port A to high, we need to set bit 14 of the output data register(ODR) of GPIO port A to 1.

The output data register (ODR) of Port A on STM32L4 is mapped to the memory addresses from 0x48000014 to 0x48000017.
If little endian is used, the highest memory address holds the most significant 8 bits, and the lowest memory address holds the least significant 8 bits.
This can be set using the following C statement:

A sequence of load, modify, and store operations is performed in the above C statement:

this statement casts the memory address to a memory pointer, which points to a 32-bit unsigned integer
the dereference operator retrieves the ODR register value as a 32-bit integer
a bit-wise operation is performed to modify this unsigned integer value
the updated value is stored back to the ODR register via the dereferencing
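The C statement itself is not reproduced in these notes. A minimal sketch of what such a statement could look like is given below. The helper name `gpio_set_pin_high` is hypothetical, and the register address is passed in as a parameter (an assumption made so the same code can also be tried on a PC against an ordinary variable).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: set one bit of a memory-mapped 32-bit register.
 * On STM32L4 the Port A ODR address is 0x48000014 (from the text above);
 * on a PC, pass the address of an ordinary variable instead. */
static void gpio_set_pin_high(uintptr_t odr_address, unsigned pin)
{
    assert(pin < 16);  /* a GPIO port has 16 pins */

    /* Cast the address to a pointer to a volatile 32-bit unsigned integer,
     * dereference it (load), OR in the bit (modify), and write it back (store). */
    *((volatile uint32_t *)odr_address) |= (UINT32_C(1) << pin);
}
```

On the target, the call would presumably be `gpio_set_pin_high(0x48000014u, 14);` to drive pin 14 of port A high.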

This memory block of PORT can be represented by using a C struct,

Note that we put the volatile qualifier on each register.
When a variable is declared as volatile, the compiler is informed that even though no statements in the program appear to change it, the value might still change.
Typically, compilers minimize the number of memory accesses by storing the memory value in a register, and then repeatedly using it without accessing the memory.
The volatile qualifier on a variable prevents the compiler from making such an optimization on this variable.
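The struct itself is missing from these notes. The sketch below shows one way the 12 registers could be laid out. The type name `GPIO_Regs` and the helper are hypothetical (vendor headers use their own names), and the field order is written from memory of the STM32L4 register map, so it should be checked against the reference manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of one GPIO port: 12 registers x 4 bytes = 48 bytes. */
typedef struct {
    volatile uint32_t MODER;    /* 0x00: mode register (lowest address)      */
    volatile uint32_t OTYPER;   /* 0x04: output type register                */
    volatile uint32_t OSPEEDR;  /* 0x08: output speed register               */
    volatile uint32_t PUPDR;    /* 0x0C: pull-up/pull-down register          */
    volatile uint32_t IDR;      /* 0x10: input data register                 */
    volatile uint32_t ODR;      /* 0x14: output data register                */
    volatile uint32_t BSRR;     /* 0x18: bit set/reset register              */
    volatile uint32_t LCKR;     /* 0x1C: configuration lock register         */
    volatile uint32_t AFR[2];   /* 0x20-0x24: alternate function registers   */
    volatile uint32_t BRR;      /* 0x28: bit reset register                  */
    volatile uint32_t ASCR;     /* 0x2C: analog switch control (highest)     */
} GPIO_Regs;

/* Compile-time check that the layout matches the 48 bytes stated in the text. */
static_assert(sizeof(GPIO_Regs) == 48, "GPIO register block must be 48 bytes");

/* Hypothetical helper: drive one pin high through the struct view. */
static void gpio_port_set_pin(GPIO_Regs *port, unsigned pin)
{
    assert(pin < 16);
    port->ODR |= (UINT32_C(1) << pin);
}
```

On the target, the port pointer would be the base address cast to the struct type, e.g. `(GPIO_Regs *)0x48000000` for Port A; on a PC an ordinary `GPIO_Regs` variable can stand in for the hardware.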

6. GPIO Output: Lighting up an LED

7. GPIO Input: Interfacing joystick

8. LCD Driver

9. Interrupts

This short video explains how interrupts work on ARM Cortex-M microprocessors. Using the STM32L4 Discovery kit as an ex., there are 2 LEDs and a joystick with 4 push buttons.

Suppose we want to develop software that turns on the red LED when a button is pressed. There are 2 ways to monitor the logic state on an input pin which is attached to a push button:

polling
interrupt

In the memory address space of ARM Cortex-M, there is a SRAM region.
If the memory address is 32 bits, it can support 4GB of memory space.
The memory space is divided into 6 pre-defined regions and each region has a suggested usage:

The internal SRAM is divided into several segments.

Initialized data

Zero-initialized data
Heap
stack

The stack and the heap are located at opposite ends of the free memory region.
They grow in opposite directions.
When the stack meets the heap, free memory space is exhausted.

While the code space can have as large as half a GB in the address space, much of this space is reserved.
For ex., the STM32L4 chip has only 1 MB of on-chip flash memory, which starts at 0x08000000 and ends at 0x080FFFFF.
In addition, a small flash memory region starting at 0x08000000 is mapped to the lowest memory region starting at the address 0.

This mapping region includes the initial value for the main stack pointer, and the interrupt vector table.
The Nested Vectored Interrupt Controller (NVIC) prioritizes and handles all interrupts.

When we press the push button connected to the pin PA3, the HW generates an electrical signal, called interrupt request, EXTI3.
When NVIC receives the interrupt request, it forces the processor to jump to and execute a special piece of code, called an interrupt service routine or an interrupt handler.
The entry points of all interrupt service routines are stored in a special table, called an interrupt vector table.
The interrupt vector table is stored at a pre-defined area in the memory.
For ARM Cortex-M processors, the interrupt vector table starts at the memory address 0x00000004; the word at address 0 holds the initial value of the main stack pointer.
By default, the interrupt vector table is mapped to the lowest address of the internal flash memory.
However, software can re-map it to a different location, such as internal SRAM.

The interrupt vector table holds an array of memory addresses.
Each address is the starting address of the interrupt service routine.
The interrupt number is used to index the interrupt table.

The reset handler entry contains the function pointer which is called when the processor is reset.
When the processor is in reset, the program counter is initialized to this address value.
Typically, the reset handler performs some HW initialization, then calls the main function.
When an interrupt arrives, the interrupt controller reads the address of the interrupt handler, which is stored in the interrupt vector table (IVT), and then sets the program counter to that value.
This forces the processor to jump to the ISR.
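The table lookup described above can be modeled on a PC as an array of function pointers. This is only a sketch: the table size, the index values, and all names (`vector_table`, `dispatch`, the `demo_` handlers) are hypothetical, and on real hardware the lookup is done by the processor, not by a C function.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*handler_t)(void);

static int led_on = 0;  /* stand-in for some hardware state */

static void demo_reset_handler(void) { led_on = 0; }
static void demo_exti3_handler(void) { led_on = 1; }

/* Model of a vector table: an array of handler entry points.
 * Unused entries are NULL here; the indices are made up for the demo. */
#define TABLE_SIZE 4
static handler_t vector_table[TABLE_SIZE] = {
    demo_reset_handler, NULL, NULL, demo_exti3_handler
};

/* Model of what happens on an interrupt: use the number as an index,
 * fetch the handler address, and jump to it.
 * Returns 0 on success, -1 if no handler is installed. */
static int dispatch(unsigned index)
{
    if (index >= TABLE_SIZE || vector_table[index] == NULL) {
        return -1;
    }
    vector_table[index]();  /* hardware loads the PC with this address */
    return 0;
}
```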

Before jumping to the ISR, the interrupt controller performs stacking to preserve the program's state.
Note that ARM uses a descending stack: if a 32-bit item is pushed onto the stack, the SP (stack pointer) is decremented by 4.
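The stack-pointer arithmetic can be illustrated with a small model that runs on a PC. It assumes the usual Cortex-M convention in which SP is decremented first and then points at the last pushed item; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 8

/* Model of a small stack region; sp is a byte offset into it. */
typedef struct {
    uint32_t memory[STACK_WORDS];
    uint32_t sp;
} demo_stack_t;

/* An empty descending stack starts at the highest address. */
static void demo_stack_init(demo_stack_t *s)
{
    s->sp = STACK_WORDS * 4;
}

/* Push: decrement SP by 4, then store the 32-bit item at the new SP. */
static void demo_push(demo_stack_t *s, uint32_t value)
{
    assert(s->sp >= 4);               /* stack overflow check */
    s->sp -= 4;
    s->memory[s->sp / 4] = value;
}

/* Pop: load the item at SP, then increment SP by 4. */
static uint32_t demo_pop(demo_stack_t *s)
{
    assert(s->sp <= (STACK_WORDS - 1) * 4);  /* stack must not be empty */
    uint32_t value = s->memory[s->sp / 4];
    s->sp += 4;
    return value;
}
```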

The ISR completes its execution by executing this instruction:


BX LR

The above instruction informs the interrupt controller to perform an unstacking process.

10. Interrupt Enable and Interrupt Priority

A Cortex-M microcontroller supports up to 256 interrupts.

Each interrupt, except the interrupt reset, has an interrupt number.
The first 16 interrupts are system interrupts, also called system exceptions.

negative values

The remaining 240 interrupts are peripheral interrupts, also called non-system exceptions.

Several CMSIS functions take the interrupt number as an input parameter:


NVIC_DisableIRQ(IRQn);            // disable interrupt
NVIC_EnableIRQ(IRQn);             // enable interrupt
NVIC_ClearPendingIRQ(IRQn);       // clear pending status
NVIC_SetPriority(IRQn, priority); // set priority level

When an interrupt is serviced, the current interrupt or exception number is recorded in the program status register(PSR).
The value recorded in the PSR is different from the interrupt number used in CMSIS.
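The relation between the two numbering schemes is not spelled out in these notes. As far as I know, on Cortex-M the exception number recorded in the PSR equals the CMSIS interrupt number plus 16, and each vector table entry is 4 bytes wide. The sketch below encodes that assumption; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed relation: exception number = CMSIS interrupt number + 16.
 * System exceptions therefore have negative CMSIS numbers
 * (e.g. -1 for SysTick), peripheral interrupts start at 0. */
static unsigned exception_number_from_irqn(int irqn)
{
    assert(irqn >= -14 && irqn <= 239);
    return (unsigned)(irqn + 16);
}

/* Byte offset of the handler address inside the vector table,
 * assuming one 32-bit entry per exception number. */
static uint32_t vector_offset_from_irqn(int irqn)
{
    return 4u * exception_number_from_irqn(irqn);
}
```

For ex., with an interrupt number of 9 (the value I believe EXTI3 has on STM32L4), the exception number is 25 and the handler address sits at offset 0x64.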

In this tutorial, when we say interrupt number, we mean the interrupt number defined for CMSIS.
This is the interrupt number definition for STM32L4 Cortex-M4 microprocessors; it is always defined in a header file:

Enabling a system exception is different from enabling a peripheral interrupt.
There are no enable/disable registers for system exceptions:

Some system exceptions, such as reset and hard fault, cannot be disabled. They are always enabled.
The other system exceptions can be enabled or disabled by the corresponding modules, such as the system timer.

On the other hand, enabling and disabling peripheral interrupts is implemented by modifying 2 sets of registers: ISER (interrupt set-enable registers) and ICER (interrupt clear-enable registers).
We can enable a peripheral interrupt by writing 1 to the corresponding bit of the ISER register.
For ex., to enable interrupt Timer 7:

the interrupt number of Timer 7 is 44 for STM32L1
we need to set bit 12 of ISER1 to 1

Similarly, we can disable interrupt Timer 7 by writing 1 to the same bit of ICER1.
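The register index and bit position follow from the interrupt number: 44 = 1 * 32 + 12, hence ISER1 bit 12. The sketch below models this on a PC. The `demo_nvic_t` type and the `demo_` functions are hypothetical stand-ins for the hardware, under the assumption that writing 1 to an ISER bit enables an interrupt, writing 1 to an ICER bit disables it, and writing 0 has no effect.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the NVIC enable state: 8 words x 32 bits. */
typedef struct {
    uint32_t enabled[8];
} demo_nvic_t;

/* Which 32-bit register holds this interrupt, and which bit in it. */
static unsigned nvic_word_index(unsigned irqn) { return irqn >> 5; }
static uint32_t nvic_bit_mask(unsigned irqn)   { return UINT32_C(1) << (irqn & 31u); }

/* Model of a write to ISERn: every 1 bit enables that interrupt. */
static void demo_write_iser(demo_nvic_t *n, unsigned index, uint32_t value)
{
    n->enabled[index] |= value;
}

/* Model of a write to ICERn: every 1 bit disables that interrupt. */
static void demo_write_icer(demo_nvic_t *n, unsigned index, uint32_t value)
{
    n->enabled[index] &= ~value;
}

static void demo_enable_irq(demo_nvic_t *n, unsigned irqn)
{
    assert(irqn < 240);
    demo_write_iser(n, nvic_word_index(irqn), nvic_bit_mask(irqn));
}

static void demo_disable_irq(demo_nvic_t *n, unsigned irqn)
{
    assert(irqn < 240);
    demo_write_icer(n, nvic_word_index(irqn), nvic_bit_mask(irqn));
}
```

Because only the written 1 bits take effect, no read-modify-write is needed on the real registers; a plain store of the bit mask is enough.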

What should the processor do if multiple interrupts arrive at the same time?
ARM processors allow software to set priority levels for almost every interrupt.
In ARM, numerically low priority values are used to specify logically high interrupt priorities.
The priority of some interrupts is fixed.

ARM Cortex-M processors use a byte to represent the priority level.
Interrupt priority is configured through the Interrupt Priority (IP) registers.

In embedded systems, we often have to perform some critical operations, in which data should not be corrupted by other interrupts.
Therefore, we need to disable all interrupts with less urgency to ensure that the execution of the critical code will not be interrupted by other interrupts.
We can use the Base Priority Mask Register(BASEPRI) to achieve the protection of critical code.
In this ex., we disable all interrupts whose priority value is >= 5 (that is, equally or less urgent than level 5) during the execution of the critical code.


__set_BASEPRI(5 << 4);   // block priority values >= 5
// critical code start
..
// critical code end
__set_BASEPRI(0);        // 0 turns the masking off
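The shift by 4 suggests that only the upper 4 bits of the priority byte are implemented on this chip. The sketch below models the masking rule on a PC under that assumption; `DEMO_PRIO_BITS` and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the chip implements 4 priority bits, stored in the
 * upper bits of the 8-bit priority field. */
#define DEMO_PRIO_BITS 4u

/* Convert a priority level (0 = most urgent) into the byte encoding. */
static uint8_t encode_priority(unsigned level)
{
    assert(level < (1u << DEMO_PRIO_BITS));
    return (uint8_t)(level << (8u - DEMO_PRIO_BITS));
}

/* Model of BASEPRI: when it is non-zero, interrupts whose priority value
 * is greater than or equal to BASEPRI are blocked; zero disables masking. */
static bool is_blocked_by_basepri(uint8_t basepri, uint8_t priority)
{
    if (basepri == 0) {
        return false;
    }
    return priority >= basepri;
}
```

With BASEPRI set to `encode_priority(5)`, levels 5 to 15 are held off while levels 0 to 4 can still preempt the critical code.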

11. External interrupts (EXTI)

This lecture will show you how to configure and program external interrupt(EXTI).

External interrupts are generated by peripherals or devices, external to the microcontroller, such as push buttons and key pads.
There are 2 approaches to monitor and respond to external events.

polling
interrupt

An interrupt is essentially a HW-triggered SW action.
The interrupt controller:

temporarily stops the normal flow of program execution
causes the interrupt service routine(ISR) to be executed

When there are no interrupt events, the processor runs the normal program or enters a sleep state to conserve energy.

Using the STM32L4 kit as an ex.,

GPIO port A's pins PA0, PA1, PA5, PA2 and PA3 are connected to the "center", "left", "down", "right", and "up" pins of the joystick respectively.
Each pin is connected to the ground via a capacitor.
These capacitors perform HW switch debouncing.

When the "up" of the joystick is pressed, this switch is then closed.
As a result, PA#3 is then connected to the 3V via the "COMMON" terminal.
Note that:

the default voltage of the "CENTER" pin is 0 because of the pull-down resistor R59.
The other 4 joystick terminals are not pulled down.

We can use external interrupts to monitor whether the joystick is pressed or not.
Each GPIO pin can trigger an interrupt request signal independently.
SW can configure the external interrupt controller so that:

PA#0 triggers EXTI0
PA#1 triggers EXTI1
PA#5 triggers EXTI5
PA#2 triggers EXTI2
PA#3 triggers EXTI3

How to configure the source of the external interrupt controller?
The external interrupt controller monitors the change of the voltage signal.
The rising or falling edge of the voltage signal can make the external interrupt controller generate an interrupt request.
The interrupt request will be sent to the NVIC.
The external interrupt controller supports 16 external interrupt inputs. These inputs are named external interrupt 0 to 15 and are associated with GPIO pins.
Each interrupt input k is associated with pin number k of one selected GPIO port.
Pins with different numbers, from the same or different ports, can be used as interrupt sources simultaneously.
However, for a given pin number, only one port can be selected at a time: for ex., PA3 and PB3 cannot both drive external interrupt 3.
The interrupt controller has one multiplexer for each GPIO pin number. There are 16 multiplexers.
A multiplexer (MUX) is a simple circuit. It selects one of its inputs and forwards it to the output.
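In software, each multiplexer is set through a small selection field. The sketch below models this on a PC, assuming four selection registers with one 4-bit field per interrupt line, and port codes 0 for port A, 1 for port B, and so on. The function name and the plain array standing in for the registers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Select which GPIO port drives external interrupt `line`.
 * exticr models four 32-bit selection registers; each holds the
 * 4-bit fields of four consecutive lines (assumed layout). */
static void exti_select_port(uint32_t exticr[4], unsigned line, unsigned port)
{
    assert(line < 16);
    assert(port < 16);

    unsigned reg   = line / 4;        /* which register holds the field */
    unsigned shift = 4 * (line % 4);  /* where the field starts in it   */

    exticr[reg] &= ~(UINT32_C(0xF) << shift);  /* clear the old selection */
    exticr[reg] |= (uint32_t)port << shift;    /* write the new selection */
}
```

Since one field can hold only one port code, only one port's pin k can feed external interrupt k at any time.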
There are dedicated interrupt handlers for external interrupts.
For ex.,

PA.3 can be mapped to EXTI3 and its corresponding interrupt handler is EXTI3_IRQHandler.
External interrupts from number 5 to 9 share the same interrupt handler EXTI9_5_IRQHandler.
External interrupts from number 10 to 15 share the same interrupt handler EXTI15_10_IRQHandler.

The external interrupt controller supports 2 types of interrupts:

configurable external interrupts

direct external interrupts

An interrupt can pass this AND gate if and only if the corresponding bit of the Interrupt Mask Register (IMR) is 1.

Let's work on the SW part: if we press the "UP" button of the joystick, SW turns on the LED.
The "UP" button is connected to the GPIO pin PA3, which can generate the external interrupt request 3.

First, we need to enable the GPIO port A.


RCC->AHB2ENR |= RCC_AHB2ENR_GPIOAEN;

Then, configure the mode of pin PA.3 as the digital input.


// GPIO mode: digital input(00), digital output(01), alternate function(10), analog(11, default).
GPIOA->MODER &= ~(3U << 6);   // clear bits 7:6 only; ~3U << 6 would clear bits 7:0

Set PA.3 as pull down


// GPIO pull-up/pull-down: none(00), pull-up(01), pull-down(10), reserved(11)
GPIOA->PUPDR &= ~(3U << 6);   // clear the 2-bit field of pin 3
GPIOA->PUPDR |= 2U << 6;      // pull-down(10)
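A 2-bit field has to be cleared with a mask of the form `~(3U << shift)`. Because unary `~` binds tighter than `<<`, the expression `~3U << shift` produces a different mask that also clears the lower neighbouring fields. The hypothetical helper below packages the clear-then-set pattern so the parentheses only need to be right in one place; it works on a plain value and assumes a 32-bit register with one 2-bit field per pin.

```c
#include <assert.h>
#include <stdint.h>

/* Return `reg` with the 2-bit field of `pin` replaced by `value`.
 * Layout assumption: pin n occupies bits [2n+1 : 2n]. */
static uint32_t set_2bit_field(uint32_t reg, unsigned pin, uint32_t value)
{
    assert(pin < 16);
    assert(value < 4);

    unsigned shift = 2 * pin;
    reg &= ~(UINT32_C(3) << shift);  /* clear only this pin's field */
    reg |= value << shift;           /* write the new 2-bit value   */
    return reg;
}
```

On the target, it could be used as `GPIOA->PUPDR = set_2bit_field(GPIOA->PUPDR, 3, 2);` to select pull-down for PA.3.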

enable external interrupt 3


NVIC_EnableIRQ(EXTI3_IRQn);

select PA.3 as the source of external interrupt 3


RCC->APB2ENR |= RCC_APB2ENR_SYSCFGEN;           // enable the SYSCFG clock
SYSCFG->EXTICR[0] &= ~SYSCFG_EXTICR1_EXTI3;     // clear the EXTI3 source selection
SYSCFG->EXTICR[0] |= SYSCFG_EXTICR1_EXTI3_PA;   // select port A for EXTI3

rising edge trigger selection


// 0: trigger disabled, 1: trigger enabled
EXTI->RTSR1 |= EXTI_RTSR1_RT3;

set the interrupt mask register


// 0: masked, 1: not masked
EXTI->IMR1 |= EXTI_IMR1_IM3;

ISR for external interrupt 3


void EXTI3_IRQHandler(void){
    if ((EXTI->PR1 & EXTI_PR1_PIF3) != 0) {
        // toggle LED
        ..
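        // clear interrupt flag: pending bits are cleared by writing 1,
        // so write only this flag instead of OR-ing into the register
        EXTI->PR1 = EXTI_PR1_PIF3;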
    }
}
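To my knowledge, the pending register is a write-1-to-clear register: each 1 written clears the matching pending bit, and 0 bits are ignored. The PC-side model below (all names hypothetical) shows why it matters how the flag is written: a read-modify-write with `|=` writes back every bit that is currently pending and therefore clears all of them, not just the one being handled.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a write-1-to-clear pending register. */
typedef struct {
    uint32_t pending;
} demo_exti_t;

/* Model of a store to the register: each written 1 clears that bit. */
static void demo_write_pr(demo_exti_t *e, uint32_t written)
{
    e->pending &= ~written;
}

/* Style A: store only the flag being handled (models `PR = flag`). */
static void clear_flag_by_assignment(demo_exti_t *e, uint32_t flag)
{
    demo_write_pr(e, flag);
}

/* Style B: read, OR in the flag, write back (models `PR |= flag`).
 * The value written contains all currently pending bits. */
static void clear_flag_by_or_assign(demo_exti_t *e, uint32_t flag)
{
    uint32_t value_read = e->pending;
    demo_write_pr(e, value_read | flag);
}
```

In a handler shared by several lines, style B could silently drop a request that arrived on another line.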

12. System Timer (SysTick)

13. Timer PWM output

14. Timer Input Capture

15. Booting Process

16. Volatile Variables

17. Race Condition

18. ADC

19. Floating-Point Unit (FPU)

20. Fixed Point Numbers

21. Why learn assembly language

22. Big Endian and Little Endian

23. Load and Store Instructions

24. Addressing mode: pre-index, post-index, and pre-index with update

25. Arithmetic and Logical Instructions

26. Updating NZCV bit flags

27. Branch Instructions

28. Conditional Execution

29. Calling a subroutine

30. Passing arguments to a subroutine

31. Preserving registers in a subroutine

32. Mixing C and assembly

SoC, MPU, MCU

Microcontrollers vs. Microprocessors: What’s the difference?

Microcontrollers (MCUs) tend to be less expensive, simpler to set up, and simpler to operate than microprocessors (MPUs).
An MCU can be viewed as a single-chip computer, whereas an MPU has surrounding chips that support various functions like memory, interfaces, and I/O.

One of the main differences between microcontrollers and microprocessors is that

a microprocessor will typically run an operating system.

An operating system

A microcontroller will run a “bare metal interface,” which means there is not an operating system.

a single thread

MCUs only have basic options for interfacing with the outside world.
An MCU might have I2C, SPI, a UART (serial), and sometimes a low-level USB connection.
These basic interfaces are often used just for programming the MCU.

An MCU provides more on a single chip than an MPU.

The difference between MCUs and MPUs is becoming less pronounced since some MCUs now come with simple software drivers for more sophisticated peripherals and more MPUs can be found that have integrated peripherals on-chip.

SoC

An SoC( System-on-a-Chip ) can be based on an MCU or MPU and will provide everything that’s necessary to perform certain types of applications.
SoCs enable an entire system of chips on a single, tiny IC.
For example, for image processing, an SoC might have a combination of

MPU
a Digital Signal Processor (DSP)
a Graphic Processing Unit (GPU) for performing rapid algorithm calculations, along with on-chip interfaces for driving a display and an HDMI or other audio/video input/output technology.

ARM Instruction Set

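Basically, an ARM processor has sixteen 32-bit registers, of which 13 are general purpose registers (GPRs), while R13-R15 serve other purposes.

The purpose of learning assembly is not necessarily to improve performance, but to be able to judge whether the machine code produced by an optimizing compiler is correct.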
Basic Syntax


label
    opcode operand1, operand2, ...; Comments

label
opcode
operand

ARM programmer model

The state of an ARM system is determined by the content of visible registers and memory.
A user-mode program can see 15 32-bit general-purpose registers (R0-R14), the program counter (PC), and the CPSR.
Instruction set defines the operations that can change the state.

An Instruction Set Architecture (ISA) is part of the abstract model of a computer that defines how the CPU is controlled by the software.
The ISA defines the supported data types, the registers, how the hardware manages main memory, key features (such as virtual memory), which instructions a microprocessor can execute, and the input/output model of multiple ISA implementations.
ARM instructions are all 32 bits long (except in Thumb mode).
There are 2^32 possible machine instructions. Fortunately, they are structured.

Regarding registers, briefly:

r0
r1-r3
r4-r11
r12
r13
r14
r15

Instruction set:

Data processing

Data movement
Flow control

C6: A64 Base Instruction Descriptions

C6.2.173 MRS(Move System Register)

To read an AArch64 System register into a general-purpose register.

C6.2.175 MSR (register)

To write an AArch64 System register from a general-purpose register.

What is the purpose of WFI and WFE instructions and the event signals?

We have 2 instructions for entering low-power standby state where most clocks are gated: WFI and WFE.

WFI is targeted at entering either standby, dormant or shutdown mode, where an interrupt is required to wake-up the processor.
WFE makes use of the event register, the SEV instruction and EVENTI, EVENTO signals.

WFE

STANDBYWFE

SEV

STANDBYWFE

RASPBERRY PI ON QEMU

Emulate Raspberry Pi 3 using QEMU in 64 bit

Learning to implement a small operating system

Low-Level Programming University

ARM Cortex-A Series Programmer's Guide for ARMv7-A

Preface

The purpose of this book is to provide a single guide for programmers who want to develop applications for the Cortex-A series of processors, bringing together information from a wide variety of sources that will be useful to both assembly language and C programmers.
Hardware concepts such as caches and Memory Management Units are covered, but only where this is valuable to the application writer.
We will also look at the way operating systems such as Linux make use of ARM features, and how to take full advantage of the capabilities of the ARM processor, in particular writing software for multi-core processors.

This is not an introductory level book. It assumes some knowledge of the C programming language and microprocessors, but not of any ARM-specific background.
We hope that the book is suitable for programmers who have a desktop PC or x86 background and are taking their first steps into the ARM processor based world.

Chapter 1 Introduction

Chapter 2 ARM Architecture and Processors

Chapter 3 ARM Processor Modes and Registers

The ARM architecture is a modal architecture.
Before the introduction of Security Extensions it had seven processor modes: six privileged modes and a non-privileged user mode.

User (USR)
FIQ
IRQ
Supervisor (SVC)
Abort (ABT)
Undef (UND)
System (SYS)

Privilege is the ability to perform certain tasks that cannot be done from User (Unprivileged) mode.
For ex., the user mode cannot do MMU configuration and cache operations.
Modes are associated with exception events, which are described in Exception Handling.

The introduction of the TrustZone Security Extensions created two security states for the processor that are independent of Privilege and processor mode, with a new Monitor mode to act as a gateway between the Secure and Non-secure states and modes existing independently for each security state.

For processors that implement the TrustZone extension, system security is achieved by dividing all of the hardware and software resources for the device.
When a processor is in the Non-secure state, it cannot access the memory that is allocated for Secure state.
In this situation the Secure Monitor acts as a gateway for moving between these two worlds. Software executing in Monitor mode controls transition between Secure and Non-secure processor states.

The ARMv7-A architecture Virtualization Extensions add a hypervisor mode (Hyp), in addition to the existing privileged modes.
Virtualization enables more than one Operating System to co-exist and operate on the same system.

If the Virtualization Extensions are implemented there is a privilege model.

In Non-secure state there can be three privilege levels, PL0, PL1 and PL2.

These privilege levels are separate from the TrustZone Secure and Normal (Non-secure) settings.
The privilege level defines the ability to access resources in the current security state, and does not imply anything about the ability to access resources in the other security state.

The presence of particular processor modes and states depends on whether the processor implements the relevant architecture extension(Virtualization, TrustZone)

The current processor mode and execution state is contained in the Current Program Status Register (CPSR).

Chapter 4

Generic Interrupt Controller (GIC)

A Generic Interrupt Controller (GIC) takes interrupts from peripherals, prioritizes them, and delivers them to the appropriate processor core.
The Arm GIC architecture has three forms in general use with the A-profile and R-profile processors.

1. Introduction

Terminology

About the Generic Interrupt Controller architecture

The GIC is a centralized resource for supporting and managing interrupts in a system that includes at least one processor.
It provides registers for managing interrupt sources, interrupt behavior, and interrupt routing to one or more processors.

The GIC includes interrupt grouping functionality that supports:

configuring each interrupt as either Group 0 or Group 1
signaling Group 0 interrupts to the target processor using either the IRQ or the FIQ exception request
signaling Group 1 interrupts to the target processor using the IRQ exception request only
a unified scheme for handling the priority of Group 0 and Group 1 interrupts
optional lockdown of the configuration of some Group 0 interrupts.

Security Extensions support

Virtualization support

Terminology

Interrupt states
Interrupt types
Models for handling interrupts
Spurious interrupts
Processor security state and Secure and Non-secure GIC accesses

Banking

Interrupt banking
Register banking

2. GIC Partitioning

About GIC partitioning

The GIC architecture splits logically into a Distributor block and one or more CPU interface blocks.
The GIC Virtualization Extensions add one or more virtual CPU interfaces to the GIC.

Distributor

registers

GICD_

CPU interfaces

registers

GICC_

Virtual CPU interfaces

Virtual interface control

registers

GICH_

Virtual CPU interface

registers

GICV_

The Distributor

The Distributor centralizes all interrupt sources, determines the priority of each interrupt, and for each CPU interface forwards the interrupt with the highest priority to the interface, for priority masking and preemption handling.

Interrupts from sources are identified using ID numbers. Each CPU interface can see up to 1020 interrupts.

CPU interfaces

Each CPU interface block provides the interface for a processor that is connected to the GIC.

3. Interrupt Handling and Prioritization

4. Programmers' Model

This chapter describes the Distributor and CPU interface registers.

The programmers' model for the GIC Distributor and CPU interfaces is to operate using a memory-mapped register interface.

About the programmers' model

GIC register names

Distributor register map

CPU interface register map

GIC register access

Enabling and disabling the Distributor and CPU interfaces

Effect of the GIC Security Extensions on the programmers' model

GICv3 and GICv4 Software Overview

1. Preface

1.3 Terms and Abbreviations

2. Introduction

2.4 Legacy support

The programmers’ model that is used is controlled by the Affinity Routing Enable (ARE) bits in GICD_CTLR:

When ARE == 0, affinity routing is disabled (legacy operation).
When ARE == 1, affinity routing is enabled.

This document focuses on the new GICv3 programmers’ model, where ARE == 1 for both Security states.

3. GICv3 fundamentals

3.1 Interrupts types

3.1.3 How interrupts are signaled to the interrupt controller

Traditionally, interrupts are signaled from a peripheral to the interrupt controller using a dedicated hardware signal.

GICv3 supports message-based interrupts.

an interrupt that is set and cleared by a write to a register in the interrupt controller

3.3 Affinity routing

GICv3 uses affinity routing to identify connected PEs and to route interrupts to a specific PE or group of PEs.
The affinity of a PE is represented as four 8-bit fields:


<affinity level 3>.<affinity level 2>.<affinity level 1>.<affinity level 0>

The affinity scheme matches that used in ARMv8-A, with the affinity of a PE reported in MPIDR_EL1.
System designers must ensure that the affinity value indicated by MPIDR_EL1 is identical to that indicated by GICR_TYPER for the Redistributor connected to the PE.
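To compare the two values in software, the four affinity fields first have to be extracted. The sketch below assumes the MPIDR_EL1 layout as I understand it, with Aff0 in bits [7:0], Aff1 in [15:8], Aff2 in [23:16] and Aff3 in [39:32]. The type and function names are hypothetical, and the register value is passed in as a plain integer so the code can run anywhere.

```c
#include <assert.h>
#include <stdint.h>

/* The four 8-bit affinity levels of one PE. */
typedef struct {
    uint8_t aff3, aff2, aff1, aff0;
} affinity_t;

/* Extract the affinity fields from an MPIDR_EL1 value.
 * Note that Aff3 is not adjacent to Aff2 in this register. */
static affinity_t affinity_from_mpidr(uint64_t mpidr)
{
    affinity_t a;
    a.aff0 = (uint8_t)(mpidr & 0xFFu);
    a.aff1 = (uint8_t)((mpidr >> 8) & 0xFFu);
    a.aff2 = (uint8_t)((mpidr >> 16) & 0xFFu);
    a.aff3 = (uint8_t)((mpidr >> 32) & 0xFFu);
    return a;
}

/* Pack the levels into one 32-bit value in the order
 * <level 3>.<level 2>.<level 1>.<level 0>, convenient for comparison. */
static uint32_t affinity_pack32(affinity_t a)
{
    return ((uint32_t)a.aff3 << 24) | ((uint32_t)a.aff2 << 16) |
           ((uint32_t)a.aff1 << 8)  |  (uint32_t)a.aff0;
}
```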

The exact meaning of the different levels of affinity is defined by the specific processor and SoC.
For ex.,


<group of groups> . <group of processors> .<processor> .<core>


<group of processors> .<processor> .<core> .<thread>

3.4 Security model

3.5 Programmers’ model

The register interface of a GICv3 interrupt controller is split into three groups:

Distributor interface(GICD_*).
Redistributor interface(GICR_*).
CPU interface(ICC_*_ELn).

Generic Timer

The Generic Timer provides a standardized timer framework for Arm cores.
The Generic Timer includes a System Counter and a set of per-core timers.

The System Counter is an always-on device, which provides a fixed frequency incrementing system count.
The system count value is broadcast to all the cores in the system, giving the cores a common view of the passage of time.
Each core has a set of timers.

These timers are comparators, which compare against the broadcast system count that is provided by the System Counter.
Each timer has the following three system registers:

For example, CNTP_CVAL_EL0 is the Comparator register of the EL1 physical timer.

The CNTPCT_EL0 system register reports the current system count value.
CNTFRQ_EL0 reports the frequency of the system count. However, this register is not populated by hardware.

Timer virtualization

Timers can be divided into two groups: virtual timers and physical timers.

Physical timers

EL3

CNTPS

Virtual timers

EL1

CNTV


    Virtual Count = Physical Count - <offset>

The virtual count allows a hypervisor to show virtual time to a Virtual Machine (VM).
This means that the virtual count can represent time experienced by the VM, rather than wall clock time.
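The formula above is a plain 64-bit subtraction. A minimal sketch, with a hypothetical function name and the counter values passed in as ordinary integers:

```c
#include <assert.h>
#include <stdint.h>

/* Virtual Count = Physical Count - offset, computed on 64-bit
 * unsigned values, so the result wraps modulo 2^64. */
static uint64_t virtual_count(uint64_t physical_count, uint64_t offset)
{
    return physical_count - offset;
}
```

A hypervisor could, for ex., increase the offset by the number of ticks during which a VM was not scheduled, so that this time does not show up in the VM's virtual count.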

System Counter

The System Counter generates the system count value that is distributed to all the cores in the system.
This means that all cores share the same view of the passing of time.
Consider the following example:

Device A reads the current system count and adds it to a message as a timestamp, then sends the message to Device B.
When Device B receives the message, it compares the timestamp to the current system count.

In this example, the system count value that is seen by Device B can never be earlier than the timestamp in the message.

The System Counter measures real time.
The count must continue to increment at its fixed frequency.
The System Counter provides two register frames: CNTControlBase and CNTReadBase.

Registers

The AArch64 Reference Manual, which can be downloaded from Arm, contains the detailed specification of the ARMv8 architecture.

CNTFRQ_EL0, Counter-timer Frequency register

This register is provided so that software can discover the frequency of the system counter.
It must be programmed with this value as part of system initialization.
The value of the register is not interpreted by hardware.

CNTFRQ_EL0 is a 64-bit register.

AArch64 System register CNTFRQ_EL0 bits [31:0] are architecturally mapped to AArch32 System register CNTFRQ[31:0].
Bits [31:0] indicate the system counter clock frequency, in Hz.
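
Because the frequency is expressed in Hz, software can convert a number of counter ticks into elapsed time. The following is a rough sketch under stated assumptions: ticks_to_ns is a hypothetical helper, and the frequency is passed in as a plain argument rather than read from CNTFRQ_EL0.

```c
#include <stdint.h>

/* Convert counter ticks to nanoseconds for a counter running at freq_hz.
 * Splitting into whole seconds and a remainder avoids overflowing 64 bits
 * for frequencies that fit in 32 bits. Returns 0 if freq_hz is 0. */
uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq_hz)
{
    if (freq_hz == 0)
        return 0;
    uint64_t secs = ticks / freq_hz;
    uint64_t rem  = ticks % freq_hz;
    return secs * 1000000000ULL + (rem * 1000000000ULL) / freq_hz;
}
```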

CNTPCT_EL0, Counter-timer Physical Count register

This holds the 64-bit physical count value.

CNTVCT_EL0, Counter-timer Virtual Count register

This holds the 64-bit virtual count value.
The virtual count value is equal to the physical count value visible in CNTPCT_EL0 minus the virtual offset visible in CNTVOFF_EL2.
This register can be read using MRS with the following syntax:


MRS <Xt>, <systemreg>

CNTVOFF_EL2, Counter-timer Virtual Offset register

This holds the 64-bit virtual offset.
This is the offset between the physical count value visible in CNTPCT_EL0 and the virtual count value visible in CNTVCT_EL0.
This register can be read using MRS with the following syntax:


MRS <Xt>, <systemreg>

MIDR, Main ID Register

Provides identification information for the PE, including an implementer code for the device and a device ID number.
There is one instance of this register that is used in both Secure and Non-secure states.
Some fields of the MIDR are IMPLEMENTATION DEFINED.

Implementer, bits [31:24]

Variant, bits [23:20]
Architecture, bits [19:16]
PartNum, bits [15:4]
Revision, bits [3:0]
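
The field layout above can be turned into a small decoder. This is only a sketch: the struct and function names are invented for illustration, and the register value is passed in as an ordinary integer instead of being read from the hardware.

```c
#include <stdint.h>

/* Fields of a 32-bit MIDR value, following the bit ranges listed above. */
struct midr_fields {
    uint32_t implementer;   /* bits [31:24] */
    uint32_t variant;       /* bits [23:20] */
    uint32_t architecture;  /* bits [19:16] */
    uint32_t part_num;      /* bits [15:4]  */
    uint32_t revision;      /* bits [3:0]   */
};

struct midr_fields decode_midr(uint32_t midr)
{
    struct midr_fields f;
    f.implementer  = (midr >> 24) & 0xFF;
    f.variant      = (midr >> 20) & 0xF;
    f.architecture = (midr >> 16) & 0xF;
    f.part_num     = (midr >> 4)  & 0xFFF;
    f.revision     = midr & 0xF;
    return f;
}
```

For a sample value such as 0x410FD083 (chosen here purely for illustration), the decoder yields implementer 0x41, variant 0, architecture 0xF, part number 0xD08 and revision 3.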

System Control Register (SCTLR)

The SCTLR provides the top level control of the system, including its memory system.

EE, bit [25]

I, bit [12]

C, bit [2]

Cacheability: 0 = Non-cacheable, 1 = can be cached.

M, bit [0]

MMU: 0 = disabled, 1 = enabled.

SCTLR_EL1, System Control Register (EL1)

Provides top level control of the system, including its memory system, at EL1 and EL0.

AArch64 System register SCTLR_EL1 bits [31:0] are architecturally mapped to AArch32 System register SCTLR[31:0].

DSSBS, bit [44]

When FEAT_SSBS is implemented

Otherwise

SSBS, Speculative Store Bypass Safe

This register is present only when FEAT_SSBS is implemented. Otherwise, direct accesses to SSBS are UNDEFINED.

HCR_EL2, Hypervisor Configuration Register (EL2)

Provides configuration controls for virtualization, including defining whether various Non-secure operations are trapped to EL2.

RW, bit [31]

SCR_EL3, Secure Configuration Register (EL3)

Defines the configuration of the current Security state. It specifies:

The Security state of EL0 and EL1, either Secure or Non-secure.
The Execution state at lower Exception levels.
Whether IRQ, FIQ, SError interrupts, and External abort exceptions are taken to EL3.

RW, bit [10]

If EL2 is present:

EL2 is AArch64.
EL2 controls EL1 and EL0 behaviors.

If EL2 is not present:

EL1 is AArch64.
EL0 is determined by the Execution state described in the current process state when executing at EL0.

Bits [5:4]
NS, bit [0]

The AT S1E2R, AT S1E2W, TLBI VAE2, TLBI VALE2, TLBI VAE2IS, TLBI VALE2IS, TLBI ALLE2, and TLBI ALLE2IS System instructions are UNDEFINED.
Each AT S12E** System instruction executes as the corresponding AT S1E** instruction.
Each of the TLBI IPAS2E1, TLBI IPAS2E1IS, TLBI IPAS2LE1, and TLBI IPAS2LE1IS System instructions executes as a NOP.
A TLBI VMALLS12E1 System instruction executes as TLBI VMALLE1, and a TLBI VMALLS12E1IS System instruction executes as TLBI VMALLE1IS.

SPSR_EL3, Saved Program Status Register (EL3)

Holds the saved process state when an exception is taken to EL3.

ACTLR, Auxiliary Control Register

AArch32 System register ACTLR provides IMPLEMENTATION DEFINED configuration and control options for execution at EL1 and EL0.
ACTLR is a 32-bit register, and is part of:

The Other system control registers functional group.
The Implementation defined functional group.

ACTLR_EL1, Auxiliary Control Register (EL1)

Provides IMPLEMENTATION DEFINED configuration and control options for execution at EL1 and EL0.
ACTLR_EL1 is a 64-bit register.

ACTLR_EL2, Auxiliary Control Register (EL2)

Provides IMPLEMENTATION DEFINED configuration and control options for EL2.

ACTLR_EL3, Auxiliary Control Register (EL3)

Provides IMPLEMENTATION DEFINED configuration and control options for EL3.
ACTLR_EL3 is a 64-bit register.

MPIDR_EL1, Multiprocessor Affinity Register, EL1

The MPIDR_EL1 provides an additional core identification mechanism for scheduling purposes in a cluster.
The MPIDR system register describes where a processing element (PE) sits within an ARM core or cluster.
The format of this is as follows (for AArch64):

The MPIDR_EL1 enables software to determine on which core it is executing.
This register has a different value for each processing element in the system.

RES0, [63:40]
Aff3, [39:32]
RES1, [31]
U, [30]

RES0, [29:25]
MT, [24]
Aff2, [23:16]
Aff1, [15:8]

Aff0, [7:0]

single-threaded

0x00

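A physical CPU can have several cores; a CPU core is a physical processing unit.
The cores are independent of one another and can execute in parallel; each core has its own registers and its own physical hardware such as L1 and L2 caches.
Intel built on the core concept with hyper-threading, in which one core presents several logical cores; each of these is called a thread.
A thread, in this sense, is a logical processing unit: software sees it as a separate processor even though it shares a physical core.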
The affinity fields give a hierarchical description of the core's location relative to other cores.
Typically,

Affinity 0 is the core ID within the cluster
Affinity 1 is the cluster ID.
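
The affinity fields can be extracted with shifts and masks. The sketch below assumes the architectural layout in which Aff0 is bits [7:0], Aff1 is bits [15:8], Aff2 is bits [23:16] and Aff3 is bits [39:32]; the function names are hypothetical, and the MPIDR value is passed in as a plain integer.

```c
#include <stdint.h>

/* Extract the affinity levels from a 64-bit MPIDR_EL1 value. */
unsigned mpidr_aff0(uint64_t mpidr) { return (unsigned)(mpidr & 0xFF); }         /* bits [7:0]   */
unsigned mpidr_aff1(uint64_t mpidr) { return (unsigned)((mpidr >> 8) & 0xFF); }  /* bits [15:8]  */
unsigned mpidr_aff2(uint64_t mpidr) { return (unsigned)((mpidr >> 16) & 0xFF); } /* bits [23:16] */
unsigned mpidr_aff3(uint64_t mpidr) { return (unsigned)((mpidr >> 32) & 0xFF); } /* bits [39:32] */
```

With the typical assignment described above, mpidr_aff0 gives the core ID within the cluster and mpidr_aff1 gives the cluster ID.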


    // Read the current CPU ID; if it is not 0 (the primary core), branch to halt and park the core
    // mrs -- Move the contents of a special register to a general-purpose register.
    // mpidr_el1 is read to obtain the core ID
    mrs     x1, mpidr_el1
    and     x1, x1, #0xFF // CPU number is in MPIDR Affinity Level 0
    cbnz    x1, halt // Hang for all non-primary CPU

arch/arm64/include/asm/sysreg.h



#define read_sysreg_s(r) ({						\
	u64 __val;							\
	asm volatile(__mrs_s("%0", r) : "=r" (__val));			\
	__val;								\
})

arch/arm64/include/asm/cputype.h


#define read_cpuid(reg)			read_sysreg_s(SYS_ ## reg)

arch/arm/include/asm/cputype.h



#define CPUID_MPIDR	5

static inline unsigned int __attribute_const__ read_cpuid_mpidr(void)
{
	return read_cpuid(CPUID_MPIDR);
}

ARM GCC Inline Assembler Cookbook

The GNU C compiler for ARM RISC processors offers a way to embed assembly language code into C programs.

GCC asm statement

With inline assembly you can use the same assembler instruction mnemonics as you'd use for writing pure ARM assembly code.

Basic inline assembly syntax


__asm [volatile] (code);

code is the assembly instruction.
For example:


/* NOP example */
asm("mov r0,r0");

You can write more than one assembler instruction in a single inline asm statement.


asm(
"mov     r0, r0\n\t"
"mov     r0, r0\n\t"
"mov     r0, r0\n\t"
"mov     r0, r0"
);

Extended inline assembly syntax

However, registers and constants are specified in a different way, if they refer to C expressions.

  
__asm [volatile] ( code_template
                   : output operand list
                   : input operand list
                   : clobber list );

code_template is a template for an assembly instruction.
The connection between assembly language and C operands is provided by an optional second and third part of the asm statement, the list of output and input operands.

Each operand consists of:

a symbolic name in square brackets
a constraint string

"=r" for the output operands
"r" for the input operands

a C expression in parentheses.

For example:

  
/* Rotating bits example */
asm("mov %[result], %[value], ror #1" : [result] "=r" (y) : [value] "r" (x));
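
To try the rotate example outside a 32-bit ARM build, it can be wrapped in a function with a plain C fallback. This is a sketch under assumptions: the function name rotate_right_1 is invented here, the inline assembly path is only taken when the compiler defines __arm__ (32-bit ARM with a GCC-compatible compiler), and every other target uses the portable expression.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by one bit. */
uint32_t rotate_right_1(uint32_t x)
{
#if defined(__arm__) && defined(__GNUC__)
    uint32_t y;
    /* One output operand, one input operand, no clobbers. */
    __asm__("mov %[result], %[value], ror #1"
            : [result] "=r" (y)
            : [value] "r" (x));
    return y;
#else
    /* Portable equivalent: bit 0 wraps around to bit 31. */
    return (x >> 1) | (x << 31);
#endif
}
```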

The following example sets the current program status register of the ARM CPU. It uses an input, but no output operand.

  
asm ("msr cpsr,%[ps]"
     :
     : [ps] "r" (status));

ARM Trusted Firmware Porting Guide

Introduction

Porting the ARM Trusted Firmware to a new platform involves making some mandatory and optional modifications for both the cold and warm boot paths.

Common Modifications

Common mandatory modifications

A platform port must enable the Memory Management Unit (MMU) with identity mapped page tables, and enable both the instruction and data caches for each BL stage.
In the ARM FVP port, each BL stage configures the MMU in its platform-specific architecture setup function, for example blX_plat_arch_setup().

2.2 Handling reset

BL1 by default implements the reset vector where execution starts from a cold or warm boot.
BL3-1 can be optionally set as a reset vector using the RESET_TO_BL31 make variable.

2.3 Common optional modifications

The following are helper functions implemented by the firmware that perform common platform-specific tasks.

int platform_get_core_pos(unsigned long)


linear index = cpu_id + (cluster_id * 4)

cpu_id = 8-bit value in MPIDR at affinity level 0
cluster_id = 8-bit value in MPIDR at affinity level 1
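
The formula can be sketched directly in C. This is not the firmware's implementation: core_pos_from_mpidr and CORES_PER_CLUSTER are names made up for illustration, and the factor 4 is simply the multiplier that appears in the formula above.

```c
#include <stdint.h>

#define CORES_PER_CLUSTER 4u  /* multiplier from the formula above */

/* linear index = cpu_id + (cluster_id * 4) */
unsigned core_pos_from_mpidr(uint64_t mpidr)
{
    unsigned cpu_id     = (unsigned)(mpidr & 0xFF);         /* affinity level 0 */
    unsigned cluster_id = (unsigned)((mpidr >> 8) & 0xFF);  /* affinity level 1 */
    return cpu_id + cluster_id * CORES_PER_CLUSTER;
}
```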

3 Boot Loader stage specific modifications

3.1 Boot Loader stage 1 (BL1)

3.2 Boot Loader stage 2 (BL2)

3.3 Boot Loader stage 3-1 (BL3-1)

3.3.1 Power State Coordination Interface (in BL3-1)

The ARM Trusted Firmware's implementation of the PSCI API is based around the concept of an affinity instance.
Each affinity instance can be uniquely identified in a system by a CPU ID (the processor MPIDR is used in the PSCI interface) and an affinity level.

CPU affinity enables binding a process or multiple processes to a specific CPU core in a way that the process(es) will run from that specific core only.
When running performance tests on a host with many cores, it is wise to run multiple instances of a process, each on a different core.
This enables higher CPU utilization.

PSCI implementation (in BL3-1)

Interrupt Management framework (in BL3-1)

Crash Reporting mechanism (in BL3-1)

C Library

Storage abstraction layer

Fixed Virtual Platforms (FVP)

Fixed Virtual Platforms (FVPs) are complete simulations of an Arm system, including processor, memory and peripherals.
These are set out in a "programmer's view", which gives you a comprehensive model on which to build and test your software.

Learning operating system development using Linux kernel and Raspberry Pi

Introduction

Contribution guide

Prerequisites

Lesson 1: Kernel Initialization

1.1 Introducing RPi OS, or bare metal “Hello, world!”
Linux
1.2 Project structure
1.3 Kernel build system
1.4 Startup sequence
1.5 Exercises

Lesson 2: Processor initialization

2.1 RPi OS

Exception levels

Each ARM processor that supports the ARMv8 architecture has 4 exception levels.
You can think of an exception level (or EL for short) as a processor execution mode in which only a subset of all operations and registers is available.
The least privileged exception level is level 0. When the processor operates at this level, it mostly uses only general purpose registers (X0 - X30) and the stack pointer register (SP). EL0 also allows using the STR and LDR instructions to store data to and load data from memory, along with a few other instructions commonly used by a user program.

An operating system should deal with exception levels because it needs to implement process isolation.
A user process should not be able to access other process’s data.
To achieve such behavior, an operating system always runs each user process at EL0.
Operating at this exception level, a process can only use its own virtual memory and can’t access any instructions that change virtual memory settings.
So, to ensure process isolation, an OS needs to prepare a separate virtual memory mapping for each process and put the processor into EL0 before transferring execution to a user process.

An operating system itself usually works at EL1.
While running at this exception level, the processor gets access to the registers that allow configuring virtual memory settings, as well as to some system registers. Raspberry Pi OS will also be using EL1.

EL2 is used in a scenario when we are using a hypervisor.
In this case the host operating system runs at EL2 and guest operating systems can only use EL1.
This allows the host OS to isolate guest OSes in the same way that an OS isolates user processes.

EL3 is used for transitions between the ARM “Secure world” and the “Non-secure world”.
This abstraction exists to provide full hardware isolation between the software running in the two different “worlds”.
An application from the “Non-secure world” has no way to access or modify information (both instructions and data) that belongs to the “Secure world”, and this restriction is enforced at the hardware level.

Finding current Exception level

A small function can report the exception level at which the processor is currently running:


.globl get_el
get_el:
    mrs x0, CurrentEL
    lsr x0, x0, #2
    ret

Here we use the mrs instruction to read the value from the CurrentEL system register into the x0 register.
Then we shift this value 2 bits to the right (we need to do this because the lowest 2 bits in the CurrentEL register are reserved and always have the value 0).
Finally, the x0 register holds an integer indicating the current exception level.
To display this value:


    int el = get_el();
    printf("Exception level: %d \r\n", el);
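
The shift performed by get_el can be checked on any host with a plain C model. This is a sketch only: el_from_currentel is a hypothetical helper that takes the raw register value as an argument, and the extra mask is an assumption that only bits [3:2] carry the level.

```c
#include <stdint.h>

/* Model of get_el: the exception level sits in bits [3:2] of CurrentEL,
 * so shift right by 2 and keep two bits. */
unsigned el_from_currentel(uint64_t currentel)
{
    return (unsigned)((currentel >> 2) & 0x3);
}
```

Under this model a raw value of 0x4 corresponds to EL1, 0x8 to EL2 and 0xC to EL3.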

Changing current exception level

In the ARM architecture, a program cannot increase its own exception level without the participation of software that already runs at a higher level.
The current EL can be changed only if an exception is generated. This can happen if:

a program executes some illegal instruction
an application executes the svc instruction to generate an exception on purpose
a hardware interrupt occurs

Whenever an exception is generated, the following sequence of steps takes place (assuming that the exception is handled at EL n):

The address of the current instruction is saved in the ELR_ELn register (Exception Link Register).
The current processor state is stored in the SPSR_ELn register (Saved Program Status Register).
An exception handler is executed and does whatever job it needs to do.
The exception handler calls the eret instruction, which restores the processor state from SPSR_ELn and resumes execution from the address stored in ELR_ELn.

An important thing to know is that:

an exception handler is not obliged to return to the same location from which the exception originates.
Both ELR_ELn and SPSR_ELn are writable, and the exception handler can modify them if it wants to.

We are going to use this technique to our advantage when we try to switch from EL3 to EL1 in our code.

Switching to EL1

Strictly speaking, an operating system is not obliged to switch to EL1, but EL1 is a natural choice because this level has just the right set of privileges to implement all common OS tasks.


#include "arm/sysregs.h"

#include "mm.h"

.section ".text.boot"

.globl _start
_start:
	mrs	x0, mpidr_el1		
	and	x0, x0,#0xFF		// Check processor id
	cbz	x0, master		// Hang for all non-primary CPU
	b	proc_hang

proc_hang: 
	b 	proc_hang

master:
	ldr	x0, =SCTLR_VALUE_MMU_DISABLED
	msr	sctlr_el1, x0		

	ldr	x0, =HCR_VALUE
	msr	hcr_el2, x0

	ldr	x0, =SCR_VALUE
	msr	scr_el3, x0

	ldr	x0, =SPSR_VALUE
	msr	spsr_el3, x0

	adr	x0, el1_entry		
	msr	elr_el3, x0

	eret				

el1_entry:
	adr	x0, bss_begin
	adr	x1, bss_end
	sub	x1, x1, x0
	bl 	memzero

	mov	sp, #LOW_MEMORY
	bl	kernel_main
	b 	proc_hang		// should never come here

Analysis:

sctlr_el1

sctlr_el1 configures parameters of the processor when it operates at EL1.
It is accessible from all exception levels higher than or equal to EL1.


// Some bits in the description of sctlr_el1 register are marked as RES1.
// Those bits are reserved for future usage and should be initialized with 1.
// The whole expression is parenthesized so the macro expands safely inside larger expressions.
#define SCTLR_RESERVED               ((3 << 28) | (3 << 22) | (1 << 20) | (1 << 11))

// This field controls endianness of explicit data access at EL1.
// We are going to configure the processor to work only with little-endian format.
#define SCTLR_EE_LITTLE_ENDIAN          (0 << 25)
// This one controls endianness of explicit data access at EL0.
#define SCTLR_EOE_LITTLE_ENDIAN         (0 << 24)

// Disable instruction cache.
#define SCTLR_I_CACHE_DISABLED          (0 << 12)

// Disable data cache.
#define SCTLR_D_CACHE_DISABLED          (0 << 2)

// Disable MMU.
#define SCTLR_MMU_DISABLED            (0 << 0)
#define SCTLR_MMU_ENABLED             (1 << 0)

#define SCTLR_VALUE_MMU_DISABLED	(SCTLR_RESERVED | SCTLR_EE_LITTLE_ENDIAN | SCTLR_I_CACHE_DISABLED | SCTLR_D_CACHE_DISABLED | SCTLR_MMU_DISABLED)
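
To see which number ends up in sctlr_el1, the same bit pattern can be evaluated on a host machine. The sketch below repeats the non-zero bits from the definitions above with hypothetical EX_ prefixed names; the fields that are defined as 0 (endianness, caches, MMU) contribute nothing to the value.

```c
#include <stdint.h>

/* RES1 bits from the definitions above: bits 29:28, 23:22, 20 and 11. */
#define EX_SCTLR_RESERVED ((UINT32_C(3) << 28) | (UINT32_C(3) << 22) | \
                           (UINT32_C(1) << 20) | (UINT32_C(1) << 11))

/* Value written to sctlr_el1 with caches and MMU disabled, little endian. */
uint32_t ex_sctlr_value_mmu_disabled(void)
{
    return EX_SCTLR_RESERVED;  /* all other fields are 0 */
}

/* Nonzero if the M bit (bit 0) is set, that is, the MMU is enabled. */
int ex_sctlr_mmu_enabled(uint32_t sctlr)
{
    return (int)(sctlr & 1u);
}
```

Evaluating the expression gives 0x30D00800, with bit 0 (M), bit 2 (C) and bit 12 (I) all clear.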

hcr_el2

The RW bit (bit 31) of hcr_el2 controls the execution state at EL1: when it is set, EL1 is AArch64.


#define HCR_RW	    			(1 << 31)
#define HCR_VALUE			HCR_RW

scr_el3


#define SCR_RESERVED	    		(3 << 4)
#define SCR_RW				(1 << 10)
#define SCR_NS				(1 << 0)
#define SCR_VALUE	    	    	(SCR_RESERVED | SCR_RW | SCR_NS)
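
As a quick check of the value assembled above, the same bits can be combined in host C. The EX_ names are hypothetical and exist only for this sketch.

```c
#include <stdint.h>

#define EX_SCR_RESERVED (UINT32_C(3) << 4)   /* bits [5:4] */
#define EX_SCR_RW       (UINT32_C(1) << 10)  /* RW, bit [10] */
#define EX_SCR_NS       (UINT32_C(1) << 0)   /* NS, bit [0] */

/* Value written to scr_el3 in the boot code. */
uint32_t ex_scr_value(void)
{
    return EX_SCR_RESERVED | EX_SCR_RW | EX_SCR_NS;
}
```

The result is 0x431: NS set, bits [5:4] set, and RW set.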

spsr_el3

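spsr_el3 contains the processor state that is restored when the eret instruction executes.
Processor state includes:

Condition Flags
Interrupt disable bits
Some other information, required to fully restore the processor execution state after an exception is handled.

spsr_el3 is writable, so the state to be restored can be prepared manually.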


// After we change EL to EL1 all types of interrupts will be masked (or disabled, which is the same).        
#define SPSR_MASK_ALL 			(7 << 6)
// At EL1 we can either use our own dedicated stack pointer or use EL0 stack pointer.
// EL1h mode means that we are using EL1 dedicated stack pointer.
#define SPSR_EL1h			(5 << 0)
#define SPSR_VALUE			(SPSR_MASK_ALL | SPSR_EL1h)
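
The SPSR value can be evaluated and taken apart the same way. This is a host-side sketch with hypothetical EX_ names: the low four bits are treated as the mode field and bits [8:6] as the three interrupt mask bits set by SPSR_MASK_ALL.

```c
#include <stdint.h>

#define EX_SPSR_MASK_ALL (UINT32_C(7) << 6)  /* bits [8:6]: interrupt masks */
#define EX_SPSR_EL1H     (UINT32_C(5) << 0)  /* mode field: EL1 with its own SP */

uint32_t ex_spsr_value(void)          { return EX_SPSR_MASK_ALL | EX_SPSR_EL1H; }
uint32_t ex_spsr_mode(uint32_t spsr)  { return spsr & 0xF; }         /* bits [3:0] */
uint32_t ex_spsr_masks(uint32_t spsr) { return (spsr >> 6) & 0x7; }  /* bits [8:6] */
```

The combined value is 0x1C5, which decodes back to mode 5 (EL1h) with all three mask bits set.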

ELR_EL3

2.2 Linux

2.3 Exercises

Lesson 3: Interrupt handling

3.1 RPi OS
Linux
3.2 Low level exception handling
3.3 Interrupt controllers
3.4 Timers
3.5 Exercises

Lesson 4: Process scheduler

4.1 RPi OS
Linux
4.2 Scheduler basic structures
4.3 Forking a task
4.4 Scheduler
4.5 Exercises

Lesson 5: User processes and system calls

5.1 RPi OS
5.2 Linux
5.3 Exercises

Lesson 6: Virtual memory management

6.1 RPi OS
6.2 Linux (In progress)
6.3 Exercises

Lesson 7: Signals and interrupt waiting (To be done)

Lesson 8: File systems (To be done)

Lesson 9: Executable files (ELF) (To be done)

Lesson 10: Drivers (To be done)

Lesson 11: Networking (To be done)

Building Embedded Systems: Developing Firmware That Runs on STM32 (嵌入式系統建構：開發運作於STM32的韌體程式)

Programming with 64-Bit ARM Assembly Language

Single Board Computer Development for Raspberry Pi and Mobile Devices
Stephen Smith

Introduction

This book delves into how these are programmed at the bare metal level and provides insight into their architecture.
Knowing how the processor works will let you write more efficient C code.
Source Code Location: https://github.com/Apress/Programming-with-64-Bit-ARM--Assembly-Languag

CHAPTER 1 Getting Started

The idea was to use reduced instruction set computer (RISC) technology as opposed to complex instruction set computer (CISC) technology.
Writing in Assembly is harder, as you must solve problems with memory addressing and CPU registers that are all handled transparently by high-level languages.

Hardware

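Broadcom BCM2711, quad-core Cortex-A72 (ARM v8) 64-bit processor at 1.5 GHz
4GB LPDDR4-3200 SDRAM
2.4 GHz/5.0 GHz IEEE 802.11b/g/n/ac wireless networking, Bluetooth 5.0 BLE
Gigabit Ethernet
2 USB 3.0 ports; 2 USB 2.0 ports
Raspberry Pi standard 40-pin GPIO header
2 micro-HDMI ports (display output up to 4Kp60)
2-lane MIPI DSI display port
2-lane MIPI CSI camera port
4-pole stereo audio and composite video port
H264 (1080p60 decode, 1080p30 encode)
OpenGL ES 3.0 graphics
Micro-SD card slot
5V DC input via USB-C connector (minimum 3A)
5V DC input via GPIO header (minimum 3A)
5V DC input via PoE (requires a separate PoE add-on board)
Operating ambient temperature: 0 - 50 degrees C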

Software

Raspberry Pi OS with desktop

Downloading the Operating System

Raspberry Pi OS (Legacy) with desktop


https://downloads.raspberrypi.org/raspios_oldstable_armhf/images/raspios_oldstable_armhf-2022-04-07/2022-04-04-raspios-buster-armhf.img.xz

Raspberry Pi OS Lite


https://downloads.raspberrypi.org/raspios_lite_armhf/images/raspios_lite_armhf-2022-04-07/2022-04-04-raspios-bullseye-armhf-lite.img.xz

Raspberry Pi OS with desktop


https://downloads.raspberrypi.org/raspios_armhf/images/raspios_armhf-2022-04-07/2022-04-04-raspios-bullseye-armhf.img.xz

Raspberry Pi OS with desktop and recommended software(32 bits)


https://downloads.raspberrypi.org/raspios_full_armhf/images/raspios_full_armhf-2022-04-07/2022-04-04-raspios-bullseye-armhf-full.img.xz

Raspberry Pi OS with desktop(64 bits)


https://downloads.raspberrypi.org/raspios_arm64/images/raspios_arm64-2022-04-07/2022-04-04-raspios-bullseye-arm64.img.xz

Installing the Operating System


$ sudo apt install rpi-imager


$ sudo dd if=2021-10-30-raspios-bullseye-armhf.img of=/dev/sdX bs=4M conv=fsync

Use a USB stick as the system partitions


$ tree /dev/disk/
...
└── by-uuid
    ├── 137e2641-afc7-4d05-bfbf-a40cad4f8261 -> ../../sda1  (swap, 8G)
    ├── ec464b47-461d-4b86-acc8-6ab342d6a8e3 -> ../../sda2  (/usr, 8G)
    ├── cff70456-f637-4eed-945b-3c95a8bc48db -> ../../sda3  (/opt, 1G)
    ├── 69b33879-49f9-4d4e-b787-07b0b60211ba -> ../../sda5  (/var, 2G)
    └── f0d406f3-3da0-4fa7-8aa7-9eaf2b74047e -> ../../sda6  (/home, 9.6G)

$ sudo mkswap /dev/sda1
$ sudo swapon -U 137e2641-afc7-4d05-bfbf-a40cad4f8261

$ cat /etc/fstab
proc            /proc           proc    defaults          0       0
PARTUUID=003e8b7d-01  /boot           vfat    defaults,flush    0       2
PARTUUID=003e8b7d-02  /               ext4    defaults,noatime  0       1
# a swapfile is not a swap partition, no line here
#   use  dphys-swapfile swap[on|off]  for that
# /dev/sda1 for swap
UUID=137e2641-afc7-4d05-bfbf-a40cad4f8261    none    swap    sw      0   0
# /dev/sda2 for /usr 
UUID=ec464b47-461d-4b86-acc8-6ab342d6a8e3   /usr   ext4   defaults   0       2
# /dev/sda3 for /opt
UUID=cff70456-f637-4eed-945b-3c95a8bc48db   /opt   ext4   defaults   0       2
# /dev/sda5 for /var
UUID=69b33879-49f9-4d4e-b787-07b0b60211ba   /var   ext4   defaults   0       2
# /dev/sda6 for /home
UUID=f0d406f3-3da0-4fa7-8aa7-9eaf2b74047e   /home  ext4   defaults   0       2

To install the Chewing input method, run:


$ sudo apt-get -y install scim-chewing

If you are installing Raspberry Pi OS Lite and intend to run it headless, you will still need to create a new user account. Since you will not be able to create the user account on first boot, you MUST configure the operating system using the Advanced Menu.

Ubuntu


$ xzcat /home/jerry/Downloads/ubuntu-22.04-preinstalled-desktop-arm64+raspi.img.xz | sudo dd of=/dev/sdd bs=32M; sync

Kali Linux

Kali Linux works very well, and we will be using it to test all the programs in this book.

Kali Linux contains several hundred tools targeted towards various information security tasks, such as Penetration Testing, Security Research, Computer Forensics and Reverse Engineering.

To install a pre-built image of the standard build of Kali Linux on your Raspberry Pi 4, follow these instructions:

Get a fast microSD card with at least 16GB capacity.
Download and validate our preferred Kali Raspberry Pi 4 image from the downloads area.
Use the dd utility to image this file to your microSD card (the same process as making a Kali USB).


$ xzcat kali-linux-2022.1-raspberry-pi-arm64.img.xz | sudo dd of=/dev/sdd bs=4M status=progress

Once the dd operation is complete, boot up the Raspberry Pi 4 with the microSD plugged in.
You should be able to log in to Kali.


    User: kali
    Password: kali

Enable ssh login:

Install Kali Linux remote SSH-OpenSSH server


$ sudo apt-get install ssh  
$ sudo service ssh start

Enable Kali Linux Remote SSH Service


$ sudo update-rc.d -f ssh remove
$ sudo update-rc.d -f ssh defaults

Check whether the service is running.

  
$ sudo apt-get install chkconfig
$ sudo chkconfig -l ssh

Default Tool Credentials

ARM Assembly Instructions

The ARM is what is called a RISC computer: there are fewer instructions and each one is simple, so the processor can execute each instruction quickly.

CPU Registers

The registers are part of the CPU circuitry, allowing instant access, whereas memory is a separate component and there is a transfer time for the CPU to access it.
On ARM, data is not operated on directly in memory; instead it is loaded into CPU registers, and the data processing or arithmetic operation is performed there.

If you want to add two numbers, you might do the following:

Load one into one register and the other into another register.
Perform the add operation putting the result into a third register.
Copy the answer from the results register into memory.

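A 64-bit program on an ARM processor in user mode can see:

X0–X30: general purpose registers.
SP, XZR: the stack pointer or the zero register, depending on the context.
X30, LR: the link register, used to hold the return address of a function call; avoid using this register for other purposes.
PC: the program counter, the address of the currently executing instruction.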

All the X registers can be operated on as 32-bit registers by referring to them as W0–W30 and WZR. When we do this, the instruction will use the lower 32 bits of the register and set the upper 32 bits to zero.

Using 32 bits saves memory.

ARM Instruction Format

Each ARM binary instruction is 32 bits long.
Every bit in the instruction is used to tell the processor what to do.
There are quite a few instruction formats, and it can be helpful to know how the bits for each instruction are packed into 32 bits.

Since there are 32 registers in user mode, it takes 5 bits to specify a register.

Because instructions are small and of fixed length, the processor doesn’t need to start decoding an instruction to know how long it is and hence where the next instruction starts.
This is a key feature for allowing processing parallelism and efficiency.

Each instruction that takes registers can use either the 32-bit W version or the 64-bit X version.
To specify which is the case, the high bit of each instruction specifies how we are viewing the registers.

Data processing instructions are the move, arithmetic, logical, comparison, and multiply instructions.
The instruction encoding of the data processing instruction:

An instruction in isolation takes three clock cycles,

one to load the instruction from memory
one to decode the instruction, and then
one to execute the instruction

The ARM is smart and works on three instructions at a time, each at a different step in the process, called the instruction pipeline.

Computer Memory

The 64-bit mode means:

Memory addresses are specified using 64 bits.
The CPU registers are each 64 bits wide and perform 64-bit integer arithmetic.

Instructions are 32 bits in size.

You can load from memory by using a register to specify the address to load.
This is called indirect memory access.

About the GCC Assembler

The general way you specify Assembly instructions is:


label:     opcode    operands

The label: is optional.
Opcodes are short mnemonics, for example:

ADD for addition
LDR for load a register
B for branch

There are quite a few different formats for the operands.

Install the GNU Compiler Collection (GCC) toolchains, including the ARM cross toolchains, on an x86_64 host


$ sudo apt install -y build-essential
$ sudo apt install -y crossbuild-essential-arm64
$ sudo apt install -y crossbuild-essential-armhf

Native toolchains


$ sudo apt update && sudo apt dist-upgrade
$ sudo apt-get install build-essential gawk gcc g++ gfortran git texinfo bison libncurses-dev bc flex libssl-dev make

Hello World

HelloWorld.s:


.global _start              // Provide program starting address
_start:
    mov     x0, #1          /* 1 = StdOut */
    ldr     x1, =helloworld /* string to print */
    mov     x2, #13         /* length of our string */
    mov     x8, #64         /* Linux write() system call */
    svc     0               /* call Linux system call */

    // Set up parameters to exit the program gracefully
    mov     x0, #0          // return code = 0
    mov     x8, #93         // Linux exit() system call
    svc     0               /* call Linux system call */

.data
helloworld:    .ascii  "Hello World!\n"

Build the executable:


$ as -o HelloWorld.o HelloWorld.s
$ ld -o HelloWorld HelloWorld.o
$ ./HelloWorld
Hello World!

About Comments

Comments use the same syntax as in C/C++ code:

// double slashes
/* and */

Where to Start

The .global directive makes the _start label visible outside this file, so the linker can find it and use it as the program entry point.
Only one file can contain _start.

Assembly Instructions


svc 0

svc 0 is the instruction that executes software interrupt number 0.
It branches to the interrupt handler in the Linux kernel.

Data

The label “helloworld” is followed by an .ascii directive, which allocates one or more bytes of memory in the current section and defines the initial contents of the memory from a string literal.

Calling Linux

This program makes two Linux system calls to do its work:

The first is the Linux write to file command (#64).
The second is the Linux exit command (#93).

For any Linux system call:

The system call is selected by putting its function number in X8.
The parameters go in registers X0–X7, depending on how many parameters are needed.
A return code is placed in X0 so the result of the call can be checked.
The software interrupt has another benefit of providing a standard mechanism to switch privilege levels.

Reverse Engineering Our Program


$ objdump -s -d HelloWorld.o

HelloWorld.o:     file format elf64-littleaarch64

Contents of section .text:
 0000 200080d2 e1000058 a20180d2 080880d2   ......X........
 0010 010000d4 000080d2 a80b80d2 010000d4  ................
 0020 00000000 00000000                    ........        
Contents of section .data:
 0000 48656c6c 6f20576f 726c6421 0a        Hello World!.   

Disassembly of section .text:

0000000000000000 <_start>:
   0:	d2800020 	mov	x0, #0x1                   	// #1
   4:	580000e1 	ldr	x1, 20 <_start+0x20>
   8:	d28001a2 	mov	x2, #0xd                   	// #13
   c:	d2800808 	mov	x8, #0x40                  	// #64
  10:	d4000001 	svc	#0x0
  14:	d2800000 	mov	x0, #0x0                   	// #0
  18:	d2800ba8 	mov	x8, #0x5d                  	// #93
  1c:	d4000001 	svc	#0x0
	...

Let’s investigate the binary representation of the first MOV instruction which compiled to 0xd2800020:

The 1st bit is 1

64-bit

The 3rd bit is 0

affect conditional instructions

The 2nd bit combined with the 4-th to 9-th bits make up the opcode for this MOV instruction.
The 10-th and 11-th bits of 0 indicate there is no shift operation involved.
The 12-th to 27-th bits are the immediate value which is 1
The last 5 bits are the register to load.
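
The bit positions described above can be checked by extracting the fields in C. This sketch counts bits the architectural way (bit 31 is the leftmost, so the "1st bit" is bit 31 and the "last 5 bits" are bits [4:0]); the function names are hypothetical.

```c
#include <stdint.h>

/* Field extraction for the MOV (wide immediate) encoding examined above. */
unsigned movz_sf(uint32_t insn)    { return (insn >> 31) & 0x1; }    /* 1 = 64-bit */
unsigned movz_hw(uint32_t insn)    { return (insn >> 21) & 0x3; }    /* shift / 16 */
unsigned movz_imm16(uint32_t insn) { return (insn >> 5) & 0xFFFF; }  /* immediate  */
unsigned movz_rd(uint32_t insn)    { return insn & 0x1F; }           /* register   */
```

Applied to 0xd2800020 this gives a 64-bit operation, no shift, immediate 1 and register 0, matching mov x0, #0x1 in the disassembly; 0xd2800ba8 decodes to immediate 93 and register 8.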

Chapter 2: Loading and Adding

This chapter builds an understanding of the ARM instruction set by going slowly through the MOV and ADD instructions.

Negative Numbers

If negative numbers were stored as a sign bit plus a magnitude, the CPU would have to look at the sign bits, then decide whether to add or subtract and in which order.

About Two’s Complement

Two’s complement is to change all the 1s to 0s and all the 0s to 1s and then add 1.

-3 can be represented as

  
  ~ (0000 0011) +1 = 1111 1101 = 0xFD

For 1 byte calculation,

  
5 - 3 = 5 + (-3) = 0x05 + 0xFD = 0x102 -> 0x02 = 2   (the carry out of the top bit does not fit in 1 byte and is dropped)
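
The 1-byte calculation can be reproduced in C with uint8_t, where results are truncated to 8 bits. The helper names below are made up for this sketch.

```c
#include <stdint.h>

/* Two's complement of a byte: invert all bits, add 1, keep the low 8 bits. */
uint8_t twos_complement8(uint8_t x)
{
    return (uint8_t)(~x + 1u);
}

/* 8-bit addition: the carry out of bit 7 is discarded by the cast. */
uint8_t add8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}
```

Here twos_complement8(3) is 0xFD, and add8(5, 0xFD) is 2, the same result as 5 - 3.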

About Gnome Programmer’s Calculator

The Gnome programmer’s calculator can calculate the two’s complement.

About One’s Complement

If we don’t add 1, and just change all the 1s to 0s and vice versa, then this is called one’s complement.

Big vs. Little Endian

Big endian is how we normally write numbers: the most significant byte or digit is placed leftmost in the structure (the big end, at the lowest memory address). Big endian is also known as "network byte order": the TCP/IP Internet protocols use it regardless of the hardware at either end.

About Bi-endian

Pros of Little Endian

Even though Linux uses little endian, many protocols like TCP/IP used on the Internet use big endian and so require a transformation when moving data from the computer to the outside world.
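
The transformation mentioned above amounts to reversing the byte order of a value. The following is a minimal sketch with hypothetical helper names; real programs would normally use the platform's own conversion routines instead.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Returns 1 if the host stores the least significant byte at the lowest address. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const uint8_t *)&probe == 1;
}

/* Convert a host-order value to big endian (network byte order). */
uint32_t to_big_endian32(uint32_t v)
{
    return host_is_little_endian() ? swap32(v) : v;
}
```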

Shifting and Rotating

0x30 = 3 * 16 = 3 * 2^4

About Carry Flag

When instructions execute, they can optionally set some flags that contain useful information on what happened. Then other instructions can test these flags and process accordingly.

About the Barrel Shifter

Basics of Shifting and Rotating

Logical shift left
Logical shift right
Arithmetic shift right
Rotate right
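As a rough model of these four operations on a 64-bit register, here is a sketch in C. The helper names are hypothetical, and each helper assumes a shift amount between 1 and 63.

```c
#include <stdint.h>

/* All helpers assume 1 <= n <= 63; the names mirror the ARM mnemonics. */

/* Logical shift left: zeros enter on the right. */
uint64_t lsl64(uint64_t x, unsigned n) { return x << n; }

/* Logical shift right: zeros enter on the left. */
uint64_t lsr64(uint64_t x, unsigned n) { return x >> n; }

/* Arithmetic shift right: copies of the sign bit enter on the left. */
uint64_t asr64(uint64_t x, unsigned n)
{
    uint64_t fill = (x >> 63) ? ~(~(uint64_t)0 >> n) : 0;
    return (x >> n) | fill;
}

/* Rotate right: bits shifted out on the right re-enter on the left. */
uint64_t ror64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));
}
```

For example, lsl64(0x3, 4) gives 0x30, and ror64(1, 1) moves the low bit to the top of the register.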

Loading Registers

Instruction Aliases

MOV isn’t an ARM Assembly instruction; it’s an alias.
The Assembler finds a real ARM instruction to do the job.
For example, a register-to-register move such as MOV X0, X1 could be carried out by:


ADD X0, XZR, X1

This instruction adds the contents of register X1 to the zero register and puts the result in X0.

If you use objdump, it might show the same alias you used, another alternate alias, or the real instruction. There is a “-M no-aliases” option for objdump where you can see the true underlying instruction.

MOV/MOVK/MOVN

There are several forms of the MOV instruction:

MOV(Register to Register)


MOV X1, X2

MOVK(move keep)


MOV     X2, #0x6E3A
MOVK   X2, #0x4F5D, LSL #16
MOVK   X2, #0xFEDC, LSL #32
MOVK   X2, #0x1234, LSL #48
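The MOV/MOVK sequence above builds a 64-bit constant 16 bits at a time. A small C model of that behaviour is sketched below; movz and movk here are hypothetical C functions named after the instructions, not real intrinsics, and shift is assumed to be 0, 16, 32 or 48.

```c
#include <stdint.h>

/* MOVZ: place a 16-bit immediate at the given shift and zero the rest. */
uint64_t movz(uint16_t imm16, unsigned shift)
{
    return (uint64_t)imm16 << shift;
}

/* MOVK: replace one 16-bit field of reg and keep the other 48 bits. */
uint64_t movk(uint64_t reg, uint16_t imm16, unsigned shift)
{
    uint64_t mask = (uint64_t)0xFFFF << shift;
    return (reg & ~mask) | ((uint64_t)imm16 << shift);
}
```

Applying movz(0x6E3A, 0) and then movk with 0x4F5D, 0xFEDC and 0x1234 at shifts 16, 32 and 48 produces 0x1234FEDC4F5D6E3A.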

About Operand2

All the ARM’s data processing instructions have the option of taking a flexible Operand2 as one of their parameters.
There are three formats for Operand2:

A register and a shift


MOV   X1, X2, LSL #1    // Logical shift left
MOV   X1, X2, LSR #1    // Logical shift right
MOV   X1, X2, ASR #1    // Arithmetic shift right
MOV   X1, X2, ROR #1   // Rotate right


LSL   X1, X2, #1    // Logical shift left
LSR   X1, X2, #1    // Logical shift right
ASR   X1, X2, #1    // Arithmetic shift right
ROR   X1, X2, #1    // Rotate right

A register and an extension operation

extension operations

uxtb
uxth
uxtw
sxtb
sxth
sxtw

A small number and a shift


  // Too big for #imm16
     MOV    X1, #0xAB000000


MOV   x1, #0xAB00, LSL #16

MOVN(Move Not)

It works just like MOV, except it reverses all the 1s and 0s as it loads the register.
It applies a logical NOT operation to each bit in the word you are loading into the register.
Its main uses:

To calculate the one’s complement.
As the first step of multiplying by -1: the two’s complement is the one’s complement plus 1.

MOV Examples

The following example illustrates the MOV instructions.
This program doesn’t do anything besides move various numbers into registers.
movexamps.s:


// Examples of the MOV instruction.
//
.global _start  // Provide program starting address

// Load X2 with 0x1234FEDC4F5D6E3A first using MOV and MOVK
_start:
    mov x2, #0x6E3A
    MOVK X2, #0x4F5D, LSL #16
    MOVK X2, #0xFEDC, LSL #32
    MOVK X2, #0x1234, LSL #48
    // Just move W2 into W1
    MOV W1, W2
    // Now lets see all the shift versions of MOV
    MOV X1,X2,LSL #1  // Logical shift left
    MOV X1, X2, LSR #1 // Logical shift right
    MOV X1, X2, ASR #1 // Arithmetic shift right
    // Repeat the above shifts using mnemonics.
    LSL X1,X2,#1  // Logical shift left
    LSR X1,X2,#1  // Logical shift right
    ASR X1,X2,#1  //Arithmetic shift right
    ROR X1,X2,#1  // Rotate right

    // Example that works with a 16-bit immediate and shift
    MOV X1, #0xAB000000  // Too big for #imm16 alone; fits as #0xAB00, LSL #16
    // Example that can't be represented and results in an error
    // Uncomment the instruction if you want to see the error
    //   MOV   X1, #0xABCDEF11  // Too big for #imm16 and can't be represented.

    // Example of MOVN
    MOVN W1, #45

    // Example of a MOV that the Assembler will change to MOVN
    MOV W1, #0xFFFFFFFE  // (-2)

    // Setup the parameters to exit the program
    // and then call Linux to do it.
    MOV X0, #0  // Use 0 return code
    MOV X8, #93  // Service command code 93 terminates
    SVC 0  // Call linux to terminate

We can see the true ARM 64-bit instructions that are produced by the Assembler by objdump:


$ objdump -s -d -M no-aliases movexamps.o

movexamps.o:     file format elf64-littleaarch64

Contents of section .text:
 0000 42c78dd2 a2eba9f2 82dbdff2 8246e2f2  B............F..
 0010 e103022a e10702aa e10742aa e10782aa  ...*......B.....
 0020 41f87fd3 41fc41d3 41fc4193 4104c293  A...A.A.A.A.A...
 0030 0160b5d2 a1058012 21008012 000080d2  .`......!.......
 0040 a80b80d2 010000d4                    ........        

Disassembly of section .text:

0000000000000000 <_start>:
   0:	d28dc742 	movz	x2, #0x6e3a
   4:	f2a9eba2 	movk	x2, #0x4f5d, lsl #16
   8:	f2dfdb82 	movk	x2, #0xfedc, lsl #32
   c:	f2e24682 	movk	x2, #0x1234, lsl #48
  10:	2a0203e1 	orr	w1, wzr, w2
  14:	aa0207e1 	orr	x1, xzr, x2, lsl #1
  18:	aa4207e1 	orr	x1, xzr, x2, lsr #1
  1c:	aa8207e1 	orr	x1, xzr, x2, asr #1
  20:	d37ff841 	ubfm	x1, x2, #63, #62
  24:	d341fc41 	ubfm	x1, x2, #1, #63
  28:	9341fc41 	sbfm	x1, x2, #1, #63
  2c:	93c20441 	extr	x1, x2, x2, #1
  30:	d2b56001 	movz	x1, #0xab00, lsl #16
  34:	128005a1 	movn	w1, #0x2d
  38:	12800021 	movn	w1, #0x1
  3c:	d2800000 	movz	x0, #0x0
  40:	d2800ba8 	movz	x8, #0x5d
  44:	d4000001 	svc	#0x0

We can see the shift instructions were converted into UBFM, SBFM, and EXTR instructions.

ADD/ADC

These instructions all add their second and third parameters and put the result in their first parameter register destination (Rd):


ADD{S} Xd, Xs, Operand2
ADC{S} Xd, Xs, Operand2

The registers Rd and source register (Rs) can be the same.
Examples,


// the immediate value can be 12-bits, so 0-4095
// X2 = X1 + 4000
   ADD   X2, X1, #4000
// the shift on an immediate can be 0 or 12
// X2 = X1 + 0x20000
   ADD   X2, X1, #0x20, LSL 12
// simple addition of two registers
// X2 = X1 + X0
   ADD   X2, X1, X0
// addition of a register with a shifted register
// X2 = X1 + (X0 * 4)
   ADD   X2, X1, X0, LSL 2
// With register extension options
// X2 = X1 + signed extended byte(X0)
   ADD   X2, X1, X0, SXTB
// X2 = X1 + zero extended halfword(X0)
   ADD   X2, X1, X0, UXTH

To print out a number, we must first convert the number to an ASCII string.
Until then, there is a trick: we can get one number out of our program via the program’s return code.


/* This is a comment */
.global _start /* '_start' is our entry point and must be global */

_start:          /* Execution begins here */
    mov w0, #2 /* Put a 2 inside the register w0 */
    // Setup the parameters to exit the program and then call Linux to do it.
    // W0 is the return code
    MOV X8, #93  // Service command code 93
    SVC 0  // Call linux to terminate

To see the return code after execution:


$ echo $?
2

Add with Carry

We can combine multiple ADD instructions to add arbitrarily large integers. The key to this is the carry flag.
When an addition overflows, it sets the carry flag.
The ARM processor adds 64 bits at a time, so we only need the carry flag if we are dealing with numbers larger than what will fit into 64 bits.
If we want an instruction to alter the condition flags, we place an “S” on the end of the opcode, and the Assembler sets the S bit (bit 29) when it builds the binary version of the instruction.

This example adds two 128-bit integers, using

registers X2 (high-order) and X3 (low-order) for the first 128-bit number,
registers X4 (high-order) and X5 (low-order) for the second 128-bit number,
X0 (high-order) and X1 (low-order) for the result.


ADDS  X1, X3, X5  // Lower order 64-bits
ADC   X0, X2, X4  // Higher order 64-bits

ADDS adds the lower-order 64 bits and sets the carry flag.
ADC adds the higher-order 64 bits, plus the carry flag.
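The two-instruction pattern can be modelled in C by detecting the carry out of the low half by hand. This is a sketch only; struct u128 and add128 are hypothetical names introduced for the illustration.

```c
#include <stdint.h>

/* A 128-bit unsigned integer held as two 64-bit halves. */
struct u128 { uint64_t hi, lo; };

struct u128 add128(struct u128 a, struct u128 b)
{
    struct u128 r;
    uint64_t carry;

    r.lo = a.lo + b.lo;          /* like ADDS: add the low halves */
    carry = (r.lo < a.lo);       /* unsigned wraparound means a carry out */
    r.hi = a.hi + b.hi + carry;  /* like ADC: high halves plus the carry */
    return r;
}
```

Adding 1 to a value whose low half is all ones gives a low half of 0 and increments the high half, which is exactly the case the carry flag exists for.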

SUB/SBC


SUB{S} Xd, Xs, Operand2
SBC{S} Xd, Xs, Operand2

The carry flag is used to indicate when a borrow is necessary.
SUBS clears the carry flag if the subtraction needs a borrow (the first operand is smaller than the second when both are treated as unsigned) and sets it otherwise; SBC then subtracts one more if the carry flag is clear.
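A 128-bit subtraction follows the same shape as the addition, with a borrow instead of a carry. The sketch below is a C model under that assumption; struct u128 and sub128 are hypothetical names.

```c
#include <stdint.h>

/* A 128-bit unsigned integer held as two 64-bit halves. */
struct u128 { uint64_t hi, lo; };

struct u128 sub128(struct u128 a, struct u128 b)
{
    struct u128 r;
    uint64_t borrow;

    r.lo = a.lo - b.lo;           /* like SUBS: subtract the low halves */
    borrow = (a.lo < b.lo);       /* on ARM this case leaves the C flag clear */
    r.hi = a.hi - b.hi - borrow;  /* like SBC: subtract one more on a borrow */
    return r;
}
```

Subtracting 1 from the value with high half 1 and low half 0 borrows from the high half and leaves a low half of all ones.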

Chapter 3: Tooling Up

GNU Make

Rebuilding a File

A Rule for Building .s Files


%.o : %.s
    as $< -o $@
HelloWorld: HelloWorld.o
     ld -o HelloWorld HelloWorld.o

% is a wildcard

$< is the source file

$@ is the output file

Defining Variables


TARGET = HelloWorld
OBJS = $(TARGET).o

GDB


sudo apt-get install gdb

Preparing to Debug

To add debug information to our program, we must Assemble it with the -g flag.
Use a Makefile variable to control the debug flag,


ifdef DEBUG
DEBUGFLGS = -g
else
DEBUGFLGS =
endif

Beginning GDB

Commands:

gdb executable
run
list
disassemble _start
b _start
s
i r
c
i b
delete 1
x /Nfu addr

where N is the number of units to display, f is the display format:

t  binary
x  hexadecimal
d  decimal
i  instruction
s  string

and u is the unit size:

b  bytes
h  halfwords (16 bits)
w  words (32 bits)
g  giant words (64 bits)

q  quit gdb


(gdb) x /4ubft _start
0x400078 <_start>:    01000010   11000111
10001101   11010010
(gdb) x /4ubfi _start
   0x400078 <_start>:   mov    x2, #0x6e3a          // #28218
=> 0x40007c <_start+4>: movk    x2, #0x4f5d, lsl #16
   0x400080 <_start+8>: movk    x2, #0xfedc, lsl #32
   0x400084 <_start+12>: movk    x2, #0x1234, lsl #48
(gdb) x /4ubfx _start
0x400078 <_start>:      0x42    0xc7    0x8d    0xd2
(gdb) x /4ubfd _start
0x400078 <_start>:      66      -57     -115    -46

Cross-Compiling

Get all the necessary GNU and Linux tools to compile for ARM,


sudo apt-get install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu

These tools are installed under /usr/aarch64-linux-gnu/, so on an Intel-based host machine they are not picked up through the default path.
To use the cross-platform tools, add this path in our makefile:


TOOLPATH = /usr/aarch64-linux-gnu/bin
HelloWorld: HelloWorld.o
     $(TOOLPATH)/ld -o HelloWorld HelloWorld.o
HelloWorld.o: HelloWorld.s
     $(TOOLPATH)/as -o HelloWorld.o HelloWorld.s

It can be faster to do your builds on a more powerful laptop or desktop than on the target.
The workflow is to build the program on a full development (native) system and then transfer the program to the target processor using a USB cable, serial cable, or via Ethernet.

Emulation

There are quite a few different emulators available with Ubuntu Linux running on an Intel CPU.
To play around with Arm assembly without an Arm board, the QEMU user mode emulation is more than sufficient.

Executing ARM64 binaries (C to Binary)


$ sudo apt install qemu-user qemu-user-static gcc-aarch64-linux-gnu binutils-aarch64-linux-gnu binutils-aarch64-linux-gnu-dbg build-essential


#include <stdio.h>

int main(void) { 
    return printf("Hello, I'm executing ARM64 instructions!\n"); 
}


$ aarch64-linux-gnu-gcc -static -o hello64 hello.c
$ file hello64
hello64: ELF 64-bit LSB executable, ARM aarch64, version 1 (GNU/Linux), statically linked, BuildID[sha1]=f6e13f22124754ff411cd4c40011b3da72388684, for GNU/Linux 3.7.0, not stripped

With qemu-user-static installed, the statically linked binary can be executed directly:


$ ./hello64
Hello, I'm executing ARM64 instructions!

To run a dynamically linked binary with qemu-user, compile without -static:


$ aarch64-linux-gnu-gcc -o hello64dyn hello.c
$ file ./hello64dyn
./hello64dyn: ELF 64-bit LSB shared object, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, BuildID[sha1]=8d5a19d29c460ef70c98912db056e5e1ca9e9607, for GNU/Linux 3.7.0, not stripped

Run it with qemu-aarch64, passing the path to the aarch64 libraries with the -L flag:


$ qemu-aarch64 -L /usr/aarch64-linux-gnu ./hello64dyn
Hello, I'm executing ARM64 instructions!

Executing ARM32 binaries (C to Binary)


sudo apt install gcc-arm-linux-gnueabihf binutils-arm-linux-gnueabihf binutils-arm-linux-gnueabihf-dbg

Android NDK

Apple XCode

Source Control and Build Servers

Git

Jenkins

Chapter 4: Controlling Program Flow

Unconditional Branch

An unconditional branch to a label:


B label

The label is interpreted as an offset from the current PC register and has 26 bits in the instruction.
This allows a jump of up to 128 megabytes in either direction.
An endless loop:



_start:   MOV X1, #1
           B _start
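To make the 26-bit offset concrete, the sketch below computes the target of a B instruction from its encoding. It is a C illustration only; branch_target is a hypothetical helper name, and it assumes the instruction really is an unconditional B.

```c
#include <stdint.h>

/* Target address of an A64 unconditional branch (B) located at pc. */
uint64_t branch_target(uint64_t pc, uint32_t insn)
{
    int64_t imm26 = insn & 0x03FFFFFF;  /* low 26 bits of the instruction */

    if (imm26 & 0x02000000)             /* bit 25 set: the offset is negative */
        imm26 -= 0x04000000;            /* sign-extend from 26 bits */

    return pc + (uint64_t)(imm26 * 4);  /* the offset counts 4-byte instructions */
}
```

In the endless loop above, B _start sits 4 bytes after _start, so its offset field is -1 and the target is the address of _start. The largest forward offset, 0x1FFFFFF instructions, is 4 bytes short of 128 megabytes.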

About Condition Flags

The condition flags are

N: Negative
Z: Zero
C: Carry
V: oVerflow (signed overflow)

These flags are stored in the NZCV system register.
These flags are only set if you append an “S” to the end of the instruction’s opcode, otherwise the flags will remain unmodified.

Branch on Condition

To branch only if certain condition flags are set or clear:


B.{condition} label

where {condition} is taken from the following:

For ex.,


B.EQ _start

will branch to _start if the Z flag is set.
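How a few common conditions read the flags can be sketched in C. This covers only six of the condition codes (the ones used in these notes); the struct, enum and function names are hypothetical.

```c
#include <stdbool.h>

/* The four condition flags held in NZCV. */
struct flags { bool n, z, c, v; };

/* A subset of the condition codes; the enum names are illustrative. */
enum cond { COND_EQ, COND_NE, COND_GE, COND_LT, COND_GT, COND_LE };

/* Returns true when B.<cond> would take the branch. */
bool cond_holds(enum cond cc, struct flags f)
{
    switch (cc) {
    case COND_EQ: return f.z;                 /* equal */
    case COND_NE: return !f.z;                /* not equal */
    case COND_GE: return f.n == f.v;          /* signed greater or equal */
    case COND_LT: return f.n != f.v;          /* signed less than */
    case COND_GT: return !f.z && f.n == f.v;  /* signed greater than */
    case COND_LE: return f.z || f.n != f.v;   /* signed less or equal */
    }
    return false;
}
```

The signed comparisons look at N and V together because a signed overflow flips the meaning of the sign bit.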

About the CMP Instruction


CMP Xn, Operand2

This instruction compares the contents of register Xn with Operand2.
This instruction is equivalent to


SUBS XZR, Xn, Operand2

The status flags will be updated accordingly. For example, to do a branch only if register W4 is 45,


CMP W4, #45
B.EQ _start

Loops

Loops can be constructed with branch and comparison instructions.

FOR Loops


FOR I = 1 to 10
     ... some statements...

The above can be implemented:


      MOV W2, #1     // W2 holds I
loop: // body of the loop goes here.
      // Most of the logic is at the end
      ADD W2, W2, #1 // I = I + 1
      CMP W2, #10
      B.LE loop      // IF I <= 10 goto loop

While Loop


// WHILE X < 5
//      ... other statements ....
// END WHILE
// W4 is X and has been initialized
loop: CMP  W4, #5
      B.GE loopdone
      // ... other statements in the loop body ...
      B    loop
loopdone: // program continues

If/Then/Else

For ex,


IF W5 < 10 THEN
     .... if statements ...
ELSE
     ... else statements ...
END IF

Implement:


     CMP W5, #10
     B.GE elseclause
     ... if statements ...
     B endif
elseclause:
     ... else statements ...
endif:     // continue on after the if/then/else ...

Logical Operators

The ARM’s logical operators manipulate the bits in the registers.


AND{S}  Xd, Xs, Operand2
EOR{S}  Xd, Xs, Operand2
ORR{S}  Xd, Xs, Operand2
BIC{S}   Xd, Xs, Operand2

AND

AND performs a bitwise logical and operation between each bit in Xs and Operand2, putting the result in Xd.
For ex., if we only want the high-order byte of a register


     AND   W6, W6, #0xFF000000
     // shift the byte down to the
     // low order position.
     LSR   W6, W6, #24

EOR

EOR performs a bitwise exclusive or operation between each bit in Xs and Operand2, putting the result in Xd.

ORR

ORR performs a bitwise logical or operation between each bit in Xs and Operand2, putting the result in Xd.
For ex., set the low-order byte of X6 to all 1 bits (0xFF) while leaving the seven other bytes unaffected.


ORR   X6, X6, #0xFF

BIC

BIC (bit clear) performs Xs AND NOT Operand2.
The reason this is called bit clear is that

if the bit in Operand2 is 1, then the resulting bit will be 0.
if the bit in Operand2 is 0, then the corresponding bit in Xs will be put in the result Xd.

For ex., clear the low-order byte of X6 while leaving the seven other bytes unaffected.


BIC   X6, X6, #0xFF
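The four logical instructions map directly onto C's bitwise operators. The sketch below models them on 64-bit values and repeats the high-byte extraction from the AND example; all function names are hypothetical.

```c
#include <stdint.h>

uint64_t and64(uint64_t xs, uint64_t op2) { return xs & op2; }   /* AND */
uint64_t eor64(uint64_t xs, uint64_t op2) { return xs ^ op2; }   /* EOR */
uint64_t orr64(uint64_t xs, uint64_t op2) { return xs | op2; }   /* ORR */
uint64_t bic64(uint64_t xs, uint64_t op2) { return xs & ~op2; }  /* BIC */

/* Keep only the high-order byte of a 32-bit word, then shift it down,
 * as the AND followed by LSR sequence does. */
uint32_t high_byte(uint32_t w)
{
    return (w & 0xFF000000u) >> 24;
}
```

With a mask of 0xFF, orr64 sets the low byte to all ones while bic64 clears it, and both leave the upper seven bytes alone.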

Design Patterns

If you adopt a few standard design patterns for how to perform loops and other programming constructs, it will make reading your programs much easier.

Converting Integers to ASCII

Pseudo-code to print a register:


outstr = memory where we want the string + 17
// (string is of the form 0x123456789ABCDEF0 and we want
// the last character)
FOR W5 = 16 TO 1 STEP -1
      digit = X4 AND 0xf
      IF digit < 10 THEN
           asciichar = digit + '0'
      ELSE
           asciichar = digit + 'A' - 10
      END IF
      *outstr = asciichar
      outstr = outstr - 1
NEXT W5

printdword.s:


//
// Assembler program to print a register in hex
// to stdout.
//
// X0-X2 - parameters to linux function services
// X1 - is also address of byte we are writing
// X4 - register to print
// W5 - loop index
// W6 - current character
// X8 - linux function number
//
.global _start      // Provide program starting address
_start: MOV      X4, #0x6E3A
     MOVK X4, #0x4F5D, LSL #16
    MOVK X4, #0xFEDC, LSL #32
    MOVK X4, #0x1234, LSL #48
    
    LDR X1, =hexstr // start of string
    ADD X1, X1, #17 // start at least sig digit
   // The loop is FOR W5 = 16 TO 1 STEP -1
     MOV    W5, #16     // 16 digits to print
loop:AND    W6, W4, #0xf // mask of least sig digit
     // If W6 >= 10 then goto letter
      CMP  W6, #10        // is 0-9 or A-F
      B.GE letter
    // Else its a number so convert to an ASCII digit
     ADD   W6, W6, #'0'
     B     cont  // goto to end if
letter: // handle the digits A to F
     ADD   W6, W6, #('A'-10)
cont:// end if
     STRB  W6, [X1] // store ascii digit
     SUB   X1, X1, #1 // decrement address for next digit
     LSR   X4, X4, #4 // shift off the digit
     // next W5
     SUBS   W5, W5, #1    // step W5 by -1
     B.NE   loop          // another for loop if not done
    // Setup the parameters to print our hex number
    // and then call Linux to do it.
     mov     X0, #1       // 1 = StdOut
     ldr     X1, =hexstr  // string to print
     mov     X2, #19  // length of our string
     mov     X8, #64  // linux write system call
     svc     0     // Call linux to output the string
     // Setup the parameters to exit the program
     // and then call Linux to do it.
     mov     X0, #0  // Use 0 return code
     mov     X8, #93  // Service code 93 terminates
     svc     0           // Call linux to terminate
.data
hexstr: .ascii  "0x123456789ABCDEFG\n"

Compile and execute the program:


$ as  printdword.s -o printdword.o
$ ld -o printdword printdword.o
$ ./printdword
0x1234FEDC4F5D6E3A
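For comparison, the same digit-by-digit conversion can be written in C. This is a sketch of the technique rather than a translation of printdword.s; to_hex is a hypothetical name, and the caller is assumed to supply a buffer of at least 19 bytes.

```c
#include <stdint.h>

/* Write "0x", 16 upper-case hex digits and a terminating NUL into out,
 * filling from the last digit backward like the loop in printdword.s. */
void to_hex(uint64_t value, char out[19])
{
    int i;

    out[0] = '0';
    out[1] = 'x';
    out[18] = '\0';
    for (i = 17; i >= 2; i--) {
        unsigned digit = (unsigned)(value & 0xF);  /* lowest 4 bits */
        out[i] = (char)(digit < 10 ? '0' + digit : 'A' + digit - 10);
        value >>= 4;                               /* shift off the digit */
    }
}
```

The least significant digit lands at index 17, which corresponds to the ADD X1, X1, #17 in the assembly version.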

Using Expressions in Immediate Constants


ADD   W6, W6, #('A'-10)

Storing a Register to Memory


STRB W6, [X1]

The store byte (STRB) instruction saves the low-order byte of the first register into the memory location contained in X1.
The syntax [X1] is to make clear that we are using memory indirection, and not just putting the byte into register X1.

Why Not Print in Decimal

Performance of Branch Instructions

If you put a lot of branches in your code, you suffer a performance penalty.

More Comparison Instructions

Summary

Chapter 5: Thanks for the Memories

how to define data in memory
how to load memory into registers for processing
how to write the results back to memory

Memory addresses are 64 bits while instructions are 32 bit.

Defining Memory Contents

The GNU Assembler contains several directives to help you define memory in a .data section of your program.
Some sample memory directives:


label:
       .byte 74, 0112, 0b00101010, 0x4A, 0X4a, 'J', 'H' + 2
       .word 0x1234ABCD, -1434
       .quad 0x123456789ABCDEF0
       .ascii      "Hello World\n"

The .byte statement defines 1 or more bytes of memory.
The list of memory definition Assembler directives,

Aligning Data

These data directives put the data in memory contiguously byte by byte.
We can instruct the Assembler to align the next piece of data with an .align directive.
For ex.,


.data
     .byte    0x3F
     .align   4
     .word   0x12345678

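The first item is only 1 byte, so without the directive the next word of data would not be aligned.
An .align directive before the word fixes this.
Note that on ARM targets the GNU Assembler treats the operand as a power of two: “.align 2” aligns to 4 bytes (word alignment, three wasted bytes here), while “.align 4” aligns to 16 bytes (fifteen wasted bytes here).
ARM Assembly instructions must be word aligned.
Usually the Assembler will give you an error when alignment is required, and throwing in an “.align 4” directive is a quick fix.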
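The padding an alignment directive inserts can be computed by rounding the current offset up to the next multiple of the alignment. A minimal C sketch follows, assuming the operand is a power-of-two exponent as the GNU Assembler uses on ARM targets; align_up is a hypothetical helper name.

```c
#include <stdint.h>

/* Round addr up to a multiple of 2^power, the way ".align power" pads
 * the location counter on ARM targets. */
uint64_t align_up(uint64_t addr, unsigned power)
{
    uint64_t mask = ((uint64_t)1 << power) - 1;
    return (addr + mask) & ~mask;
}
```

After a single byte at offset 0 the location counter is 1, so align_up(1, 2) gives 4 (three padding bytes) and align_up(1, 4) gives 16.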

Loading a Register with an Address

PC Relative Addressing

Addresses can be represented as a register-relative or PC-relative expression.

A register-relative expression evaluates to a named register combined with a numeric expression.
A PC-relative expression is written in source code as the PC or a label combined with a numeric expression.

PC relative addressing


[PC, #number]


        LDR     r4,=data+4*n    ; n is an assembly-time variable
        ; code
        MOV     pc,lr
data    DCD     value_0
        ; n-1 DCD directives
        DCD     value_n         ; data+4*n points here
        ; more DCD directives


LDR   X1, =helloworld

Loading Data from Memory

The simple form of LDR to load data given an address is


LDR{type}   Xt, [Xa]

where type is one of the types:

The typical usage is to load an address into a register and then use that address to load the data we want:


// load the address of mynumber into X1
      LDR   X1, =mynumber
// load the quadword stored at mynumber into X2
      LDR   X2,[X1]
      
.data
mynumber:   .QUAD 0x123456789ABCDEF0

This loads 0x123456789ABCDEF0 into X2.

Note the square bracket syntax represents indirect memory access.
This means load the data stored at the address pointed to by X1, not move the contents of X1 into X2.

Indexing Through Memory

The ARM instruction set gives us support for the array indexing operation.
Suppose we have an array of 10 words (4 bytes each) defined:


arr1:   .FILL   10, 4, 0

      LDR    X1, =arr1                   // load the array’s address
      // Load the first element
      LDR    W2, [X1]
      // Load element 3
      // The elements count from 0, so 2 is
      // the third one. Each word is 4 bytes,
      // so we need to multiply by 4
      LDR    W2, [X1, #(2 * 4)]

Using a register as an offset


// The 3rd element is still number 2
      MOV   X3, #(2 * 4)
// Add the offset in X3 to X1 to get our element.
      LDR   W2, [X1, X3]

If X1 points to the end of the array, we can index backward using negative offsets:


LDR   W2, [X1, #-(2 * 4)]
MOV   X3, #(-2 * 4)
LDR   W2, [X1, X3]

Post-Indexed Addressing:


// Load X1 with the memory pointed to by X2
// Then do X2 = X2 + 2
   LDR   X1, [X2], #2
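These addressing modes correspond closely to C pointer idioms. The sketch below shows an offset access and a post-incremented access on an array of 32-bit words; the function names are hypothetical, and the instruction in each comment is only the rough equivalent.

```c
#include <stddef.h>
#include <stdint.h>

/* Element i lives at byte offset i * 4 from the base address. */
uint32_t element_at(const uint32_t *base, size_t i)
{
    return base[i];       /* roughly LDR W2, [X1, X3] with X3 = i * 4 */
}

/* Sum count words, advancing the pointer after each load. */
uint32_t sum_words(const uint32_t *p, size_t count)
{
    uint32_t total = 0;

    while (count-- > 0)
        total += *p++;    /* roughly LDR W2, [X1], #4 (post-indexed) */
    return total;
}
```

In C the compiler scales the index by the element size, whereas in assembly the multiplication by 4 is written out by hand.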

An Example Converting to Upper-Case

Pseudo-code:


i= 0
DO
    char = inStr[i]
    IF char >= 'a' AND char <= 'z' THEN
          char = char - ('a' - 'A')
    END IF
    outStr[i] = char
    i=i+ 1
UNTIL char == 0
PRINT outStr

In this example, NULL-terminated strings are used. The input string is not changed; a new output string holding the upper-case version of the input string is generated.
upper.s:


//
// X0-X2 - parameters to Linux function services
// X3 - address of output string
// X4 - address of input string
// W5 - current character being processed
// X8 - linux function number
//
.global _start // Provide program starting address to linker
_start: LDR   X4, =instr      // start of input string
        LDR   X3, =outstr     // address of output string
// The loop runs until the byte loaded through X4 is zero
loop:   LDRB  W5, [X4], #1    // load character and incr pointer
// If W5 > 'z' then goto cont
        CMP   W5, #'z'        // is letter > 'z'?
        B.GT  cont
// Else if W5 < 'a' then goto end if
        CMP   W5, #'a'
        B.LT  cont            // goto to end if
// if we got here then the letter is lower case, so convert it.
        SUB   W5, W5, #('a'-'A')
cont:   // end if
        STRB  W5, [X3], #1    // store character to output str
        CMP   W5, #0          // stop on hitting a null character
        B.NE  loop            // loop if character isn't null
// Setup the parameters to print our string
// and then call Linux to do it.
        MOV   X0, #1          // 1 = StdOut
        LDR   X1, =outstr     // string to print
        SUB   X2, X3, X1      // get the len by sub'ing the pointers
        MOV   X8, #64         // Linux write system call
        SVC   0               // Call Linux to output the string
// Setup the parameters to exit the program
// and then call Linux to do it.
        MOV   X0, #0          // Use 0 return code
        MOV   X8, #93         // Service code 93 terminates
        SVC   0               // Call Linux to terminate the program
.data
instr:  .asciz "This is our Test String that we will convert.\n"
outstr: .fill 255, 1, 0

Compile and run the program:


$ as   upper.s -o upper.o
$ ld -o upper upper.o
$ ./upper
THIS IS OUR TEST STRING THAT WE WILL CONVERT.

LDR and STR just load and save; they don’t have functionality to examine what they are loading or saving, so they can’t set the condition flags, hence the need for the CMP instruction in the UNTIL part of the loop to test for NULL.
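The pseudo-code for the conversion can also be expressed in C, which makes the explicit null test at the bottom of the loop easy to see. This is a sketch; to_upper_copy is a hypothetical name and the caller is assumed to provide an output buffer that is large enough.

```c
#include <stddef.h>

/* Copy in to out, converting 'a'..'z' to upper case. Returns the number
 * of bytes written, including the terminating NUL. */
size_t to_upper_copy(const char *in, char *out)
{
    size_t i = 0;
    char ch;

    do {
        ch = in[i];                          /* load the next character */
        if (ch >= 'a' && ch <= 'z')
            ch = (char)(ch - ('a' - 'A'));   /* convert lower to upper case */
        out[i] = ch;                         /* store it, NUL included */
        i++;
    } while (ch != '\0');                    /* explicit test, like CMP W5, #0 */
    return i;
}
```

The return value plays the role of the pointer subtraction in upper.s, which yields the number of bytes stored.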

Storing a Register

The STR instruction is a mirror of the LDR instruction.

Double Registers

There are versions of the LDR and STR instructions that transfer a pair of registers: LDP and STP.
For example, the following loads the address of a 128-bit quantity (the address is still 64 bits) and then loads the 128 bits into X2 and X3. It then stores X2 and X3 back into myoctaword:


      LDR   X1, =myoctaword
      LDP   X2, X3, [X1]
      STP   X2, X3, [X1]
.data
myoctaword: .OCTA 0x12345678876543211234567887654321

These instructions are used extensively when we need to save registers to the stack and later restore them.

Summary

Chapter 6: Functions and the Stack

Stacks on Linux

Branch with Link

Nesting Function Calls

Function Parameters and Return Values

Managing the Registers

Summary of the Function Call Algorithm

Upper-Case Revisited

Stack Frames

Stack Frame Example

Macros

Include Directive

Macro Definition

Labels

Why Macros

Macros to Improve Code

Summary

Chapter 7: Linux Operating System Service

So Many Services

Calling Convention

Linux System Call Numbers

Return Codes

Structures

Wrappers

Converting a File to Upper-Case

Building .S Files

Opening a File

Error Checking

Looping

Summary

Chapter 8: Programming GPIO Pins

We can program the GPIO pins in two ways:

by using the Linux device driver
by accessing the GPIO controller’s registers directly

GPIO Overview

On the Raspberry Pi, pins 3, 5, 7–8, 10–13, 15, 16, 18, 19, 21–24, and 26 are programmable general-purpose pins.

In Linux, Everything Is a File

Flashing LEDs

Moving Closer to the Metal

Virtual Memory

In Devices, Everything Is Memory

Registers in Bits

GPIO Function Select Registers

GPIO Output Set and Clear Registers

More Flashing LEDs

Root Access

Table Driven

Setting Pin Direction

Setting and Clearing Pins

Summary

Chapter 9: Interacting with C and Python

Calling C Routines

Printing Debug Information

Adding with Carry Revisited

Calling Assembly Routines from C

Packaging Our Code

Static Library

Shared Library

Embedding Assembly Code Inside C Code

Calling Assembly from Python

Summary

Chapter 10: Interfacing with Kotlin and Swift

Chapter 11: Multiply, Divide, and Accumulate

Chapter 12: Floating-Point Operations

Chapter 13: Neon Coprocessor

Chapter 14: Optimizing Code

Chapter 15: Reading and Understanding Code

Chapter 16: Hacking Code

Appendix A: The ARM Instruction Set

Appendix B: Binary Formats

Appendix C: Assembler Directive

Appendix D: ASCII Character Set

ARM (32-bits) assembler in Raspberry Pi

1 Introduction

2 Registers and basic arithmetic

3 Memory, addresses. Load and store.

4 GDB

5 Branches

6 Control structures

7 Indexing modes

8 Arrays and structures and more indexing modes.

9 Functions (I)

10 Functions (II). The stack

11 Predication

12 Loops and the status register

13 Floating point numbers

14 Matrix multiply

15 Integer division

16 Switch control structure

17 Passing data to functions

18 Local data and the frame pointer

19 The operating system

20 Indirect calls

21 Subword data

22 The Thumb instruction set

23 Nested functions

24 Trampolines

25 Integer SIMD

26 A primer about linking

27 Dynamic linking

Introduction to Computer Organization: ARM Assembly Language Using the Raspberry Pi

Robert G. Plantz

Chapter 1 Introduction

This book begins with the fundamental high-level language concepts and “looks under the hood” to see how they are implemented at the assembly language level.

There are many challenging opportunities in programming embedded systems, and much of the work in this area demands at least an understanding of the ISA (instruction set architecture).

1.1 Efficient Use of This Book

1.2 Computer Subsystems

The von Neumann architecture: both the program instructions and data are stored in a memory unit that is separate from the processing unit.
We will focus on how the program and data are stored in memory and how the CPU executes instructions.

1.3 How the Subsystems Interact

The buses shown here are logical groupings of the signals that must pass between the three subsystems.
For example, the PCI bus standard uses the same physical pathway for the address and the data, but at different times.
Control signals indicate whether there is an address or data on the lines at any given time.

If the CPU is instructed to store data in memory, it places the data on the data bus, places the location in memory where the data is to be stored on the address bus, and places a “write” signal on the control bus. The memory subsystem responds by copying the data on the data bus into the specified memory location.

1.4 Setting Up Your Raspberry Pi

Install the binutils-doc package to get full documentation for the GNU assembler, as.

Chapter 2 Data Storage Formats

2.1 Bits and Groups of Bits

2.2 Exercises

2.3 Mathematical Equivalence of Binary and Decimal

2.4 Exercises

2.5 Unsigned Decimal to Binary Conversion

2.6 Exercises

2.7 Memory

2.8 Exercises

2.9 Using C Programs to Explore Data Formats

2.10 Programming Exercises

2.11 Examining Memory With a Debugger


/* intAndFloat.c
 * Using printf to display an integer and a float.
 * 2017-09-29: Bob Plantz
 */
#include <stdio.h>

int main(void)
{
  int anInt = 19088743;
  float aFloat = 19088.743;

  printf("The integer is %d and the float is %f\n", anInt, aFloat);

  return 0;
}

Build the example, then run gdb:


$ gcc -g -Wall -o intAndFloat intAndFloat.c

$ gdb ./intAndFloat
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
...

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./intAndFloat...
(gdb)

gdb has a large number of commands.
The few here will be sufficient to get you started:

li LineNumber


(gdb) li
1	/* intAndFloat.c
2	 * Using printf to display an integer and a float.
3	 * 2017-09-29: Bob Plantz
4	 */
5	#include <stdio.h>
6	
7	int main(void)
8	{
9	  int anInt = 19088743;
10	  float aFloat = 19088.743;     
(gdb) 
11	
12	  printf("The integer is %d and the float is %f\n", anInt, aFloat);
13	
14	  return 0;
15	}
16

Pressing return repeats the previous command.

br source-filename:line-number


(gdb) br 12
Breakpoint 1 at 0x798: file intAndFloat.c, line 12.


(gdb) r
Starting program: /home/pi/intAndFloat 

Breakpoint 1, main () at intAndFloat.c:12
12	  printf("The integer is %d and the float is %f\n", anInt, aFloat);

print Expression


(gdb) print anInt
$1 = 19088743
(gdb) print aFloat
$2 = 19088.7422
(gdb) printf "anInt = %i and aFloat = %f\n", anInt, aFloat
anInt = 19088743 and aFloat = 19088.742188

help command


(gdb) help x
Examine memory: x/FMT ADDRESS.
ADDRESS is an expression for the memory address to examine.
FMT is a repeat count followed by a format letter and a size letter.
Format letters are o(octal), x(hex), d(decimal), u(unsigned decimal),
  t(binary), f(float), a(address), i(instruction), c(char), s(string)
  and z(hex, zero padded on the left).
Size letters are b(byte), h(halfword), w(word), g(giant, 8 bytes).
The specified number of objects of the specified size are printed
according to the format.  If a negative number is specified, memory is
examined backward from the address.

Defaults for format and size letters are those previously used.
Default count is 1.  Default address is following last thing printed
with this command or "print".

x/FMT MemoryAddress


(gdb) print &anInt
$3 = (int *) 0x7ffffff3dc
(gdb) print &aFloat
$4 = (float *) 0x7ffffff3d8

(gdb) x/1dw 0x7ffffff3dc
0x7ffffff3dc:	19088743
(gdb) x/1fw 0x7ffffff3d8
0x7ffffff3d8:	19088.7422
(gdb) x/1xw 0x7ffffff3dc
0x7ffffff3dc:	0x01234567
(gdb) x/4xb 0x7ffffff3dc
0x7ffffff3dc:	0x67	0x45	0x23	0x01

cont
i r
printf "format", var1, var2,…
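The byte-by-byte dump shows the integer stored with its least significant byte at the lowest address. The sketch below rebuilds the value from those four bytes in C; from_le_bytes is a hypothetical helper name, and it works the same on any host because it only uses shifts.

```c
#include <stdint.h>

/* Rebuild a 32-bit value from four bytes stored lowest address first,
 * i.e. in the little-endian order displayed by x/4xb. */
uint32_t from_le_bytes(const uint8_t b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```

Feeding it the bytes 0x67, 0x45, 0x23, 0x01 gives 0x01234567, which is 19088743 in decimal.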

2.12 Programming Exercise

2.13 Storing Characters

2.14 Programming Exercise

2.15 Low-level Character Handling

2.16 Programming Exercises

2.17 Accessing the GPIO in C

Chapter 3 Computer Arithmetic

3.1 Addition and Subtraction

3.2 Exercises

3.3 Arithmetic Errors—Unsigned Integers

Use four-bit values to simplify the discussion.
Consider addition of the two unsigned integers, 2 and 4:


  0010      0100      0100
+ 0100    + 1110    - 1110
------    ------    ------
  0110      0010      0110

Carry=0   Carry=1   Carry=1

These four-bit arithmetic examples generalize to any size arithmetic performed by the computer.
When adding or subtracting two unsigned integers, the result is arithmetically correct if and only if the carry condition flag (C) is set to zero.
The C flag in the CPSR register is always set to the appropriate value, 0 or 1, each time an addition or subtraction is performed by the CPU.
In particular, the CPU will not ignore the C flag when there is no carry; it will actively set it to zero.
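The four-bit additions above can be simulated in C by keeping the fifth bit of the sum as the carry. This is a small sketch; struct sum4 and add4 are hypothetical names.

```c
/* Result of a 4-bit unsigned addition. */
struct sum4 {
    unsigned value;  /* low 4 bits of the sum */
    unsigned carry;  /* 1 if the true sum did not fit in 4 bits */
};

struct sum4 add4(unsigned a, unsigned b)
{
    unsigned full = (a & 0xF) + (b & 0xF);  /* at most 5 significant bits */
    struct sum4 r;

    r.value = full & 0xF;
    r.carry = full >> 4;
    return r;
}
```

add4(2, 4) gives 6 with carry 0, while add4(4, 14) gives 2 with carry 1, signalling that the four-bit result is not the arithmetically correct sum.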

3.4 Signed Integers

3.5 Exercises

3.6 Arithmetic Errors—Signed Integers

The number of bits used to represent a value is determined at the time a program is written.
The flags register, CPSR, provides a bit, the overflow condition flag, V, for detecting whether the sum of two n-bit, signed numbers stored in the two's complement code has exceeded the range allocated for it.


  1             >-- penultimate carry
  0001 0101
+ 0110 1111
---------
  1000 0100
 Carry=0

The V flag is equal to the exclusive or of carry and penultimate carry:


V = C  ^ penultimate carry

where ‘^’ is the exclusive or operator.

The CPU does not consider integers as either signed or unsigned.

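If your algorithm treats the result as unsigned, the C flag tells you whether the result is correct.
If your algorithm treats the result as signed, the V flag tells you whether the result is correct.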

Both C and V are set according to the rules of binary arithmetic by each arithmetic operation.
After each addition or subtraction operation the program should check the state of C for unsigned integers or V for signed integers and at least indicate when the sum is in error.

3.7 Exercises

Chapter 4 Basic Data Types

4.1 C/C++ Basic Data Types

4.2 Hexadecimal to Integer Conversion

4.3 Programming Exercise

4.4 Bitwise Logical Operations

4.5 Programming Exercise

4.6 Other Codes

Chapter 5 Boolean Algebra

5.1 Boolean Algebra Operations

5.2 Exercises

5.3 Canonical (Standard) Forms

5.4 Exercise

5.5 Boolean Function Minimization

Chapter 6 Logic Gates

6.1 Crash Course in Electronics

6.2 CMOS Transistors

6.3 NAND and NOR Gates

6.4 Exercise

Chapter 7 Logic Circuits

7.1 Combinational Logic Circuits

7.2 Programmable Logic Devices

7.3 Sequential Logic Circuits

7.4 Designing Sequential Circuits

7.5 Memory Organization

Chapter 8 Central Processing Unit

ARM CPUs used in different Raspberry Pi models.

The 64-bit ARM processor in the Raspberry Pi 3 B can be run in either AARCH32 (32-bit) or AARCH64 (64-bit) state.

8.1 Overview

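CPU block diagram. The CPU communicates with the Memory and I/O subsystems via the Address, Data, and Control buses. The major subsystems of the CPU are:

Program Counter: holds the address of the next instruction to be executed
L1 Cache Memory
Instruction Register: holds the instruction currently being executed
Control Unit
Registers: a small number of named memory locations inside the CPU
Arithmetic Logic Unit (ALU)
Bus Interface: connects the CPU to the external bus control units
Condition Flags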

8.2 CPU Registers

A portion of the memory in the CPU is organized into registers. Machine instructions access CPU registers by their addresses.

The registers are inside the CPU, and the assembler has predefined names for them.
Applications programmers have access to 16 integer registers in the AARCH32 (32-bit) state, r0 through r15.
The names of the registers and their usage in AARCH32 state are summarized below:

  
Register     Register
Name         Number    Usage
----------------------------------------------------
r0–r10       0–10      General Purpose
r11 or fp    11        Frame Pointer
r12 or ip    12        Intra-procedure-call scratch
r13 or sp    13        Stack Pointer
r14 or lr    14        Link Register
r15 or pc    15        Program Counter

In AARCH64 (64-bit) state, applications programmers have access to 31 integer registers, x0 through x30.

  
Full 64-bit         Low 32-bit       Register
Register Name       Register Name    Number     Usage
---------------------------------------------------------------
r0–r30 or x0–x30    w0–w30           0–30       General Purpose
sp                  wsp              31         Stack Pointer
xzr                 wzr              virtual    Zero Register

Using wn, where n = 0, 1, …, 30, refers to the low-order 32-bit portion of the corresponding 64-bit register.

If an instruction reads these 32 bits from the register, bits 63–32 are ignored, and if an instruction writes to the 32 bits, bits 63–32 are set to zero.
Many instructions can access one byte in a register, which consists of bits 7–0 of the specified register, or two bytes at a time, which consists of bits 15–0. This is specified in the instruction, not in the register name.

8.3 CPU Interaction with Memory

To store one byte, 0xcd, at location 0x7efff174, the control unit:

places 0x7efff174 on the address bus
places 0xcd on the data bus, and then
places a “write” signal on the control bus.

8.4 Program Execution in the CPU

The CPU is programmed via the instruction register — whose bit pattern determines what the CPU will do.
Once that action has been completed, the bit pattern in the instruction register can be changed, and the CPU will perform the operation specified by this next bit pattern.

Most modern CPUs use an instruction queue.
Several instructions are waiting in the queue, ready to be executed.
Since instructions are simply bit patterns, they can be stored in memory.
The program counter (also called the instruction pointer) always holds the memory address of (points to) the next instruction to be executed.
In order for the control unit to execute this instruction, it is copied into the instruction register.

The scenario is:

1. A sequence of instructions is stored in memory.
2. The memory address where the first instruction is located is copied to the program counter.
3. The CPU sends the address in the program counter to memory via the address bus.
4. Memory responds by sending a copy of the state of the bits at that memory location on the data bus, which the CPU then copies into its instruction register.
5. The program counter is automatically incremented to contain the address of the next instruction in memory.
6. The CPU executes the instruction in the instruction register.
7. Go to step 3.

Steps 3, 4, and 5 are called an instruction fetch.
Steps 3–7 make up a cycle, the instruction execution cycle.

The wfi (“wait for interrupt”) instruction places the CPU in an idle state, where it remains until an I/O device sends an interrupt signal to the CPU.
For now, it is enough to understand that the wfi instruction stops the instruction execution cycle.

The instructions for a program are stored in a file.
When you indicate to the operating system that you wish to execute a program, the operating system locates a region of memory large enough to hold the instructions in the program, and then copies them from the file to memory.

8.5 Using gdb to View the CPU Registers

We will use the following program to illustrate the use of gdb to view the contents of the CPU registers.


/* gdbExample1.c
 * Subtracts one from user integer.
 * Demonstrate use of gdb to examine registers, etc.
 * 2017-09-29: Bob Plantz
 */

#include <stdio.h>

int main(void)
{
  register int wye;
  int *ptr;
  int ex;

  ptr = &ex;
  ex = 305441741;
  wye = -1;
  printf("Enter an integer: ");
  scanf("%i", ptr);
  wye += *ptr;
  printf("The result is %i\n", wye);

  return 0;
}

Compile the program for gdb debugging:


$ gcc -g -O0 -Wall -o gdbExample1 gdbExample1.c

The “-g” option tells the compiler to include debugger information in the executable program.
The “-Wall” option causes the compiler to warn you about many constructions that might be a programming error.

The program uses the register storage class modifier to request that the compiler use a CPU register for the int variable wye.


$ gdb ./gdbExample1

Some additional gdb commands will be useful in this section.

The li linenumber command lists ten lines of source code centered around the specified line number.


(gdb) li 11
6	
7	#include <stdio.h>
8	
9	int main(void)
10	{
11	  register int wye;
12	  int *ptr;
13	  int ex;
14	
15	  ptr = &ex;

Set a breakpoint at line 18, then run the program:


(gdb) br 18
Breakpoint 1 at 0x10478: file gdbExample1.c, line 18.
(gdb) run
Starting program: /home/pi/gdbExample1 

Breakpoint 1, main () at gdbExample1.c:18
18	  printf("Enter an integer: ");

Use the print command to view the value of a variable and its address:


(gdb) print ex
$1 = 305441741
(gdb) print &ex
$2 = (int *) 0x7efff430

The help command will provide very brief instructions on using a command.


(gdb) help x
Examine memory: x/FMT ADDRESS.
ADDRESS is an expression for the memory address to examine.
FMT is a repeat count followed by a format letter and a size letter.
Format letters are o(octal), x(hex), d(decimal), u(unsigned decimal),
  t(binary), f(float), a(address), i(instruction), c(char), s(string)
  and z(hex, zero padded on the left).
Size letters are b(byte), h(halfword), w(word), g(giant, 8 bytes).
The specified number of objects of the specified size are printed
according to the format.  If a negative number is specified, memory is
examined backward from the address.

Defaults for format and size letters are those previously used.
Default count is 1.  Default address is following last thing printed
with this command or "print".

Examine memory in different formats:

  
(gdb) x/1dw 0x7efff430
0x7efff430:	305441741
(gdb) x/1xw 0x7efff430
0x7efff430:	0x1234abcd
(gdb) x/4xb 0x7efff430
0x7efff430:	0xcd	0xab	0x34	0x12

0xcd is stored in the byte at address 0x7efff430
0xab is stored in the byte at address 0x7efff431
0x34 is stored in the byte at address 0x7efff432
0x12 is stored in the byte at address 0x7efff433

Examine variables


(gdb) print ptr
$2 = (int *)  0x7efff430
(gdb) print &ptr
$3 = (int **) 0x7efff504

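Because wye was declared with the register modifier, it has no memory address; gdb reports the register that holds it instead:


(gdb) print wye
$4 = -1
(gdb) print &wye
Address requested for identifier "wye" which is in register $r4

The i r (info registers) command displays the current contents of the CPU registers:


(gdb) i r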
r0             0x1                 1
r1             0x7efff674          2130703988
r2             0x7efff67c          2130703996
r3             0x1234abcd          305441741
r4             0xffffffff          4294967295
r5             0x0                 0
r6             0x10368             66408
r7             0x0                 0
r8             0x0                 0
r9             0x0                 0
r10            0x76fff000          1996484608
r11            0x7efff514          2130703636
r12            0x7efff528          2130703656
sp             0x7efff500          0x7efff500
lr             0x76e6abe0          1994828768
pc             0x10478             0x10478 <main+32>
cpsr           0x60000010          1610612752
fpscr          0x0                 0

The first column is the name of the register.
The second shows the current bit pattern in the register, in hexadecimal. Notice that leading zeros are not displayed.
The third column shows the register contents in decimal, interpreted as a 32-bit unsigned integer; sp and pc are shown as addresses instead.

8.6 Programming Exercises

Chapter 9 Programming in Assembly Language

9.1 Program Organization


/* doNothingProg1.c
 * The minimum components of a C program.
 * 2017-09-29: Bob Plantz
 */

int main(void)
{
  return 0;
}

Use the -S command line option to look at the assembly language that the compiler produces:


$ gcc -S -O0 doNothingProg1.c

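-S tells gcc to stop after compiling and write the assembly language to a file, here doNothingProg1.s.
-O0 turns off optimization.

The gcc-generated assembly code is not easy to read: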


        .arch armv6
        .eabi_attribute 28, 1
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 2
        .eabi_attribute 30, 6
        .eabi_attribute 34, 1
        .eabi_attribute 18, 4
        .file   "doNothingProg1.c"
        .text
        .align  2
        .global main
        .arch armv6
        .syntax unified
        .arm
        .fpu vfp
        .type   main, %function
main:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 1, uses_anonymous_args = 0
        @ link register save eliminated.
        str     fp, [sp, #-4]!
        add     fp, sp, #0
        mov     r3, #0
        mov     r0, r3
        add     sp, fp, #0
        @ sp needed
        ldr     fp, [sp], #4
        bx      lr
        .size   main, .-main
        .ident  "GCC: (Raspbian 10.2.1-6+rpi1) 10.2.1 20210110"
        .section        .note.GNU-stack,"",%progbits

Use this programmer's version for investigation:


@ doNothingProg2.s
@ Minimum components of a C program, in assembly language.
@ 2017-09-29: Bob Plantz 

@ Define my Raspberry Pi
        .cpu    cortex-a53
        .fpu    neon-fp-armv8
        .syntax unified         @ modern syntax

@ Program code
        .text
        .align  2
        .global main
        .type   main, %function
main:
        str     fp, [sp, -4]!   @ save caller frame pointer
        add     fp, sp, 0       @ establish our frame pointer

        mov     r3, 0           @ return 0;
        mov     r0, r3          @ return values go in r0

        sub     sp, fp, 0       @ restore stack pointer
        ldr     fp, [sp], 4     @ restore caller's frame pointer
        bx      lr              @ back to caller

The assembly language is line-oriented. That is, there is only one assembly language statement on each line, and none of the statements spans more than one line.
The following assembly language statement is equivalent to the machine language instruction 0xe3a03000:


mov r3, 0

Next, notice that the pattern of each assembly line falls into one of three categories:

Comments
Blank lines
Statements


label:    operation    operand(s)    @ comment

label
operation, which is either

An assembly language mnemonic
An assembler directive or pseudo op, which begins with a period (‘.’)

operand
comment

The rules for creating an identifier are very similar to those for C/C++.
Identifiers are called Symbol Names, and case is significant.

Compiler-generated labels begin with the ‘.’ character.
Many system-related names begin with the ‘_’ character.

Assembler Directives

Assembler directives are directions to the assembler to take some action or change a setting.
Assembler directives do not represent instructions, and are not translated into machine code.

For this assembler, directives begin with a “.” (a line that begins with “#” is treated as a comment), and each directive must be on a separate line from any other assembler directive or assembler instruction.
There are 4 main assembler directives:

.text

text segment

sections

.data

data segment

.label
.number

GNU/Linux divides memory into different segments for specific purposes when a program is loaded from the disk. The four general categories are:

Text Segment

Data Segment

Stack Segment

Heap Segment: memory allocated at run time, for example with malloc

The operating system needs to view an ELF file as a set of segments. One of the functions of the ld program is to group ELF sections together into segments so that they can be loaded into memory.
When the operating system loads the program into memory, it uses the segment view of the ELF file. Thus, for example, the contents of all the text sections will be loaded into the text segment of the program process.
The readelf program is also useful for learning about ELF files.

The AArch32 target selection directives specify code generation parameters for AArch32 targets.
The following three directives identify the characteristics of the ARM processor this code will run on:


.cpu     cortex-a53
.fpu     neon-fp-armv8
.syntax unified         @ modern syntax

There are many variations of the ARM architecture, and the assembler needs to know which one this code is intended for. The appropriate values for each directive for the various Raspberry Pi models are given below:

Raspberry Pi                 .cpu            .fpu
----------------------------------------------------------
Pi Zero, Pi 1 A+, Pi 1 B+    arm1176jzf-s    vfp
Pi 2 B                       cortex-a7       neon-vfpv4
Pi 3 B                       cortex-a53      neon-fp-armv8

The first assembler directive in the text segment has one operand, 2:


.align  2

For the ARM, this tells the assembler to ensure that the lowest two bits of the starting address of the generated code are zero.
That is, the address is adjusted, incremented if necessary, to be a multiple of four.
Each machine instruction is four bytes long, so this ensures proper alignment of the instructions in memory.

The .global directive makes the name globally known, so that code outside this file can refer to it.


.global  main

When a program is executed, the operating system does some preliminary set up of system resources. It then starts program execution by calling a function named “main,” so the name must be global in scope.

The following declares the label, main, as the name of a function in the program.


.type   main, %function

The .file directive simply identifies the original C source code file:


.file   "doNothingProg1.c"

The .size directive gives the number of bytes in the code, and the .ident directive lists the version of the compiler that produced this assembly language.

These directives are used to describe the characteristics of the statements that follow.
They are not translated into actual machine instructions, and none of them occupy any memory in the finished program.

9.2 First Assembly Language Instructions

To see the details of an instruction, you need to read the ARM manuals:

ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition, for 32-bit
ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile, for 64-bit

The ARM actually provides a second instruction set called “Thumb.” It allows for either 16-bit or 32-bit instructions.
I will use ‘%’ to add my comments.

9.2.1 Some Notation

The syntax that ARM uses for their assembly language is called Unified Assembler Language (UAL).
The assembler, as, recognizes the UAL syntax if you use the assembler directives to identify the ARM model correctly.
The version of gcc running on Raspbian at the time of writing (August 2016) uses pre-UAL syntax. The differences are minor.
For example, the compiler-generated assembly language uses a ‘#’ character to prefix each literal value:


        str     fp, [sp, #-4]!

But the UAL syntax specifies that the ‘#’ character is optional.
The ‘#’ character for immediate values will not be used in my examples in this book.
Using the UAL syntax when writing your own assembly language programs will become very important when we get to the floating-point instructions.

9.2.2 Condition Codes

Most AARCH32 ARM instructions have an option that allows you to specify that it will be executed only if a specific setting of the condition flags exists.
These settings are expressed by adding a mnemonic Condition Code to the instruction mnemonic.
Mnemonic suffixes for conditional execution of instructions. Meaning depends on whether the values are integers or floats:

The cond column shows the machine code.

9.2.3 Shift Options

Many ARM instructions include an option to shift one of the data values during the operation that the instruction performs.
Mnemonic codes for adding shifts to instructions. The ‘#’ is optional.

As an example of how the shifting syntax is used,


mov     r0, 12              @ store 12 in r0
mov     r1, 60              @ store 60 in r1
add     r2, r0, r1, lsl 2   @ r2 = r0 + (r1 << 2); the shifted value is 240, r1 is unchanged

would store 252 in r2.

To let the amount of the shift be under program control, put the shift amount in a register:


mov     r0, 12
mov     r1, 60
mov     r3, 2
add     r2, r0, r1, lsl r3

9.2.4 First Instructions

Even though the program does nothing, it uses six different instructions.


MOV{S}{<c>}   <Rd>, #<const>           % immediate
MOV{S}{<c>}   <Rd>, <Rm>               % register

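<c> is the optional condition code
<Rd> is the destination register
<Rm> is the source register
<const> is a constant (immediate) value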


MVN{S}{<c>}   <Rd>, #<const>           % immediate
MVN{S}{<c>}   <Rd>, <Rm>{, <shift>}    % register
MVN{S}{<c>}   <Rd>, <Rm>, <type> <Rs>  % register-shifted register


ADD{S}{<c>}   {<Rd>,} <Rn>, #<const>           % immediate
ADD{S}{<c>}   {<Rd>,} <Rn>, <Rm>{, <shift>}    % register
ADD{S}{<c>}   {<Rd>,} <Rn>, <Rm>, <type> <Rs>  % register-shifted register


SUB{S}{<c>}   {<Rd>,} <Rn>, #<const>           % immediate
SUB{S}{<c>}   {<Rd>,} <Rn>, <Rm>{, <shift>}    % register
SUB{S}{<c>}   {<Rd>,} <Rn>, <Rm>, <type> <Rs>  % register-shifted register


BX{<c>}    <Rm>


LDR<c>  <Rt>, <label>                  % Label
LDR<c>  <Rt>, [<Rn>{, #+/-<imm>}]      % Offset
LDR<c>  <Rt>, [<Rn>, #+/-<imm>]!       % Pre-indexed
LDR<c>  <Rt>, [<Rn>], #+/-<imm>        % Post-indexed

<Rt> is the destination register, and <Rn> is the base register
<label> is a labeled memory address

label form
offset form
Pre-indexed form
Post-indexed form


STR<c>  <Rt>, <label>                  % Label
STR<c>  <Rt>, [<Rn>{, #+/-<imm>}]      % Offset
STR<c>  <Rt>, [<Rn>, #+/-<imm>]!       % Pre-indexed
STR<c>  <Rt>, [<Rn>], #+/-<imm>        % Post-indexed

<Rt> is the source register, and <Rn> is the base register.
<label> is a labeled memory address.

9.2.5 Code Walkthrough

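Each time a function is executed, it has a frame that represents the region of memory the function uses.
The system variable that points to the address where the current function's local variables begin is called the frame pointer.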

A call stack is composed of stack frames.
The stack frame at the top of the stack is for the currently executing routine, which can access information within its frame (such as parameters or local variables).
The stack frame usually includes at least the following items (in push order):

the arguments (parameter values)
the return address back to the routine's caller
space for the local variables of the routine (if any).

When a subroutine starts running, the frame pointer and the stack pointer contain the same address.
While the subroutine is active, the frame pointer points at the top of the stack (stacks grow downward).


str     fp, [sp, -4]!   @ save caller frame pointer

This instruction first determines a memory address by subtracting 4 from the address in the sp register, and updates the sp register to this new address.
It then stores the address in the fp register in memory at this new address.