ARM Embedded System

Embedded Systems with ARM Cortex-M Microcontrollers in Assembly Language and C (Third Edition)

Preface

The book introduces basic programming of ARM Cortex-M cores in assembly and C at the register level, and the fundamentals of embedded system design.
It presents basic concepts such as data representations (integer, fixed-point, floating-point), assembly instructions, stack, and implementing basic controls and functions of C language at the assembly level.
It covers advanced topics such as interrupts, mixing C and assembly, direct memory access (DMA), system timer (SysTick), multi-tasking, SIMD instructions for digital signal processing (DSP), and instruction encoding/decoding.
The book also gives detailed examples of interfacing peripherals, such as general purpose I/O (GPIO), LCD driver, keypad interaction, stepper motor control, PWM output, timer input capture, DAC, ADC, real-time clock (RTC), and serial communication (USART, I2C, SPI, and USB).

1. See a Program Running

This chapter shows how a program is generated and executed.

1.1 Translate a C Program into a Machine Program

Compilers first perform some analysis on the source program, and then create an intermediate representation (IR).
For C programs, the intermediate program is similar to an assembly program.
Finally, the compilers translate the assembly program into a machine program (a binary executable).
The binary machine program follows a standard called Executable and Linkable Format (ELF), which most ARM-based systems support.
ELF defines 2 interfaces:
  • a linkable interface
  • used at link time to combine multiple files
  • an executable interface
  • used at run time to create a process image when the program is loaded and executed.
The executable interface provides 2 separate logical views:
  • load view
  • The load view classifies the input sections into read-write and read-only regions.
  • execution view
  • The execution view provides information for the processor to load the executable at runtime.
    This depends on 4 critical segments:
    • a text segment
    • a read-only data segment
    • a read-write data segment
    • a zero-initialized data segment
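
As a rough illustration of the four segments, the sketch below annotates where a typical toolchain would place each C object. All names are made up for this example, and the section names in the comments (.text, .rodata, .data, .bss) are the common GCC conventions, assumed here rather than taken from the book.

```c
#include <stdint.h>

/* Read-only data segment: constants that never change (commonly .rodata). */
const uint32_t squares[4] = {0, 1, 4, 9};

/* Read-write data segment: globals with a non-zero initial value (commonly .data). */
uint32_t call_count = 100;

/* Zero-initialized data segment: globals without an initializer (commonly .bss). */
uint32_t history[4];

/* Text segment: the machine instructions of this function (commonly .text). */
uint32_t lookup_square(uint32_t i)
{
    uint32_t value = squares[i & 3u];   /* read from read-only data */
    history[i & 3u] = value;            /* write to zero-initialized data */
    call_count++;                       /* update read-write data */
    return value;
}
```

On a microcontroller, the initial value of `call_count` would have to be copied from flash to SRAM at startup, while `history` only needs to be cleared.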

1.2 Load a Machine Program into Memory

1.2.1 Harvard Architecture and Von Neumann Architecture

There are 2 types of architecture for memory access: Harvard and Von Neumann.
Because the data and instruction memory are small enough to fit in the same 32-bit memory address space, they often share the same memory address bus.
For ex., 256 KB of data memory and 4 KB of instruction memory can share the address bus.

1.2.2 Creating Runtime Memory Image

ARM Cortex-M3/M4/M7 processors use the Harvard architecture: the instruction memory (flash) and data memory (SRAM) are built into the processor chip.
A simple example shows how the Harvard architecture loads a program to start the execution.
When the processor boots successfully, the first instruction of the program is loaded from the instruction memory into the processor, and the program starts to run.

The memory map is usually pre-defined by the chip manufacturer and is not programmable.
For ex., an example memory map of the 4 GB memory space:

The processor allocates memory addresses for each internal or external peripheral.
Each peripheral has a set of registers and may contain a small memory; the processor maps the registers and memory of all peripherals to the same memory address space.
To interface with a peripheral, the processor uses regular memory access instructions to read or write the pre-defined addresses for this peripheral.
This method is called memory-mapped I/O.

1.3 Registers

All registers are of the same size and typically hold 16, 32, or 64 bits.
A processor core has 2 types of registers: general purpose and special purpose registers.

1.3.1 Reusing Registers to Improve Performance

Some data items are accessed more frequently.
Therefore, most compilers try to place the values of frequently or recently accessed data variables and memory addresses in registers for performance optimization.
Processor architecture design may also use caching and prefetching to improve performance.

The number of registers on a processor is often small, for two reasons:

  • registers always exhibit the highest temperature
  • more registers require more bits in each instruction to encode the register operands

2. Data Representation

3. ARM Instruction Set Architecture

4. Arithmetic and Logic

5. Load and Store

6. Branch and Conditional Execution

7. Structured Programming

8. Subroutines

9. 64-bit Data Processing

10. Mixing C and Assembly

11. Interrupt

12. Fixed-point and Floating-point Arithmetic

13. Instruction Encoding and Decoding

14. General-purpose I/O

15. General-purpose Timers

16. Stepper Motor Control

17. Liquid-crystal Display (LCD)

18. Real-time Clock (RTC)

19. Direct Memory Access (DMA)

20. Analog-to-Digital Converter (ADC)

21. Digital-to-Analog Converter (DAC)

22. Serial Communication Protocols

23. Multitasking

24. Digital Signal Processing

Appendix A: GNU Compiler

Short Lectures

1. Why use Two's Complement?

2. Carry flag for unsigned addition and subtraction

3. Overflow flag for signed addition and subtraction

4. C Pointer

5. Memory-mapped I/O

This short video explains what memory-mapped I/O is.
Usually, each on-chip peripheral device has a few registers, such as control registers, status registers, data input registers, and data output registers.
In general, there are 2 approaches to exchange data between the processor core and a peripheral device:
  • Port-mapped I/O
  • Port-mapped I/O uses special CPU instructions designed specifically for I/O operations, such as the in and out instructions found on microprocessors based on the x86 and x86-64 architectures.
  • Memory-mapped I/O
  • Each device register is assigned to a memory address in the memory address space of the microprocessor.
    The memory and registers of the I/O devices are mapped to (associated with) address values. So a memory address may refer to either a portion of physical RAM, or instead to memory and registers of the I/O device.
    Each I/O device monitors the CPU's address bus and responds to any CPU access of an address assigned to that device, connecting the data bus to the desired device's hardware register.
    To accommodate the I/O devices, some areas of the address space used by the CPU must be reserved for I/O and must not be available for normal physical memory.
    Memory-mapped I/O is performed by the native load and store instructions of the processor.
    
        LDR/STR Reg, [Reg, #imm]
        
Therefore, memory-mapped I/O is a more convenient way to interface with I/O devices.

Here is an example of memory-mapped I/O.

Suppose we want to set the output of a GPIO pin to high. Software can use the store instruction STR to set the corresponding bit in the GPIO output data register to 1.
When you write to the special memory location 0x48000014, the data you write is sent to the corresponding I/O device.

The memory address of ARM Cortex-M has a total of 32 bits, supporting 4GB of memory space.
The memory space is divided into six different pre-defined regions.

Each region has a recommended usage.
  • The 1st region is the code region
  • This is primarily used to store program code.
    It can also store data.
    The code region is on-chip memory, typically on-chip flash.
    The size of on-chip flash is limited to half a GB. The actual size of the on-chip flash varies among vendors and chips.
  • The 2nd region is SRAM
  • It is primarily used to store data, such as heaps and stacks.
    We can also put code here.
    It supports half a GB.
  • The 3rd region is the peripheral region
  • These peripherals include Advanced High-performance Bus (AHB) peripherals, such as GPIO and ADC, and Advanced Peripheral Bus (APB) peripherals, such as timers and UART.
    This region covers the memory addresses of all on-chip peripherals.
    Specific mapping addresses depend on vendors and chips.
  • The 4th region is for external devices
  • Such as an SD card.
  • The 5th region is external RAM
  • Executable region for data.
    It is off-chip memory, primarily used to store large data blocks.
    It has a total of 1GB.
  • The 6th region is the system region
  • This includes the NVIC, system timer, system control block, and vendor-specific memory.
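
To make the six regions concrete, here is a small sketch that classifies a 32-bit address. The boundaries are those of the default ARMv7-M memory map as far as I know it, and the type and function names are invented for this example. Note that in address order the external RAM region sits below the external device region.

```c
#include <stdint.h>

typedef enum {
    REGION_CODE,
    REGION_SRAM,
    REGION_PERIPHERAL,
    REGION_EXTERNAL_RAM,
    REGION_EXTERNAL_DEVICE,
    REGION_SYSTEM
} region_t;

/* Map an address to its pre-defined region by comparing against the
 * upper bound of each region, from the lowest address upwards. */
region_t region_of(uint32_t addr)
{
    if (addr < 0x20000000u) return REGION_CODE;             /* 0.5 GB */
    if (addr < 0x40000000u) return REGION_SRAM;             /* 0.5 GB */
    if (addr < 0x60000000u) return REGION_PERIPHERAL;       /* 0.5 GB */
    if (addr < 0xA0000000u) return REGION_EXTERNAL_RAM;     /* 1 GB   */
    if (addr < 0xE0000000u) return REGION_EXTERNAL_DEVICE;  /* 1 GB   */
    return REGION_SYSTEM;                                   /* 0.5 GB */
}
```

For instance, the address 0x48000014 used in this lecture falls into the peripheral region.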
We will use GPIO on STM32L4 as an example to illustrate the concept of memory-mapped I/O.
For ex., on STM32L4, the registers of GPIO Port A are mapped to a small memory region starting at 0x48000000.
Let's take a closer look at the memory map for GPIO Port A.
Each port has 12 registers, and each register has 4 bytes.
While a total of 1KB of space is reserved for Port A, only 48 bytes are used.
Within this 48-byte memory region, the GPIO mode register (MODER) is mapped to the lowest memory address, and the GPIO analog switch control register (ASCR) is mapped to the highest memory address.
If we want to set the output of pin 14 of GPIO port A to high, we need to set bit 14 of the output data register (ODR) of GPIO port A to 1.
The output data register (ODR) of Port A on STM32L4 is mapped to the memory addresses from 0x48000014 to 0x48000017.
If little endian is used, the highest memory address holds the most significant 8 bits, and the lowest memory address holds the least significant 8 bits.
This can be set using the following C statement:
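
A minimal sketch of such a statement is shown here. On the target, the pointer would be the fixed address 0x48000014; so that the sketch can also be tried on a desktop machine, an ordinary variable stands in for the register. The names `fake_gpioa_odr`, `GPIOA_ODR` and `set_pa14_high` are invented for illustration.

```c
#include <stdint.h>

/* Stand-in for the hardware register so the sketch runs off-target. */
volatile uint32_t fake_gpioa_odr;

/* On an STM32L4 this would instead be:
 *   #define GPIOA_ODR (*((volatile uint32_t *) 0x48000014))
 */
#define GPIOA_ODR (*((volatile uint32_t *) &fake_gpioa_odr))

void set_pa14_high(void)
{
    GPIOA_ODR |= 1UL << 14;   /* load ODR, set bit 14, store it back */
}
```

The other bits of the register keep their values because the OR only changes bit 14.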
A sequence of load, modify, and store operations is performed in the above C statement:
  • this statement casts the memory address to a memory pointer, which points to a 32-bit unsigned integer.
  • the dereference operator retrieves the ODR register value as a 32-bit integer
  • a bit-wise operation is performed to modify this unsigned integer value
  • the updated value is stored back to the ODR register via the dereferencing
This memory block of the port can be represented by using a C struct,
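
A sketch of what such a struct could look like follows. The field names and their order follow ST's device header for STM32L4 as far as I recall, so treat the layout as an assumption to check against the reference manual; `gpio_set_pin` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    volatile uint32_t MODER;    /* 0x00 mode */
    volatile uint32_t OTYPER;   /* 0x04 output type */
    volatile uint32_t OSPEEDR;  /* 0x08 output speed */
    volatile uint32_t PUPDR;    /* 0x0C pull-up/pull-down */
    volatile uint32_t IDR;      /* 0x10 input data */
    volatile uint32_t ODR;      /* 0x14 output data */
    volatile uint32_t BSRR;     /* 0x18 bit set/reset */
    volatile uint32_t LCKR;     /* 0x1C configuration lock */
    volatile uint32_t AFR[2];   /* 0x20 alternate function low/high */
    volatile uint32_t BRR;      /* 0x28 bit reset */
    volatile uint32_t ASCR;     /* 0x2C analog switch control */
} GPIO_TypeDef;

/* On the target: #define GPIOA ((GPIO_TypeDef *) 0x48000000) */

/* Compile-time checks that the layout matches the memory map in the text. */
_Static_assert(offsetof(GPIO_TypeDef, ODR) == 0x14, "ODR must sit at offset 0x14");
_Static_assert(sizeof(GPIO_TypeDef) == 48, "12 registers of 4 bytes each");

/* Set one output pin high through the struct view of the port. */
void gpio_set_pin(GPIO_TypeDef *port, unsigned pin)
{
    port->ODR |= (uint32_t)1 << pin;
}
```

With the struct, a call such as `gpio_set_pin(GPIOA, 14)` replaces the hand-written address arithmetic.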
Note that we put the volatile qualifier on each register.
When a variable is declared as volatile, the compiler is informed that even though no statements in the program appear to change it, the value might still change.
Typically, compilers minimize the number of memory accesses by storing the memory value in a register, and then repeatedly using it without accessing the memory.
The volatile qualifier on a variable prevents the compiler from making such an optimization on this variable.

6. GPIO Output: Lighting up a LED

7. GPIO Input: Interfacing joystick

8. LCD Driver

9. Interrupts

This short video will explain how interrupts work on ARM Cortex-M microprocessors. Using the STM32L4 discovery kit as an ex., there are 2 LEDs and a joystick with 4 push buttons.
Suppose we want to develop software that turns on the red LED when a button is pressed. There are 2 ways to monitor the logic state of an input pin attached to a push button:
  • polling
  • interrupt
  • When the interrupt signal is generated, the processor receives the interrupt, suspends the current execution of programs, and starts the execution of a special program called the interrupt handler. After the interrupt handler completes, the processor resumes the execution of programs.
In the memory address space of ARM Cortex-M, there is an SRAM region.
If the memory address is 32 bits, it can support 4GB of memory space.
The memory space is divided into 6 pre-defined regions and each region has a suggested usage.
The internal SRAM is divided into several segments.
  • Initialized data
  • It contains global and static variables to which the program gives initial values.
  • Zero-initialized data
  • It contains all global or static variables that are uninitialized, or initialized to 0, in the program.
  • Heap
  • It holds data objects that an application creates dynamically at runtime. It grows upwards.
  • Stack
  • It can save the runtime environment and the local variables of subroutines, and pass arguments to a subroutine.
    The stack is placed at the top of the internal SRAM memory, and it grows downwards.
The stack and the heap are located at opposite ends of the free memory region.
They grow in opposite directions.
When the stack meets the heap, free memory space is exhausted.

While the code space can be as large as half a GB in the address space, much of this space is reserved.
For ex., the STM32L4 chip has only 1MB of on-chip flash memory, which starts at 0x08000000 and ends at 0x080FFFFF.
In addition, a small flash memory region starting at 0x08000000 is mapped to the lowest memory region starting at the address 0.

This mapping region includes the initial value for the main stack pointer, and the interrupt vector table.
The Nested Vectored Interrupt Controller (NVIC) prioritizes and handles all interrupts.

When we press the push button connected to the pin PA3, the HW generates an electrical signal, called interrupt request, EXTI3.
When NVIC receives the interrupt request, it forces the processor to jump to and execute a special piece of code, called an interrupt service routine or an interrupt handler.
The entry points of all interrupt service routines are stored in a special table, called an interrupt vector table.
The interrupt vector table is stored at a pre-defined area in the memory.
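For ARM Cortex-M processors, the entries of the interrupt vector table start at the memory address 0x00000004; the word at address 0x00000000 holds the initial value of the main stack pointer.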
By default, the interrupt vector table is mapped to the lowest address of the internal flash memory.
However, software can re-map it to a different location, such as internal SRAM.

The interrupt vector table holds an array of memory addresses.
Each address is the starting address of the interrupt service routine.
The interrupt number is used to index the interrupt table.
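
The idea of a table of handler addresses indexed by interrupt number can be sketched in plain C as an array of function pointers. This is a toy model with invented names, not the real table layout: the real table also holds the initial stack pointer and the system exception entries, which are left out here.

```c
typedef void (*isr_t)(void);

int led_on;                              /* toy state changed by the handler */

void default_handler(void) { }
void exti3_handler(void)   { led_on = 1; }

/* Toy table: entry n holds the entry point of the handler for interrupt n. */
isr_t vector_table[] = {
    default_handler,   /* interrupt 0 */
    default_handler,   /* interrupt 1 */
    default_handler,   /* interrupt 2 */
    exti3_handler      /* interrupt 3 */
};

/* What the hardware does conceptually: index the table, then jump. */
void dispatch(unsigned irq)
{
    if (irq < sizeof vector_table / sizeof vector_table[0]) {
        vector_table[irq]();
    }
}
```

On real hardware the lookup and the jump are done by the processor itself, so no `dispatch` function exists in the program.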

The reset handler entry contains the function pointer which is called when the processor is reset.
When the processor is reset, the program counter is initialized to this address value.
Typically, the reset handler performs some HW initialization, then calls the main function.
When an interrupt arrives, the interrupt controller reads the address of the interrupt handler stored in the interrupt vector table (IVT), then sets the program counter to that value.
This forces the processor to jump to the ISR.

Before jumping to the ISR, the interrupt controller performs stacking to preserve the program's status.
Note that ARM uses a descending stack: if a 32-bit item is pushed onto the stack, the stack pointer (SP) is decremented by 4.
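
The descending behavior can be imitated with a small array and a pointer. This is only a host-side model with invented names and no overflow checks; it assumes the stack pointer always points at the last item pushed.

```c
#include <stdint.h>

#define STACK_WORDS 8
uint32_t stack_mem[STACK_WORDS];
uint32_t *sp = stack_mem + STACK_WORDS;  /* empty stack: SP at the top */

void push(uint32_t value)
{
    sp--;            /* SP moves down by one 32-bit word (4 bytes) first */
    *sp = value;     /* then the item is stored at the new SP */
}

uint32_t pop(void)
{
    uint32_t value = *sp;   /* read the item SP points to */
    sp++;                   /* then SP moves back up by 4 bytes */
    return value;
}
```

Stacking and unstacking around an ISR follow the same pattern, only with several registers pushed and popped at once.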

The ISR completes its execution by executing this instruction:

BX LR
The above instruction informs the interrupt controller to perform an unstacking process.

10. Interrupt Enable and Interrupt Priority

A Cortex-M microcontroller supports up to 256 interrupts.
  • Each interrupt, except the reset interrupt, has an interrupt number.
  • The first 16 interrupts are system interrupts, also called system exceptions.
  • CMSIS (Cortex Microcontroller Software Interface Standard) defines all system exceptions by using negative values.
  • The remaining 240 interrupts are peripheral interrupts, also called non-system exceptions
  • The peripheral interrupt number starts with 0.
    Peripheral interrupts are defined by chip manufacturers.
    The total number of peripheral interrupts supported varies among chips.
Several CMSIS functions use the interrupt number as an input parameter,

NVIC_DisableIRQ(IRQn);            // disable interrupt
NVIC_EnableIRQ(IRQn);             // enable interrupt
NVIC_ClearPendingIRQ(IRQn);       // clear pending status
NVIC_SetPriority(IRQn, priority); // set priority level
When an interrupt is serviced, the current interrupt or exception number is recorded in the program status register (PSR).
The recorded value in PSR is different from the number in CMSIS.
In this tutorial, when we say interrupt number, we mean the interrupt number defined for CMSIS.
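
The two numbering schemes differ by a constant offset: since the first 16 entries are system exceptions, the number recorded in the PSR is the CMSIS interrupt number plus 16. A one-line sketch, with a function name invented for this example:

```c
#include <stdint.h>

/* CMSIS-style interrupt number -> exception number recorded in the PSR. */
uint32_t exception_number(int32_t irqn)
{
    return (uint32_t)(irqn + 16);
}
```

For example, the system timer has CMSIS number -1 and exception number 15, and the first peripheral interrupt has CMSIS number 0 and exception number 16.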
This is the interrupt number definition for STM32L4 Cortex-M4 microprocessors; it is always defined in a header file:
Enabling a system exception is different from enabling a peripheral interrupt.
There are no enabling/disabling registers for system exceptions:
  • Some system exceptions, such as reset and hard fault, cannot be disabled. They are always enabled.
  • The other system exceptions can be enabled or disabled by the corresponding modules, such as the system timer

On the other hand, enabling and disabling peripheral interrupts is implemented by modifying 2 sets of registers: ISER (interrupt set-enable register) and ICER (interrupt clear-enable register).
We can enable a peripheral interrupt by writing 1 to the corresponding bit of the ISER register.
For ex., to enable the Timer 7 interrupt,

  • the interrupt number of Timer 7 is 44 for STM32L1
  • we need to set bit 12 of ISER1 to 1
Similarly, we can disable the Timer 7 interrupt by writing 1 to the corresponding bit of the ICER register.
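
The register index and bit position follow from the interrupt number, because each ISER/ICER register covers 32 interrupts. A small sketch of that arithmetic, with helper names invented for this example:

```c
#include <stdint.h>

/* Which 32-bit ISER/ICER register holds the bit for this interrupt number. */
uint32_t nvic_word_index(uint32_t irqn)
{
    return irqn >> 5;                       /* irqn / 32 */
}

/* Mask with the single bit for this interrupt inside that register. */
uint32_t nvic_bit_mask(uint32_t irqn)
{
    return (uint32_t)1 << (irqn & 31u);     /* bit position is irqn % 32 */
}

/* With the CMSIS NVIC structure on the target, interrupt 44 would be
 * enabled and disabled like this (not compiled in this host-side sketch):
 *   NVIC->ISER[1] = 1UL << 12;
 *   NVIC->ICER[1] = 1UL << 12;
 */
```

A plain assignment is enough for these registers because writing 0 to a bit has no effect.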

What should the processor do if multiple interrupts arrive at the same time?
ARM processors allow software to set priority levels for almost every interrupt.
In ARM, numerically low priority values are used to specify logically high interrupt priorities.
The priorities of some interrupts are fixed.

ARM Cortex-M processors use a byte to represent the priority level.
Interrupt priority is configured through the Interrupt Priority (IP) registers.
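
Many chips implement only the upper bits of the priority byte. The sketch below shows how a priority level would be placed into the byte under the assumption of 4 implemented bits, which I believe is the case for STM32L4; `PRIO_BITS` and `encode_priority` are names made up for this example.

```c
#include <stdint.h>

#define PRIO_BITS 4u   /* assumption: 4 implemented priority bits */

/* Shift the priority level into the upper bits of the priority byte. */
uint8_t encode_priority(uint32_t priority)
{
    return (uint8_t)((priority << (8u - PRIO_BITS)) & 0xFFu);
}
```

Under this assumption, priority level 5 is stored as 0x50, which matches the `5 << 4` shift used with BASEPRI.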

In embedded systems, we often have to perform some critical operations, in which data should not be corrupted by other interrupts.
Therefore, we need to disable all interrupts with less urgency to ensure that the execution of the critical code will not be interrupted by them.
We can use the Base Priority Mask Register (BASEPRI) to protect the critical code.
In this ex., we disable all interrupts whose priority value is >= 5 during the execution of the critical code.


__set_BASEPRI(5 << 4);
// critical code start
..
// critical code end
__set_BASEPRI(0);

11. External interrupts (EXTI)

This lecture will show you how to configure and program external interrupts (EXTI).

External interrupts are generated by peripherals or devices external to the microcontroller, such as push buttons and keypads.
There are 2 approaches to monitor and respond to external events.

  • polling
  • interrupt
An interrupt is essentially a HW-triggered SW action.
The interrupt controller:
  • temporarily stops the normal flow of program execution
  • causes the interrupt service routine (ISR) to be executed
  • After the ISR completes, normal program execution is resumed at the point where it was interrupted.
When there are no interrupt events, the processor runs the normal program or enters a sleep state to conserve energy.

Using the STM32L4 kit as an ex.,

GPIO port A's pins PA0, PA1, PA5, PA2 and PA3 are connected to the "center", "left", "down", "right", and "up" pins of the joystick respectively.
Each pin is connected to the ground via a capacitor.
These capacitors perform HW switch debouncing.
When the "up" button of the joystick is pressed, this switch is closed.
As a result, PA3 is connected to 3V via the "COMMON" terminal.
Note that:
  • the default voltage of the "CENTER" pin is 0 because of the pull-down resistor R59.
  • The other 4 joystick terminals are not pulled down.
  • Their default voltage may be high or low depending on their last usage
We can use external interrupts to monitor whether the joystick is pressed or not.
Each GPIO pin can trigger an interrupt request signal independently.
SW can configure the external interrupt controller so that:
  • PA#0 triggers EXTI0
  • PA#1 triggers EXTI1
  • PA#5 triggers EXTI5
  • PA#2 triggers EXTI2
  • PA#3 triggers EXTI3
How do we configure the source of the external interrupt controller?
The external interrupt controller monitors changes in the voltage signal.
The rising or falling edge of the voltage signal can make the external interrupt controller generate an interrupt request.
The interrupt request will be sent to the NVIC.
The external interrupt controller supports 16 external interrupt inputs; these inputs are named external interrupt 0 to 15 and are associated with GPIO pins.
Each interrupt input k is associated with pin k of one selected GPIO port.
Pins with different pin numbers can be used as interrupt sources simultaneously.
However, for a given pin number, only the pin of one GPIO port can be used at a time.
The interrupt controller has one multiplexer for each GPIO pin number, so there are 16 multiplexers.
A multiplexer (MUX) is a simple circuit. It selects one of its inputs and forwards it to the output.
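
In software, each multiplexer is controlled by a small field in one of the SYSCFG EXTICR registers. The sketch below models the field arithmetic on a plain array, assuming four 4-bit fields per register and port codes of 0 for port A, 1 for port B, and so on; the array and function names are invented for illustration.

```c
#include <stdint.h>

enum { PORT_A = 0, PORT_B = 1, PORT_C = 2 };   /* assumed port codes */

uint32_t exticr[4];   /* stand-in for the four SYSCFG->EXTICR registers */

/* Route pin number `line` of the given port to external interrupt `line`. */
void exti_select_port(unsigned line, unsigned port)
{
    if (line > 15) {
        return;                          /* only lines 0..15 have a multiplexer */
    }
    unsigned reg   = line / 4;           /* four lines per register */
    unsigned shift = (line % 4) * 4;     /* one 4-bit field per line */

    exticr[reg] &= ~((uint32_t)0xF << shift);   /* clear the old selection */
    exticr[reg] |= (uint32_t)port << shift;     /* write the new port code */
}
```

Because each line has exactly one field, selecting a new port for a line automatically deselects the previous one.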
There are dedicated interrupt handlers for external interrupts.
For ex.,
  • PA.3 can be mapped to EXTI3 and its corresponding interrupt handler is EXTI3_IRQHandler.
  • External interrupts from number 5 to 9 share the same interrupt handler EXTI9_5_IRQHandler
  • External interrupts from number 10 to 15 share the same interrupt handler EXTI15_10_IRQHandler
The external interrupt controller supports 2 types of interrupts:
  • configurable external interrupts
  • Interrupts associated with GPIO, RTC, comparators, power voltage detector and peripheral voltage monitoring (PVM).
    For these interrupts, the controller has a programmable edge detector, and software can select which active edge generates an interrupt request.
    Besides, SW can generate an interrupt request by writing 1 to the SW interrupt event register (SWIER).
  • direct external interrupts
  • Only a rising edge can generate an interrupt request.
    These interrupts are mostly used for communication peripherals, low-power timer, and LED.
An interrupt request can pass the AND gate in the controller if and only if the corresponding bit in the Interrupt Mask Register (IMR) is 1.
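
The edge detector and the mask gate can be modeled with a few bitwise operations. This is a simplified host-side model, not the actual hardware logic; the function and parameter names are made up, with `rtsr`, `ftsr` and `imr` standing for the rising trigger selection, falling trigger selection and interrupt mask bits.

```c
#include <stdint.h>

/* Return the set of lines that produce an interrupt request, given the
 * previous and current pin levels and the three configuration words. */
uint32_t exti_requests(uint32_t prev_pins, uint32_t cur_pins,
                       uint32_t rtsr, uint32_t ftsr, uint32_t imr)
{
    uint32_t rising   = ~prev_pins & cur_pins;   /* 0 -> 1 transitions */
    uint32_t falling  = prev_pins & ~cur_pins;   /* 1 -> 0 transitions */
    uint32_t detected = (rising & rtsr) | (falling & ftsr);
    return detected & imr;                       /* the AND gate with the mask */
}
```

A rising edge on line 3 therefore only reaches the NVIC when bit 3 is set in both the rising trigger selection and the mask.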

Let's work on the SW part: if we press the "UP" button of the joystick, SW turns on the LED.
The "UP" button is connected to the GPIO pin PA3, which can generate the external interrupt request 3.

  1. First, we need to enable the clock of GPIO port A.
    RCC->AHB2ENR |= RCC_AHB2ENR_GPIOAEN;

  2. Then, configure the mode of pin PA.3 as digital input.
    // GPIO mode: digital input(00), digital output(01), alternate function(10), analog(11, default).
    GPIOA->MODER &= ~(3U << 6);

  3. Set PA.3 as pull-down
    GPIO PA.3 is neither pulled up nor pulled down externally. It is connected to the ground via a capacitor.
    If the processor doesn't pull it down internally, the voltage on pin PA.3 is floating.

    // GPIO no pull-up, no pull-down(00), pull-up(01), pull-down(10), reserved(11)
    GPIOA->PUPDR &= ~(3U << 6);
    GPIOA->PUPDR |= 2U << 6;     // pull-down(10)

  4. Enable external interrupt 3
    NVIC_EnableIRQ(EXTI3_IRQn);

  5. Select PA.3 as the source of external interrupt 3
    RCC->APB2ENR |= RCC_APB2ENR_SYSCFGEN;          // enable the SYSCFG clock
    SYSCFG->EXTICR[0] &= ~SYSCFG_EXTICR1_EXTI3;    // clear the old selection
    SYSCFG->EXTICR[0] |= SYSCFG_EXTICR1_EXTI3_PA;  // select port A

    When PA.3 is selected, the other ports' pin 3 cannot be used to generate external interrupts.
  6. Rising edge trigger selection
    // 0: trigger disabled, 1: trigger enabled
    EXTI->RTSR1 |= EXTI_RTSR1_RT3;

  7. Set the interrupt mask register
    // 0: masked, 1: not masked
    EXTI->IMR1 |= EXTI_IMR1_IM3;

  8. ISR for external interrupt 3
    After receiving the interrupt request, the NVIC forces the processor to execute the interrupt handler EXTI3_IRQHandler().

    void EXTI3_IRQHandler(void){
        if ((EXTI->PR1 & EXTI_PR1_PIF3) != 0) {
            // toggle LED
            ..
            // clear interrupt flag by writing 1 to it;
            // a plain write avoids clearing other pending flags
            EXTI->PR1 = EXTI_PR1_PIF3;
        }
    }


12. System Timer (SysTick)

13. Timer PWM output

14. Timer Input Capture

15. Booting Process

16. Volatile Variables

17. Race Condition

18. ADC

19. Floating-Point Unit (FPU)

20. Fixed Point Numbers

21. Why learn assembly language

22. Big Endian and Little Endian

23. Load and Store Instructions

24. Addressing mode: pre-index, post-index, and pre-index with update

25. Arithmetic and Logical Instructions

26. Updating NZCV bit flags

27. Branch Instructions

28. Conditional Execution

29. Calling a subroutine

30. Passing arguments to a subroutine

31. Preserving registers in a subroutine

32. Mixing C and assembly


SoC, MPU, MCU

Microcontrollers vs. Microprocessors: What’s the difference?

Microcontrollers (MCUs) tend to be less expensive, simpler to set up, and simpler to operate than microprocessors (MPUs).
An MCU can be viewed as a single-chip computer, whereas an MPU has surrounding chips that support various functions like memory, interfaces, and I/O.

One of the main differences between microcontrollers and microprocessors is that

  • a microprocessor will typically run an operating system.
  • An operating system allows multiple processes to run at the same time via multiple threads. Drivers are required to support peripherals.
  • A microcontroller will run a “bare metal interface,” which means there is not an operating system.
  • Without an operating system, a microcontroller can only run one control loop at a time.
    From a software perspective, this means a single thread is running on the microcontroller’s processor or Central Processing Unit (CPU).
MCUs only have basic options for interfacing with the outside world.
An MCU might have I2C, SPI, a UART (serial), and sometimes a low-level USB connection.
These basic interfaces are often used just for programming the MCU.

An MCU provides more on a single chip than an MPU.
The difference between MCUs and MPUs is becoming less pronounced since some MCUs now come with simple software drivers for more sophisticated peripherals and more MPUs can be found that have integrated peripherals on-chip.

SoC

An SoC (System-on-a-Chip) can be based on an MCU or MPU and will provide everything that’s necessary to perform certain types of applications.
SoCs enable an entire system of chips on a single, tiny IC.
For example, for image processing, an SoC might have a combination of
  • MPU
  • a Digital Signal Processor (DSP)
  • a Graphic Processing Unit (GPU) for performing rapid algorithm calculations, along with on-chip interfaces for driving a display and an HDMI or other audio/video input/output technology.

ARM Instruction Set

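An ARM processor basically has 16 registers, each 32 bits wide. 13 of them are general purpose registers (GPRs), while R13-R15 serve other purposes.
  • R13
  • Usually used as the stack pointer (SP). In practice, some memory is typically allocated as the stack, and during system initialization the address of the bottom of this stack is stored in R13.
  • R14
  • The link register (LR), which holds the return address of a subroutine. For example, when an instruction such as BL or BLX is executed in assembly, the value of the PC is copied into R14 as the return address.
  • R15
  • The program counter (PC), which holds the address of the next instruction
The purpose of learning assembly is not necessarily to improve performance, but to judge whether the machine code generated by an optimizing compiler is correct.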
Basic Syntax

label
    opcode operand1, operand2, ...; Comments
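  • label
  • Optional; usually used to mark an address
  • opcode
  • The operation code of the instruction
  • operand
  • The first operand is the destination of the instruction's result; the number of operands varies from instruction to instruction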
ARM programmer model
  • The state of an ARM system is determined by the content of visible registers and memory.
  • A user-mode program can see 15 32-bit general-purpose registers (R0-R14), program counter (PC) and CPSR.
  • Instruction set defines the operations that can change the state.
An Instruction Set Architecture (ISA) is part of the abstract model of a computer that defines how the CPU is controlled by the software.
The ISA defines the supported data types, the registers, how the hardware manages main memory, key features (such as virtual memory), which instructions a microprocessor can execute, and the input/output model of multiple ISA implementations.
ARM instructions are all 32 bits long (except in Thumb mode).
There are 2^32 possible machine instructions. Fortunately, they are structured.
Regarding registers, briefly:
  • r0
  • Return value, first function argument
  • r1-r3
  • Function arguments and general scratch
  • r4-r11
  • Saved registers
  • r12
  • ip. Intra-procedure scratch register, rarely used by the linker
  • r13
  • sp. Stack pointer, a pointer to the end of the stack. Moved by push and pop instructions.
  • r14
  • lr. Link register, storing the address to return to when the function is done. Written by "bl" (branch and link, like function call), often saved with a push/pop sequence, read by "bx lr" (branch to link register) or the pop.
  • r15
  • pc. Program counter, the current memory address being executed. It's very unusual, but handy, to have the program counter just be another register--for example, you can do program counter relative addressing very easily, by just loading from [pc+addr].
Instruction set:
  • Data processing
  • They are move, arithmetic, logical, comparison and multiply instructions.
  • Data movement
  • Flow control

C6: A64 Base Instruction Descriptions

C6.2.173 MRS(Move System Register)

To read an AArch64 System register into a general-purpose register.

C6.2.175 MSR (register)

To write an AArch64 System register from a general-purpose register.

What is the purpose of WFI and WFE instructions and the event signals?

We have 2 instructions for entering low-power standby state where most clocks are gated: WFI and WFE.
  • WFI is targeted at entering either standby, dormant or shutdown mode, where an interrupt is required to wake-up the processor.
  • WFE makes use of the event register, the SEV instruction and EVENTI, EVENTO signals.
  • A usage for WFE is to put it into a spinlock loop.
    When a CPU wants to access a shared resource such as shared memory, we can use a semaphore flag location managed by exclusive load and store accesses.
    If multiple CPUs are trying to access the resource, one will get access and will start to use the resource while the other CPUs will be stuck in the spinlock loop.
    To save power, you can insert the WFE instruction into the loop so that the CPUs, instead of looping continuously, enter STANDBYWFE.
    Then the CPU that has been using the resource should execute the SEV instruction after it has finished using the resource.
    This will wake up all other CPUs from STANDBYWFE and another CPU can then access the shared resource.
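
The spinlock pattern described above can be sketched in portable C11. This is not ARM's reference implementation: the semaphore is modeled with a standard atomic flag instead of exclusive load and store instructions, and `cpu_wait_for_event` and `cpu_send_event` are hypothetical hooks that would wrap WFE and SEV on a real target but do nothing here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical hooks: on an ARM target these would execute WFE and SEV. */
void cpu_wait_for_event(void) { }
void cpu_send_event(void)     { }

atomic_flag resource_lock = ATOMIC_FLAG_INIT;

bool try_lock(void)
{
    /* test_and_set returns the previous value: false means we got the lock */
    return !atomic_flag_test_and_set_explicit(&resource_lock,
                                              memory_order_acquire);
}

void lock(void)
{
    while (!try_lock()) {
        cpu_wait_for_event();   /* sleep instead of spinning at full power */
    }
}

void unlock(void)
{
    atomic_flag_clear_explicit(&resource_lock, memory_order_release);
    cpu_send_event();           /* wake the CPUs waiting in the loop */
}
```

The loop re-checks the flag after every wake-up, so it stays correct even if a CPU is woken by an unrelated event.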

RASPBERRY PI ON QEMU


Emulate Raspberry Pi 3 using QEMU in 64 bit


Learning to implement a small operating system


Low-Level Programming University


ARM Cortex-A Series Programmer's Guide for ARMv7-A

Preface

The purpose of this book is to provide a single guide for programmers who want to develop applications for the Cortex-A series of processors, bringing together information from a wide variety of sources that will be useful to both assembly language and C programmers.
Hardware concepts such as caches and Memory Management Units are covered, but only where this is valuable to the application writer.
We will also look at the way operating systems such as Linux make use of ARM features, and how to take full advantage of the capabilities of the ARM processor, in particular writing software for multi-core processors.

This is not an introductory level book. It assumes some knowledge of the C programming language and microprocessors, but not of any ARM-specific background.
We hope that the book is suitable for programmers who have a desktop PC or x86 background and are taking their first steps into the ARM processor based world.

Chapter 1 Introduction

Chapter 2 ARM Architecture and Processors

Chapter 3 ARM Processor Modes and Registers

The ARM architecture is a modal architecture.
Before the introduction of Security Extensions it had seven processor modes: six privileged modes and a non-privileged user mode.
  • User (USR)
  • Mode in which most programs and applications run
  • FIQ
  • Entered on an FIQ interrupt exception
  • IRQ
  • Entered on an IRQ interrupt exception
  • Supervisor (SVC)
  • Entered on reset or when a Supervisor Call instruction (SVC) is executed
  • Abort (ABT)
  • Entered on a memory access exception
  • Undef (UND)
  • Entered when an undefined instruction is executed
  • System (SYS)
  • (kernel) Mode in which the OS runs, sharing the register view with User mode
Privilege is the ability to perform certain tasks that cannot be done from User (Unprivileged) mode.
For ex., the user mode cannot do MMU configuration and cache operations.
Modes are associated with exception events, which are described in Exception Handling.

The introduction of the TrustZone Security Extensions created two security states for the processor that are independent of Privilege and processor mode, with a new Monitor mode to act as a gateway between the Secure and Non-secure states and modes existing independently for each security state.

For processors that implement the TrustZone extension, system security is achieved by dividing all of the hardware and software resources for the device.
When a processor is in the Non-secure state, it cannot access the memory that is allocated for Secure state.
In this situation the Secure Monitor acts as a gateway for moving between these two worlds. Software executing in Monitor mode controls transition between Secure and Non-secure processor states.

The ARMv7-A architecture Virtualization Extensions add a hypervisor mode (Hyp), in addition to the existing privileged modes.
Virtualization enables more than one Operating System to co-exist and operate on the same system.

If the Virtualization Extensions are implemented, the privilege model is described in terms of privilege levels:
  • PL0
  • Software executing at PL0 can make only unprivileged memory accesses.
  • PL1
  • Software execution in all modes other than User mode and Hyp mode is at PL1.
    Normally, operating system software executes at PL1.
  • PL2
  • Hyp mode is normally used by a hypervisor, which controls, and can switch between, guest operating systems that execute at PL1.
In Non-secure state there can be three privilege levels, PL0, PL1 and PL2.

These privilege levels are separate from the TrustZone Secure and Normal (Non-secure) settings.
The privilege level defines the ability to access resources in the current security state, and does not imply anything about the ability to access resources in the other security state.

The presence of particular processor modes and states depends on whether the processor implements the relevant architecture extension (Virtualization, TrustZone).
The current processor mode and execution state are contained in the Current Program Status Register (CPSR).
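
As an illustration of how the mode is read back from the CPSR, the sketch below decodes the mode field M[4:0]. The numeric encodings are taken from the ARMv7-A architecture rather than from the text above, and the function names are hypothetical.

```c
#include <stdint.h>

/* Mode field M[4:0] of the CPSR (AArch32). */
#define CPSR_MODE_MASK 0x1Fu

/* Return a short name for the processor mode encoded in a CPSR value. */
const char *cpsr_mode_name(uint32_t cpsr)
{
    switch (cpsr & CPSR_MODE_MASK) {
    case 0x10: return "USR";
    case 0x11: return "FIQ";
    case 0x12: return "IRQ";
    case 0x13: return "SVC";
    case 0x16: return "MON";   /* only with the Security Extensions */
    case 0x17: return "ABT";
    case 0x1A: return "HYP";   /* only with the Virtualization Extensions */
    case 0x1B: return "UND";
    case 0x1F: return "SYS";
    default:   return "reserved";
    }
}

/* Every mode except User is privileged. */
int cpsr_is_privileged(uint32_t cpsr)
{
    return (cpsr & CPSR_MODE_MASK) != 0x10;
}
```

Only the low five bits are inspected, so the condition flags and interrupt masks in the rest of the CPSR do not affect the result.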

Chapter 4


Generic Interrupt Controller (GIC)

A Generic Interrupt Controller (GIC) takes interrupts from peripherals, prioritizes them, and delivers them to the appropriate processor core.
The Arm GIC architecture has three forms in general use with the A-profile and R-profile processors.

1. Introduction

Terminology

About the Generic Interrupt Controller architecture

The GIC is a centralized resource for supporting and managing interrupts in a system that includes at least one processor.
It provides registers for managing interrupt sources, interrupt behavior, and interrupt routing to one or more processors.

The GIC includes interrupt grouping functionality that supports:

  • configuring each interrupt as either Group 0 or Group 1
  • signaling Group 0 interrupts to the target processor using either the IRQ or the FIQ exception request
  • signaling Group 1 interrupts to the target processor using the IRQ exception request only
  • a unified scheme for handling the priority of Group 0 and Group 1 interrupts
  • optional lockdown of the configuration of some Group 0 interrupts.

Security Extensions support

Virtualization support

Terminology

2. GIC Partitioning

About GIC partitioning

The GIC architecture splits logically into a Distributor block and one or more CPU interface blocks.
The GIC Virtualization Extensions add one or more virtual CPU interfaces to the GIC.
  • Distributor
  • The Distributor block performs interrupt prioritization and distribution to the CPU interface blocks that connect to the processors in the system.
    The Distributor block registers are identified by the GICD_ prefix.
  • CPU interfaces
  • Each CPU interface block performs priority masking and preemption handling for a connected processor in the system.
    CPU interface block registers are identified by the GICC_ prefix.
  • Virtual CPU interfaces
  • The GIC Virtualization Extensions add a virtual CPU interface for each processor in the system.
    Each virtual CPU interface is partitioned into the following blocks:
    • Virtual interface control
    • The main component of the virtual interface control block is the GIC virtual interface control registers, that include a list of active and pending virtual interrupts for the current virtual machine on the connected processor.
      Typically, these registers are managed by the hypervisor that is running on that processor.
      Virtual interface control block registers are identified by the GICH_ prefix.
    • Virtual CPU interface
    • Each virtual CPU interface block provides physical signaling of virtual interrupts to the connected processor.
      The ARM processor Virtualization Extensions signal these interrupts to the current virtual machine on that processor.
      The GIC virtual CPU interface registers, accessed by the virtual machine, provide interrupt control and status information for the virtual interrupts.
      Virtual CPU interface block registers are identified by the GICV_ prefix.

The Distributor

The Distributor centralizes all interrupt sources, determines the priority of each interrupt, and for each CPU interface forwards the interrupt with the highest priority to the interface, for priority masking and preemption handling.

Interrupts from sources are identified using ID numbers. Each CPU interface can see up to 1020 interrupts.

CPU interfaces

Each CPU interface block provides the interface for a processor that is connected to the GIC.

3. Interrupt Handling and Prioritization

4. Programmers' Model

This chapter describes the Distributor and CPU interface registers.

The programmers' model for the GIC Distributor and CPU interfaces is to operate using a memory-mapped register interface.

About the programmers' model

GIC register names

Distributor register map

CPU interface register map

GIC register access

Enabling and disabling the Distributor and CPU interfaces

Effect of the GIC Security Extensions on the programmers' model

GICv3 and GICv4 Software Overview

1. Preface

1.3 Terms and Abbreviations

  • ARE
  • Affinity Routing Enable
  • PE
  • The term Processing Element or PE is used as a generic term for a machine that implements the ARM architecture.
    For example, the ARM® Cortex®-A57 MPCore can have up to 4 cores. Each core is what the architecture specifications refer to as a PE.

2. Introduction

2.4 Legacy support

The programmers’ model that is used is controlled by the Affinity Routing Enable (ARE) bits in GICD_CTRL :
  • When ARE == 0, affinity routing is disabled (legacy operation).
  • When ARE == 1, affinity routing is enabled.
This document focuses on the new GICv3 programmers’ model, where ARE == 1 for both Security states.

3. GICv3 fundamentals

3.1 Interrupt types

3.1.3 How interrupts are signaled to the interrupt controller
  • Traditionally, interrupts are signaled from a peripheral to the interrupt controller using a dedicated hardware signal.
  • GICv3 supports message-based interrupts.
  • A message-based interrupt is an interrupt that is set and cleared by a write to a register in the interrupt controller.
    Using a message to forward the interrupt from a peripheral to the interrupt controller removes the requirement for a dedicated signal per interrupt source.

3.3 Affinity routing

GICv3 uses affinity routing to identify connected PEs and to route interrupts to a specific PE or group of PEs.
The affinity of a PE is represented as four 8-bit fields:

<affinity level 3>.<affinity level 2>.<affinity level 1>.<affinity level 0>
The affinity scheme matches that used in ARMv8-A, with the affinity of a PE reported in MPIDR_EL1.
System designers must ensure that the affinity value indicated by MPIDR_EL1 is identical to that indicated by GICR_TYPER for the Redistributor connected to the PE.
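
The matching requirement above can be sketched as a comparison of the two affinity values. This is a minimal sketch that assumes the affinity occupies GICR_TYPER[63:32] in Aff3.Aff2.Aff1.Aff0 order and that MPIDR_EL1 holds Aff3 in bits [39:32] and Aff2..Aff0 in bits [23:0]. The function names are hypothetical.

```c
#include <stdint.h>

/* Pack Aff3.Aff2.Aff1.Aff0 from an MPIDR_EL1 value into one 32-bit word. */
uint32_t mpidr_to_affinity(uint64_t mpidr)
{
    uint32_t aff3   = (uint32_t)((mpidr >> 32) & 0xFF);
    uint32_t aff210 = (uint32_t)(mpidr & 0xFFFFFF);
    return (aff3 << 24) | aff210;
}

/* Assumed layout: GICR_TYPER[63:32] holds Aff3.Aff2.Aff1.Aff0. */
uint32_t gicr_typer_to_affinity(uint64_t typer)
{
    return (uint32_t)(typer >> 32);
}

/* Non-zero when the Redistributor described by typer belongs to the PE
 * whose MPIDR_EL1 value is mpidr. */
int redistributor_matches_pe(uint64_t mpidr, uint64_t typer)
{
    return mpidr_to_affinity(mpidr) == gicr_typer_to_affinity(typer);
}
```

Software that walks the Redistributor frames could use a check like this to find the frame that belongs to the PE it is running on.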

The exact meaning of the different levels of affinity is defined by the specific processor and SoC.
For example:

  • <group of groups>.<group of processors>.<processor>.<core>
  • <group of processors>.<processor>.<core>.<thread>
        	

3.4 Security model

3.5 Programmers’ model

The register interface of a GICv3 interrupt controller is split into three groups:
  • Distributor interface (GICD_*).
  • Redistributor interface (GICR_*).
  • CPU interface (ICC_*_ELn).
    In GICv3 the CPU interface registers are accessed as System registers.

Generic Timer

The Generic Timer provides a standardized timer framework for Arm cores.
The Generic Timer includes a System Counter and a set of per-core timers.
The System Counter is an always-on device, which provides a fixed frequency incrementing system count.
The system count value is broadcast to all the cores in the system, giving the cores a common view of the passage of time.
Each core has a set of timers.
These timers are comparators, which compare against the broadcast system count that is provided by the System Counter.
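Each timer has the following three system registers:
  • <timer>_CTL_EL<x>
  • Control register
  • <timer>_CVAL_EL<x>
  • Comparator value register
  • <timer>_TVAL_EL<x>
  • Timer value register
For example, CNTP_CVAL_EL0 is the Comparator register of the EL1 physical timer.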

The CNTPCT_EL0 system register reports the current system count value.
CNTFRQ_EL0 reports the frequency of the system count. However, this register is not populated by hardware.

Timer virtualization

Timers can be divided into two groups: virtual timers and physical timers.

  • Physical timers
  • Physical timers, such as the EL3 physical timer (CNTPS), compare against the count value provided by the System Counter.
    This value is referred to as the physical count and is reported by CNTPCT_EL0.
  • Virtual timers
  • Virtual timers, such as the EL1 virtual timer (CNTV), compare against a virtual count.
    The virtual count is calculated as:
    
        Virtual Count = Physical Count - <offset>
        
    The offset value is specified in the register CNTVOFF_EL2, which is only accessible at EL2 or EL3.
    If EL2 is not implemented, the offset is fixed at 0. This means that the virtual and physical count values are always the same.
The virtual count allows a hypervisor to show virtual time to a Virtual Machine (VM).
This means that the virtual count can represent time experienced by the VM, rather than wall clock time.
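
The offset arithmetic can be sketched in plain C. The helper names are hypothetical, and the second function only illustrates one way a hypervisor might choose an offset so that a VM does not see the time during which it was not running.

```c
#include <stdint.h>

/* Virtual Count = Physical Count - offset, computed modulo 2^64. */
uint64_t virtual_count(uint64_t physical_count, uint64_t cntvoff)
{
    return physical_count - cntvoff;
}

/* Offset to program so that a VM whose virtual count was saved as
 * saved_virtual continues from that value when it resumes at the
 * physical count 'now'. */
uint64_t offset_for_resume(uint64_t now, uint64_t saved_virtual)
{
    return now - saved_virtual;
}
```

With an offset of 0 the two counts are identical, which matches the behavior described for systems without EL2.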

System Counter

The System Counter generates the system count value that is distributed to all the cores in the system.
This means that all cores share the same view of the passing of time.
Consider the following example:
  • Device A reads the current system count and adds it to a message as a timestamp, then sends the message to Device B.
  • When Device B receives the message, it compares the timestamp to the current system count.
In this example, the system count value that is seen by Device B can never be earlier than the timestamp in the message.

The System Counter measures real time.
The count must continue to increment at its fixed frequency.
The System Counter provides two register frames: CNTControlBase and CNTReadBase.


Registers

Refer to the AArch64 Reference Manual, which contains the detailed specification of the ARMv8 architecture.

CNTFRQ_EL0, Counter-timer Frequency register

This register is provided so that software can discover the frequency of the system counter.
It must be programmed with this value as part of system initialization.
The value of the register is not interpreted by hardware.

CNTFRQ_EL0 is a 64-bit register.

AArch64 System register CNTFRQ_EL0 bits [31:0] are architecturally mapped to AArch32 System register CNTFRQ[31:0].
Bits [31:0] indicate the system counter clock frequency, in Hz.
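
Because the frequency is given in Hz, a count difference can be converted to elapsed time. The following is a minimal sketch with a hypothetical helper name. It treats a frequency of 0, which is what an unprogrammed register might hold, as "unknown" and returns 0.

```c
#include <stdint.h>

/* Convert a number of system counter ticks to microseconds, given the
 * frequency in Hz as reported by CNTFRQ_EL0. Returns 0 if the frequency
 * has not been programmed. */
uint64_t ticks_to_us(uint64_t ticks, uint32_t cntfrq_hz)
{
    if (cntfrq_hz == 0)
        return 0;

    /* Split into whole seconds and a remainder so that the
     * multiplication by 1000000 cannot overflow for large tick counts. */
    uint64_t secs = ticks / cntfrq_hz;
    uint64_t rem  = ticks % cntfrq_hz;
    return secs * 1000000u + (rem * 1000000u) / cntfrq_hz;
}
```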

CNTPCT_EL0, Counter-timer Physical Count register

This holds the 64-bit physical count value.

CNTVCT_EL0, Counter-timer Virtual Count register

This holds the 64-bit virtual count value.
The virtual count value is equal to the physical count value visible in CNTPCT_EL0 minus the virtual offset visible in CNTVOFF_EL2.
This register can be read using MRS with the following syntax:

MRS <Xt>, <systemreg>

CNTVOFF_EL2, Counter-timer Virtual Offset register

This holds the 64-bit virtual offset.
This is the offset between the physical count value visible in CNTPCT_EL0 and the virtual count value visible in CNTVCT_EL0.

This register can be read using MRS with the following syntax:

MRS <Xt>, <systemreg>

MIDR, Main ID Register

Provides identification information for the PE, including an implementer code for the device and a device ID number.
There is one instance of this register that is used in both Secure and Non-secure states.
Some fields of the MIDR are IMPLEMENTATION DEFINED.
  • Implementer, bits [31:24]
  • This field must hold an implementer code that has been assigned by ARM.
    For example, NVIDIA uses 0x4E.
  • Variant, bits [23:20]
  • An IMPLEMENTATION DEFINED variant number.
  • Architecture, bits [19:16]
  • PartNum, bits [15:4]
  • An IMPLEMENTATION DEFINED primary part number for the device.
  • Revision, bits [3:0]
  • An IMPLEMENTATION DEFINED revision number for the device.
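
The field layout above maps directly onto shifts and masks. This is a minimal sketch with hypothetical struct and function names.

```c
#include <stdint.h>

/* The MIDR fields, using the bit positions listed above. */
struct midr_fields {
    uint8_t  implementer;   /* bits [31:24] */
    uint8_t  variant;       /* bits [23:20] */
    uint8_t  architecture;  /* bits [19:16] */
    uint16_t part_num;      /* bits [15:4]  */
    uint8_t  revision;      /* bits [3:0]   */
};

/* Split a raw 32-bit MIDR value into its fields. */
struct midr_fields midr_decode(uint32_t midr)
{
    struct midr_fields f;
    f.implementer  = (uint8_t)((midr >> 24) & 0xFF);
    f.variant      = (uint8_t)((midr >> 20) & 0xF);
    f.architecture = (uint8_t)((midr >> 16) & 0xF);
    f.part_num     = (uint16_t)((midr >> 4) & 0xFFF);
    f.revision     = (uint8_t)(midr & 0xF);
    return f;
}
```

For a sample value of 0x410FD034, this yields implementer 0x41, variant 0x0, architecture 0xF, part number 0xD03 and revision 0x4.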

System Control Register (SCTLR)

The SCTLR provides the top level control of the system, including its memory system.
  • EE, bit [25]
  • The value of the PSTATE.E bit on branch to an exception vector or coming out of reset, and the endianness of stage 1 translation table walks in the PL1&0 translation regime.
    The possible values of this bit are:
    • 0
    • Little-endian. PSTATE.E is cleared to 0 on taking an exception or coming out of reset.
      Stage 1 translation table walks in the PL1&0 translation regime are little-endian.
    • 1
    • Big-endian. PSTATE.E is set to 1 on taking an exception or coming out of reset.
      Stage 1 translation table walks in the PL1&0 translation regime are big-endian.
  • I, bit [12]
  • Instruction access Cacheability control, for accesses at EL1 and EL0:
    • 0
    • All instruction access to Normal memory from PL1 and PL0 are Non-cacheable for all levels of instruction and unified cache.
      If the value of SCTLR.M is 0, instruction accesses from stage 1 of the PL1&0 translation regime are to Normal, Outer Shareable, Inner Non-cacheable, Outer Non-cacheable memory.
    • 1
    • All instruction access to Normal memory from PL1 and PL0 can be cached at all levels of instruction and unified cache.
      If the value of SCTLR.M is 0, instruction accesses from stage 1 of the PL1&0 translation regime are to Normal,
  • C, bit [2]
  • Cacheability control, for data accesses at EL1 and EL0:
    • 0
    • All data access to Normal memory from PL1 and PL0, and all accesses to the PL1&0 stage 1 translation tables, are Non-cacheable for all levels of data and unified cache.
    • 1
    • All data access to Normal memory from PL1 and PL0, and all accesses to the PL1&0 stage 1 translation tables, can be cached at all levels of data and unified cache.
  • M, bit [0]
  • MMU enable for EL1 and EL0 stage 1 address translation.
    Possible values of this bit are:
    • 0
    • EL1 and EL0 stage 1 address translation disabled. See the SCTLR.I field for the behavior of instruction accesses to Normal memory.
    • 1
    • EL1 and EL0 stage 1 address translation enabled.
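
The control bits described above can be tested and set with simple masks. The sketch below uses the bit positions from this list, and the macro and function names are hypothetical. It works on a plain integer, so actually reading or writing the register is left to the caller.

```c
#include <stdint.h>

#define SCTLR_BIT_M   (1u << 0)   /* MMU enable */
#define SCTLR_BIT_C   (1u << 2)   /* data cacheability control */
#define SCTLR_BIT_I   (1u << 12)  /* instruction cacheability control */
#define SCTLR_BIT_EE  (1u << 25)  /* exception endianness */

int sctlr_mmu_enabled(uint32_t sctlr)    { return (sctlr & SCTLR_BIT_M) != 0; }
int sctlr_dcache_enabled(uint32_t sctlr) { return (sctlr & SCTLR_BIT_C) != 0; }
int sctlr_icache_enabled(uint32_t sctlr) { return (sctlr & SCTLR_BIT_I) != 0; }
int sctlr_big_endian(uint32_t sctlr)     { return (sctlr & SCTLR_BIT_EE) != 0; }

/* Return a copy of sctlr with the MMU and both caches enabled,
 * leaving every other bit unchanged. */
uint32_t sctlr_enable_mmu_and_caches(uint32_t sctlr)
{
    return sctlr | SCTLR_BIT_M | SCTLR_BIT_C | SCTLR_BIT_I;
}
```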

SCTLR_EL1, System Control Register (EL1)

Provides top level control of the system, including its memory system, at EL1 and EL0.

AArch64 System register SCTLR_EL1 bits [31:0] are architecturally mapped to AArch32 System register SCTLR[31:0].

  • DSSBS, bit [44]
  • Default PSTATE.SSBS value on Exception Entry.
    • When FEAT_SSBS is implemented
      • 0
      • PSTATE.SSBS is set to 0 on an exception to EL1.
      • 1
      • PSTATE.SSBS is set to 1 on an exception to EL1.
    • Otherwise
    • Reserved, RES0.

SSBS, Speculative Store Bypass Safe

This register is present only when FEAT_SSBS is implemented. Otherwise, direct accesses to SSBS are UNDEFINED.

HCR_EL2, Hypervisor Configuration Register (EL2)

Provides configuration controls for virtualization, including defining whether various Non-secure operations are trapped to EL2.
  • RW, bit [31]
  • Execution state control for lower Exception levels:
    • 0
    • Lower levels are all AArch32.
    • 1
    • The Execution state for EL1 is AArch64.
      The Execution state for EL0 is determined by the current value of PSTATE.nRW when executing at EL0.

SCR_EL3, Secure Configuration Register (EL3)

Defines the configuration of the current Security state. It specifies:
  • The Security state of EL0 and EL1, either Secure or Non-secure.
  • The Execution state at lower Exception levels.
  • Whether IRQ, FIQ, SError interrupts, and External abort exceptions are taken to EL3.
  • RW, bit [10]
  • Execution state control for lower Exception levels.
    • 0
    • Lower levels are all AArch32.
    • 1
    • The next lower level is AArch64.
      • If EL2 is present:
        • EL2 is AArch64.
        • EL2 controls EL1 and EL0 behaviors.
      • If EL2 is not present:
        • EL1 is AArch64.
        • EL0 is determined by the Execution state described in the current process state when executing at EL0.
  • Bits [5:4]
  • Reserved, RES1.
  • NS, bit [0]
  • Non-secure bit.
    • 0
    • Indicates that EL0 and EL1 are in Secure state, and so memory accesses from those Exception levels can access Secure memory.
      When executing at EL3:
      • The AT S1E2R, AT S1E2W, TLBI VAE2, TLBI VALE2, TLBI VAE2IS, TLBI VALE2IS, TLBI ALLE2, and TLBI ALLE2IS System instructions are UNDEFINED.
      • Each AT S12E** System instruction executes as the corresponding AT S1E** instruction.
      • For example, AT S12E0R executes as AT S1E0R.
      • Each of the TLBI IPAS2E1, TLBI IPAS2E1IS, TLBI IPAS2LE1, and TLBI IPAS2LE1IS System instructions executes as a NOP.
      • A TLBI VMALLS12E1 System instruction executes as TLBI VMALLE1, and a TLBI VMALLS12E1IS System instruction executes as TLBI VMALLE1IS.
    • 1
    • Indicates that EL0 and EL1 are in Non-secure state, and so memory accesses from those Exception levels cannot access Secure memory.
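
A value for this register can be composed from the fields described above. The sketch below covers only RW, NS and the RES1 bits [5:4]. The macro and function names are hypothetical, and a real platform would normally have to consider the other fields as well.

```c
#include <stdint.h>

#define SCR_EL3_NS    (1u << 0)   /* EL0 and EL1 are Non-secure */
#define SCR_EL3_RES1  (3u << 4)   /* bits [5:4], reserved, written as 1 */
#define SCR_EL3_RW    (1u << 10)  /* next lower level is AArch64 */

/* Build an SCR_EL3 value from the two choices discussed above. */
uint32_t scr_el3_value(int non_secure, int lower_aarch64)
{
    uint32_t value = SCR_EL3_RES1;
    if (non_secure)
        value |= SCR_EL3_NS;
    if (lower_aarch64)
        value |= SCR_EL3_RW;
    return value;
}
```

Selecting Non-secure state and AArch64 for the lower levels gives 0x431.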

SPSR_EL3, Saved Program Status Register (EL3)

Holds the saved process state when an exception is taken to EL3.

ACTLR, Auxiliary Control Register

AArch32 System register ACTLR provides IMPLEMENTATION DEFINED configuration and control options for execution at EL1 and EL0.
ACTLR is a 32-bit register, and is part of:
  • The Other system control registers functional group.
  • The Implementation defined functional group.

ACTLR_EL1, Auxiliary Control Register (EL1)

Provides IMPLEMENTATION DEFINED configuration and control options for execution at EL1 and EL0.
ACTLR_EL1 is a 64-bit register.

ACTLR_EL2, Auxiliary Control Register (EL2)

Provides IMPLEMENTATION DEFINED configuration and control options for EL2.

ACTLR_EL3, Auxiliary Control Register (EL3)

Provides IMPLEMENTATION DEFINED configuration and control options for EL3.
ACTLR_EL3 is a 64-bit register.

MPIDR_EL1, Multiprocessor Affinity Register, EL1

The MPIDR_EL1 provides an additional core identification mechanism for scheduling purposes in a cluster.
It enables software to determine on which core it is executing, and it has a different value for each processing element in the system.
How a processing element (PE) maps onto an ARM core or cluster is described by the MPIDR system register.
The format of this register is as follows (for AArch64):

  • RES0, [63:40]
  • Reserved.
  • Aff3, [39:32]
  • Affinity level 3. Highest level affinity field.
  • RES1, [31]
  • Reserved
  • U, [30]
  • Indicates whether this is a single core or a multi-core cluster.
    0 means the core is part of a multiprocessor system. This is the value for implementations with more than one core, and for implementations with an ACE or CHI master interface.
  • [29:25]
  • Reserved.
  • MT, [24]
  • Indicates whether the lowest level of affinity consists of logical cores that are implemented using a multithreading type approach.
  • Aff2, [23:16]
  • Aff1, [15:12]
  • Part of Affinity level 1.
    Read-As-Zero.
  • Aff1, [11:8]
  • Part of Affinity level 1. CPUID: identification number for each CPU in the Cortex-A75 cluster:
    • 0x0
    • MP1: CPUID: 0
    • ...
    • 0x7
    • MP8: CPUID: 7
  • Aff0, [7:0]
  • Affinity level 0.
    The level identifies individual threads within a multithreaded core.
    The Cortex-A75 core is single-threaded, so this field has the value 0x00.
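A physical CPU can have several cores; a CPU core is a physical processing unit.
The cores are independent of one another and can execute in parallel. Each core has its own registers, L1 and L2 caches, and other physical hardware.
On top of cores, Intel introduced the concept of hyper-threading: one core can present multiple logical cores, and each of these is called a thread.
A thread is therefore a logical processing unit rather than a separate physical one.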
The affinity fields give a hierarchical description of the core's location relative to other cores.
Typically,
  • Affinity 0 is the core ID within the cluster
  • Affinity 1 is the cluster ID.
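
The affinity fields can be pulled out of a raw MPIDR_EL1 value as follows. This sketch assumes the generic 8-bit layout (Aff0 [7:0], Aff1 [15:8], Aff2 [23:16], Aff3 [39:32]) rather than the Cortex-A75 split of Aff1 shown above, and the names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

struct mpidr_affinity {
    uint8_t aff3;
    uint8_t aff2;
    uint8_t aff1;
    uint8_t aff0;
};

/* Extract the four affinity levels from a raw MPIDR_EL1 value. */
struct mpidr_affinity mpidr_decode(uint64_t mpidr)
{
    struct mpidr_affinity a;
    a.aff3 = (uint8_t)((mpidr >> 32) & 0xFF);
    a.aff2 = (uint8_t)((mpidr >> 16) & 0xFF);
    a.aff1 = (uint8_t)((mpidr >> 8) & 0xFF);
    a.aff0 = (uint8_t)(mpidr & 0xFF);
    return a;
}

/* Write the affinity as "aff3.aff2.aff1.aff0" into buf.
 * Returns the value returned by snprintf. */
int mpidr_format(uint64_t mpidr, char *buf, size_t len)
{
    struct mpidr_affinity a = mpidr_decode(mpidr);
    return snprintf(buf, len, "%u.%u.%u.%u",
                    (unsigned)a.aff3, (unsigned)a.aff2,
                    (unsigned)a.aff1, (unsigned)a.aff0);
}
```

With the typical assignment above, aff0 is the core ID within the cluster and aff1 is the cluster ID.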

    // Read the current CPU ID; if it is not 0 (the primary core), jump to halt.
    // mrs -- Move the contents of a special register to a general-purpose register.
    // mpidr_el1 is used to read the core ID.
    mrs     x1, mpidr_el1
    and     x1, x1, #0xFF // CPU number is in MPIDR Affinity Level 0
    cbnz    x1, halt // Hang for all non-primary CPU

arch/arm64/include/asm/sysreg.h



#define read_sysreg_s(r) ({						\
	u64 __val;							\
	asm volatile(__mrs_s("%0", r) : "=r" (__val));			\
	__val;								\
})

arch/arm64/include/asm/cputype.h


#define read_cpuid(reg)			read_sysreg_s(SYS_ ## reg)

arch/arm/include/asm/cputype.h



#define CPUID_MPIDR	5

static inline unsigned int __attribute_const__ read_cpuid_mpidr(void)
{
	return read_cpuid(CPUID_MPIDR);
}

ARM GCC Inline Assembler Cookbook

The GNU C compiler for ARM RISC processors offers a way to embed assembly language code into C programs.

GCC asm statement

With inline assembly you can use the same assembler instruction mnemonics as you'd use for writing pure ARM assembly code.

Basic inline assembly syntax


__asm [volatile] (code);
code is the assembly instruction.
For example:

/* NOP example */
asm("mov r0,r0");
You can write more than one assembler instruction in a single inline asm statement.

asm(
"mov     r0, r0\n\t"
"mov     r0, r0\n\t"
"mov     r0, r0\n\t"
"mov     r0, r0"
);

Extended inline assembly syntax

However, registers and constants are specified in a different way, if they refer to C expressions.
  
__asm [volatile] ( code_template
                   : output operand list
                   : input operand list
                   : clobber list);
code_template is a template for an assembly instruction.
The connection between assembly language and C operands is provided by an optional second and third part of the asm statement, the list of output and input operands.
Each operand consists of:
  • a symbolic name in square brackets
  • a constraint string
    • "=r" for the output operands
    • "r" for the input operands
  • a C expression in parentheses.
For example:
  
/* Rotating bits example */
asm("mov %[result], %[value], ror #1" : [result] "=r" (y) : [value] "r" (x));

The following example sets the current program status register of the ARM CPU. It uses an input, but no output operand.

  
asm ("msr cpsr,%[ps]"
     :
     : [ps] "r" (status));

ARM Trusted Firmware Porting Guide

Introduction

Porting the ARM Trusted Firmware to a new platform involves making some mandatory and optional modifications for both the cold and warm boot paths.

Common Modifications

Common mandatory modifications

A platform port must enable the Memory Management Unit (MMU) with identity mapped page tables, and enable both the instruction and data caches for each BL stage.
In the ARM FVP port, each BL stage configures the MMU in its platform-specific architecture setup function, for example blX_plat_arch_setup().

2.2 Handling reset

BL1 by default implements the reset vector where execution starts from a cold or warm boot.
BL3-1 can be optionally set as a reset vector using the RESET_TO_BL31 make variable.

2.3 Common optional modifications

The following are helper functions implemented by the firmware that perform common platform-specific tasks.
  • int platform_get_core_pos(unsigned long)
  • A platform may need to convert the MPIDR of a CPU to an absolute number, which can be used as a CPU-specific linear index into blocks of memory.
    This routine contains a simple mechanism to perform this conversion, using the assumption that each cluster contains a maximum of 4 CPUs:
    
    linear index = cpu_id + (cluster_id * 4)
    
    cpu_id = 8-bit value in MPIDR at affinity level 0
    cluster_id = 8-bit value in MPIDR at affinity level 1    
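
The conversion above can be sketched in C. This is an illustration of the formula only, not the firmware's own routine; the function name and the MAX_CPUS_PER_CLUSTER macro are hypothetical.

```c
#include <stdint.h>

/* Assumption stated in the text: at most 4 CPUs per cluster. */
#define MAX_CPUS_PER_CLUSTER 4u

/* linear index = cpu_id + (cluster_id * 4) */
unsigned int core_pos_from_mpidr(uint64_t mpidr)
{
    unsigned int cpu_id     = (unsigned int)(mpidr & 0xFF);        /* affinity level 0 */
    unsigned int cluster_id = (unsigned int)((mpidr >> 8) & 0xFF); /* affinity level 1 */
    return cpu_id + cluster_id * MAX_CPUS_PER_CLUSTER;
}
```

CPU 2 in cluster 1 maps to index 6, so per-CPU data can be kept in a flat array indexed by this value.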
        

3 Boot Loader stage specific modifications

3.1 Boot Loader stage 1 (BL1)

3.2 Boot Loader stage 2 (BL2)

3.3 Boot Loader stage 3-1 (BL3-1)

3.3.1 Power State Coordination Interface (in BL3-1)

The ARM Trusted Firmware's implementation of the PSCI API is based around the concept of an affinity instance.
Each affinity instance can be uniquely identified in a system by a CPU ID (the processor MPIDR is used in the PSCI interface) and an affinity level.

CPU affinity enables binding a process or multiple processes to a specific CPU core in a way that the process(es) will run from that specific core only.
When performance testing on a host with many cores, it is wise to run multiple instances of a process, each one on a different core.
This enables higher CPU utilization.

PSCI implementation (in BL3-1)

Interrupt Management framework (in BL3-1)

Crash Reporting mechanism (in BL3-1)

C Library

Storage abstraction layer


Fixed Virtual Platforms(FVP)

Fixed Virtual Platforms (FVPs) are complete simulations of an Arm system, including processor, memory and peripherals.
These are set out in a "programmer's view", which gives you a comprehensive model on which to build and test your software.


Learning operating system development using Linux kernel and Raspberry Pi

Introduction

Contribution guide

Prerequisites

Lesson 1: Kernel Initialization

1.1 Introducing RPi OS, or bare metal “Hello, world!”
Linux
1.2 Project structure
1.3 Kernel build system
1.4 Startup sequence
1.5 Exercises

Lesson 2: Processor initialization

2.1 RPi OS

Exception levels

Each ARM processor that supports the ARMv8 architecture has 4 exception levels.
You can think of an exception level (or EL for short) as a processor execution mode in which only a subset of all operations and registers is available.
The least privileged exception level is level 0. When the processor operates at this level, it mostly uses only general purpose registers (X0 - X30) and the stack pointer register (SP). EL0 also allows using the LDR and STR instructions to load and store data from and to memory, and a few other instructions commonly used by a user program.

An operating system has to deal with exception levels because it needs to implement process isolation.
A user process should not be able to access another process’s data.
To achieve such behavior, an operating system always runs each user process at EL0.
Operating at this exception level, a process can only use its own virtual memory and can’t execute any instructions that change virtual memory settings.
So, to ensure process isolation, an OS needs to prepare a separate virtual memory mapping for each process and put the processor into EL0 before transferring execution to a user process.

An operating system itself usually works at EL1.
While running at this exception level, the processor gets access to the registers that allow configuring virtual memory settings, as well as to some system registers. Raspberry Pi OS will also be using EL1.

EL2 is used when a hypervisor is present.
In this case the host operating system runs at EL2 and guest operating systems can only use EL1.
This allows the host OS to isolate guest OSes in the same way that an OS isolates user processes.

EL3 is used for transitions between the ARM “Secure world” and the “Non-secure world”.
This abstraction exists to provide full hardware isolation between the software running in the two different “worlds”.
An application from the “Non-secure world” has no way to access or modify information (both instructions and data) that belongs to the “Secure world”, and this restriction is enforced at the hardware level.

Finding current Exception level

A small function can determine the current exception level:

.globl get_el
get_el:
    mrs x0, CurrentEL
    lsr x0, x0, #2
    ret
Here we use the mrs instruction to read the value from the CurrentEL system register into the x0 register.
Then we shift this value 2 bits to the right (we need to do this because the lowest 2 bits in the CurrentEL register are reserved and always have the value 0).
Finally, register x0 holds an integer indicating the current exception level.
To display this value:

    int el = get_el();
    printf("Exception level: %d \r\n", el);
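
The same decoding can be written in C for a raw CurrentEL value that has already been read. This is a small sketch with a hypothetical function name. Unlike the assembly above, it also masks the result to two bits.

```c
#include <stdint.h>

/* The exception level is held in bits [3:2] of CurrentEL;
 * bits [1:0] are reserved and read as 0. */
unsigned int el_from_currentel(uint64_t currentel)
{
    return (unsigned int)((currentel >> 2) & 0x3);
}
```

A raw value of 0xC therefore corresponds to EL3 and 0x4 to EL1.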

Changing current exception level

In the ARM architecture there is no way for a program to increase its own exception level without the participation of software that already runs at a higher level.
The current EL can be changed only if an exception is generated. This can happen if:
  • a program executes an illegal instruction or makes an illegal access
  • for example, it tries to access a memory location at a nonexistent address or executes an undefined instruction
  • an application executes the svc instruction to generate an exception on purpose
  • a hardware interrupt occurs
Whenever an exception is generated, the following sequence of steps takes place (assuming that the exception is handled at ELn):
  • The address of the current instruction is saved in the ELR_ELn register (Exception Link Register).
  • The current processor state is stored in the SPSR_ELn register (Saved Program Status Register).
  • An exception handler is executed and does whatever job it needs to do.
  • The exception handler also needs to save the state of all general purpose registers and restore it afterwards.
  • The exception handler executes the eret instruction.
  • This instruction restores the processor state from SPSR_ELn and resumes execution from the address stored in the ELR_ELn register.
An important thing to know is that:
  • The exception handler is not obliged to return to the same location from which the exception originated.
  • Both ELR_ELn and SPSR_ELn are writable, and the exception handler can modify them if it wants to.
We are going to use this technique to our advantage when we try to switch from EL3 to EL1 in our code.
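
The sequence above can be modelled in a few lines of C. This is a deliberately simplified toy model, not a description of the hardware: the struct and function names are hypothetical, and the only piece of processor state it tracks is the exception level, stored in bits [3:2] of pstate.

```c
#include <stdint.h>

/* Toy model: one ELR and one SPSR slot per exception level. */
struct cpu_model {
    uint64_t pc;
    uint32_t pstate;    /* bits [3:2] hold the current EL in this model */
    uint64_t elr[4];
    uint32_t spsr[4];
};

unsigned int model_current_el(const struct cpu_model *cpu)
{
    return (cpu->pstate >> 2) & 0x3;
}

/* Exception entry: save the return address and state, then move to
 * target_el and continue at the handler address. */
void take_exception(struct cpu_model *cpu, unsigned int target_el,
                    uint64_t handler)
{
    cpu->elr[target_el]  = cpu->pc;
    cpu->spsr[target_el] = cpu->pstate;
    cpu->pstate = (cpu->pstate & ~0xCu) | (target_el << 2);
    cpu->pc = handler;
}

/* eret: restore state and address from the current level's registers. */
void exception_return(struct cpu_model *cpu)
{
    unsigned int el = model_current_el(cpu);
    cpu->pc     = cpu->elr[el];
    cpu->pstate = cpu->spsr[el];
}
```

Because the saved address and state are ordinary writable values, code running at EL3 can fill in elr[3] and spsr[3] by hand and then call exception_return to land at a chosen address in EL1, without any exception having been taken first.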

Switching to EL1

Strictly speaking, an operating system is not obliged to switch to EL1, but EL1 is a natural choice because this level has just the right set of privileges to implement all common OS tasks.

#include "arm/sysregs.h"

#include "mm.h"

.section ".text.boot"

.globl _start
_start:
	mrs	x0, mpidr_el1		
	and	x0, x0,#0xFF		// Check processor id
	cbz	x0, master		// Hang for all non-primary CPU
	b	proc_hang

proc_hang: 
	b 	proc_hang

master:
	ldr	x0, =SCTLR_VALUE_MMU_DISABLED
	msr	sctlr_el1, x0		

	ldr	x0, =HCR_VALUE
	msr	hcr_el2, x0

	ldr	x0, =SCR_VALUE
	msr	scr_el3, x0

	ldr	x0, =SPSR_VALUE
	msr	spsr_el3, x0

	adr	x0, el1_entry		
	msr	elr_el3, x0

	eret				

el1_entry:
	adr	x0, bss_begin
	adr	x1, bss_end
	sub	x1, x1, x0
	bl 	memzero

	mov	sp, #LOW_MEMORY
	bl	kernel_main
	b 	proc_hang		// should never come here
Analysis:
  1. sctlr_el1
  2. sctlr_el1 is responsible for configuring different parameters of the processor, when it operates at EL1.
    For example, it controls whether the cache is enabled and, what is most important for us, whether the MMU (Memory Management Unit) is turned on.
    sctlr_el1 is accessible from all exception levels higher than or equal to EL1 (you can infer this from the _el1 postfix).
    
    // Some bits in the description of sctlr_el1 register are marked as RES1. 
    // Those bits are reserved for future usage and should be initialized with 1.
    // The outer parentheses keep the value intact when the macro is used inside a larger expression.
    #define SCTLR_RESERVED               ((3 << 28) | (3 << 22) | (1 << 20) | (1 << 11))
    
    // This field controls endianness of explicit data access at EL1.
    // We are going to configure the processor to work only with little-endian format.
    #define SCTLR_EE_LITTLE_ENDIAN          (0 << 25)
    // This one controls endianness of explicit data access at EL0.
    #define SCTLR_EOE_LITTLE_ENDIAN         (0 << 24)
    
    // Disable instruction cache.
    #define SCTLR_I_CACHE_DISABLED          (0 << 12)
    
    // Disable data cache.
    #define SCTLR_D_CACHE_DISABLED          (0 << 2)
    
    // Disable MMU.
    #define SCTLR_MMU_DISABLED            (0 << 0)
    #define SCTLR_MMU_ENABLED             (1 << 0)
    
    #define SCTLR_VALUE_MMU_DISABLED	(SCTLR_RESERVED | SCTLR_EE_LITTLE_ENDIAN | SCTLR_I_CACHE_DISABLED | SCTLR_D_CACHE_DISABLED | SCTLR_MMU_DISABLED)        
            
  3. hcr_el2
  4. Even though we are not going to implement our own hypervisor, we still need to use this register because, among other settings, it controls the execution state at EL1.
    Execution state must be AArch64 and not AArch32.
    
    #define HCR_RW	    			(1 << 31)
    #define HCR_VALUE			HCR_RW    
        
  5. scr_el3
  6. This register is responsible for configuring security settings.
    For example, it controls whether all lower levels are executed in “secure” or “nonsecure” state.
    It also controls execution state at EL2.
    
    #define SCR_RESERVED	    		(3 << 4)
    #define SCR_RW				(1 << 10)
    #define SCR_NS				(1 << 0)
    #define SCR_VALUE	    	    	(SCR_RESERVED | SCR_RW | SCR_NS)   
        
  7. spsr_el3
  8. spsr_el3 contains the processor state that will be restored after we execute the eret instruction.
    It is worth saying a few words explaining what processor state is.
    Processor state includes the following information:
    • Condition Flags
    • Those flags contain information about the previously executed operation: whether the result was negative (N flag), zero (Z flag), has unsigned overflow (C flag) or has signed overflow (V flag).
      Values of those flags can be used in conditional branch instructions.
      For example, b.eq instruction will jump to the provided label only if the result of the last comparison operation is equal to 0.
      The processor checks this by testing whether Z flag is set to 1.
    • Interrupt disable bits
    • Those bits allow enabling and disabling different types of interrupts.
    • Some other information, required to fully restore the processor execution state after an exception is handled.
    Usually spsr_el3 is saved automatically when an exception is taken to EL3.
    However this register is writable, so we take advantage of this fact and manually prepare processor state.
    
    // After we change EL to EL1 all types of interrupts will be masked (or disabled, which is the same).        
    #define SPSR_MASK_ALL 			(7 << 6)
    // At EL1 we can either use our own dedicated stack pointer or use EL0 stack pointer.
    // EL1h mode means that we are using EL1 dedicated stack pointer.
    #define SPSR_EL1h			(5 << 0)
    #define SPSR_VALUE			(SPSR_MASK_ALL | SPSR_EL1h)        
            
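  9. elr_el3
  10. elr_el3 holds the address that execution returns to when the eret instruction is executed. The boot code sets it to the address of the el1_entry label.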

2.2 Linux

2.3 Exercises

Lesson 3: Interrupt handling

3.1 RPi OS Linux 3.2 Low level exception handling 3.3 Interrupt controllers 3.4 Timers 3.5 Exercises

Lesson 4: Process scheduler

4.1 RPi OS Linux 4.2 Scheduler basic structures 4.3 Forking a task 4.4 Scheduler 4.5 Exercises

Lesson 5: User processes and system calls

5.1 RPi OS 5.2 Linux 5.3 Exercises

Lesson 6: Virtual memory management

6.1 RPi OS 6.2 Linux (In progress) 6.3 Exercises

Lesson 7: Signals and interrupt waiting (To be done)

Lesson 8: File systems (To be done)

Lesson 9: Executable files (ELF) (To be done)

Lesson 10: Drivers (To be done)

Lesson 11: Networking (To be done)


嵌入式系統建構:開發運作於STM32的韌體程式 (Building Embedded Systems: Developing Firmware That Runs on STM32)


Programming with 64-Bit ARM Assembly Language

Single Board Computer Development for Raspberry Pi and Mobile Devices
Stephen Smith

Introduction

This book delves into how these are programmed at the bare metal level and provides insight into their architecture.
Knowing how the processor works will let you write more efficient C code.
Source Code Location: https://github.com/Apress/Programming-with-64-Bit-ARM--Assembly-Languag

CHAPTER 1 Getting Started

The idea was to use reduced instruction set computer (RISC) technology as opposed to complex instruction set computer (CISC).
Writing in Assembly is harder, as you must solve problems with memory addressing and CPU registers that are all handled transparently by high-level languages.

Hardware

  • Broadcom BCM2711, quad-core Cortex-A72 (ARM v8) 64-bit processor at 1.5 GHz
  • 4GB LPDDR4-3200 SDRAM
  • 2.4 GHz/5.0 GHz IEEE 802.11b/g/n/ac wireless networking, Bluetooth 5.0, BLE
  • Gigabit Ethernet
  • 2 USB 3.0 ports; 2 USB 2.0 ports
  • Raspberry Pi standard 40-pin GPIO header
  • 2 micro-HDMI ports (up to 4Kp60 display output)
  • 2-lane MIPI DSI display port
  • 2-lane MIPI CSI camera port
  • 4-pole stereo audio and composite video port
  • H264 (1080p60 decode, 1080p30 encode)
  • OpenGL ES 3.0 graphics
  • Micro-SD card slot
  • 5V DC input via USB-C connector (minimum 3A)
  • 5V DC input via GPIO header (minimum 3A)
  • 5V DC input via PoE (requires a separate PoE add-on board)
  • Operating temperature: 0 - 50 degrees C

Software

Raspberry Pi OS with desktop

  • Downloading the Operating System
    • Raspberry Pi OS (Legacy) with desktop
    • 
      https://downloads.raspberrypi.org/raspios_oldstable_armhf/images/raspios_oldstable_armhf-2022-04-07/2022-04-04-raspios-buster-armhf.img.xz      
            
    • Raspberry Pi OS Lite
    • 
      https://downloads.raspberrypi.org/raspios_lite_armhf/images/raspios_lite_armhf-2022-04-07/2022-04-04-raspios-bullseye-armhf-lite.img.xz        
              
    • Raspberry Pi OS with desktop
    • 
      https://downloads.raspberrypi.org/raspios_armhf/images/raspios_armhf-2022-04-07/2022-04-04-raspios-bullseye-armhf.img.xz     
              
    • Raspberry Pi OS with desktop and recommended software(32 bits)
    • 
      https://downloads.raspberrypi.org/raspios_full_armhf/images/raspios_full_armhf-2022-04-07/2022-04-04-raspios-bullseye-armhf-full.img.xz        
              
    • Raspberry Pi OS with desktop(64 bits)
    • 
      https://downloads.raspberrypi.org/raspios_arm64/images/raspios_arm64-2022-04-07/2022-04-04-raspios-bullseye-arm64.img.xz        
              
  • Installing the Operating System
  • Install Raspberry Pi OS using Raspberry Pi Imager:
    
    $ sudo apt install rpi-imager  
    	
    Open Raspberry Pi Imager and choose the required OS from the list presented.
    Or, on Linux, you can use the standard command line tools:
    
    $ sudo dd if=2021-10-30-raspios-bullseye-armhf.img of=/dev/sdX bs=4M conv=fsync
    	
  • Use a USB stick as the system partitions
  • 
    $ tree /dev/disk/
    ...
    └── by-uuid
        ├── 137e2641-afc7-4d05-bfbf-a40cad4f8261 -> ../../sda1  (swap, 8G)
        ├── ec464b47-461d-4b86-acc8-6ab342d6a8e3 -> ../../sda2  (/usr, 8G)
        ├── cff70456-f637-4eed-945b-3c95a8bc48db -> ../../sda3  (/opt, 1G)
        ├── 69b33879-49f9-4d4e-b787-07b0b60211ba -> ../../sda5  (/var, 2G)
        └── f0d406f3-3da0-4fa7-8aa7-9eaf2b74047e -> ../../sda6  (/home, 9.6G)
    
    $ sudo mkswap /dev/sda1
    $ sudo swapon -U 137e2641-afc7-4d05-bfbf-a40cad4f8261
    
    $ cat /etc/fstab
    proc            /proc           proc    defaults          0       0
    PARTUUID=003e8b7d-01  /boot           vfat    defaults,flush    0       2
    PARTUUID=003e8b7d-02  /               ext4    defaults,noatime  0       1
    # a swapfile is not a swap partition, no line here
    #   use  dphys-swapfile swap[on|off]  for that
    # /dev/sda1 for swap
    UUID=137e2641-afc7-4d05-bfbf-a40cad4f8261    none    swap    sw      0   0
    # /dev/sda2 for /usr 
    UUID=ec464b47-461d-4b86-acc8-6ab342d6a8e3   /usr   ext4   defaults   0       2
    # /dev/sda3 for /opt
    UUID=cff70456-f637-4eed-945b-3c95a8bc48db   /opt   ext4   defaults   0       2
    # /dev/sda5 for /var
    UUID=69b33879-49f9-4d4e-b787-07b0b60211ba   /var   ext4   defaults   0       2
    # /dev/sda6 for /home
    UUID=f0d406f3-3da0-4fa7-8aa7-9eaf2b74047e   /home  ext4   defaults   0       2
    
        
  • Install the Chewing input method by entering the command
  • 
    $ sudo apt-get -y install scim-chewing
    	
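    Log out and log back in for the change to take effect. After you press Ctrl + Space in an input field, the icon in the upper-right corner of the screen changes, and you can then type Chinese.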
If you are installing Raspberry Pi OS Lite and intend to run it headless, you will still need to create a new user account. Since you will not be able to create the user account on first boot, you MUST configure the operating system using the Advanced Menu.

Ubuntu


$ xzcat /home/jerry/Downloads/ubuntu-22.04-preinstalled-desktop-arm64+raspi.img.xz | sudo dd of=/dev/sdd bs=32M; sync

Kali Linux

Kali Linux works very well, and we will be using it to test all the programs in this book.

Kali Linux contains several hundred tools targeted towards various information security tasks, such as Penetration Testing, Security Research, Computer Forensics and Reverse Engineering.

To install a pre-built image of the standard build of Kali Linux on your Raspberry Pi 4, follow these instructions:

  1. Get a fast microSD card with at least 16GB capacity.
  2. Class 10 cards are highly recommended.
  3. Download and validate our preferred Kali Raspberry Pi 4 image from the downloads area.
  4. The process for validating an image is described in more detail on Downloading Kali Linux.
  5. Use the dd utility to image this file to your microSD card (same process as making a Kali USB).
  6. Assume the storage device is located at /dev/sdd.
    
    $ xzcat kali-linux-2022.1-raspberry-pi-arm64.img.xz | sudo dd of=/dev/sdd bs=4M status=progress  
    	
Once the dd operation is complete, boot up the Raspberry Pi 4 with the microSD plugged in.
You should be able to log in to Kali.

    User: kali
    Password: kali
Enable ssh login:
  1. Install Kali Linux remote SSH-OpenSSH server
  2. 
    $ sudo apt-get install ssh  
    $ sudo service ssh start
    	
  3. Enable Kali Linux Remote SSH Service
  4. 
    $ sudo update-rc.d -f ssh remove
    $ sudo update-rc.d -f ssh defaults
    	
  5. Check whether the service is running.
  6.   
    $ sudo apt-get install chkconfig
    $ sudo chkconfig -l ssh  
    	
Default Tool Credentials

ARM Assembly Instructions

The ARM is what is called a RISC computer: there are fewer instructions and each one is simple, so the processor can execute each instruction quickly.

CPU Registers

The registers are part of the CPU circuitry allowing instant access, whereas memory is a separate component and there is a transfer time for the CPU to access it.
In all computers, data is not operated on in the computer’s memory; instead it’s loaded into a CPU register, and the data processing or arithmetic operation is performed in the registers.

If you want to add two numbers, you might do the following:

  1. Load one into one register and the other into another register.
  2. Perform the add operation putting the result into a third register.
  3. Copy the answer from the results register into memory.
A 64-bit program on an ARM processor in user mode can see:
  • X0–X30
  • These 31 registers are general purpose; you can use them for anything you like, though some have standard agreed-upon usage that we will cover later.
  • SP, XZR
  • The stack pointer or zero register depending on the context.
  • X30, LR
  • The link register.
    If you call a function, this register will be used to hold the return address.
    As this is a common operation, you should avoid using this register for other things.
  • PC
  • The program counter.
    The memory address of the currently executing instruction.
All the X registers can be operated on as 32-bit registers by referring to them as W0–W30 and WZR. When we do this, the instruction will use the lower 32 bits of the register and set the upper 32 bits to zero.

Using 32 bits saves memory.

ARM Instruction Format

Each ARM binary instruction is 32 bits long.
Every bit in the instruction is used to tell the processor what to do.
There are quite a few instruction formats, and it can be helpful to know how the bits for each instruction are packed into 32 bits.

Since there are 32 registers in user mode, it takes 5 bits to specify a register.

Because instructions are small and of fixed length, the processor doesn’t need to start decoding an instruction to know how long it is and hence where the next instruction starts.
This is a key feature to allowing processing parallelism and efficiency.

Each instruction that takes registers can either use the 32-bit W version or the 64-bit X version.
To specify which is the case, the high bit of each instruction specifies how we are viewing the registers.

Data processing instructions are the move, arithmetic, logical, comparison and multiply instructions.
The instruction encoding of the data processing instruction:
An instruction in isolation takes three clock cycles,
  1. one to load the instruction from memory
  2. one to decode the instruction, and then
  3. one to execute the instruction
The ARM is smart and works on three instructions at a time, each at a different step in the process, called the instruction pipeline.

Computer Memory

The 64-bit mode means:
  • Memory addresses are specified using 64 bits.
  • The CPU registers are each 64 bits wide and perform 64-bit integer arithmetic.
Instructions are 32 bits in size.

You can load from memory by using a register to specify the address to load.
This is called indirect memory access.

About the GCC Assembler

The general way you specify Assembly instructions is:

label:     opcode    operands
  • label:
  • optional and only required if you want the instruction to be the target of a branch instruction.
  • opcodes
  • each one is a short mnemonic such as
    • ADD for addition
    • LDR for load a register
    • B for branch
  • There are quite a few different formats for the operands
  • Install the GNU Compiler Collection (GCC) toolchain on an x86_64 host
  • 
    $ sudo apt install -y build-essential
    $ sudo apt install -y crossbuild-essential-arm64
    $ sudo apt install -y crossbuild-essential-armhf
    
  • Native toolchains
  • 
    $ sudo apt update && sudo apt dist-upgrade
    $ sudo apt-get install build-essential gawk gcc g++ gfortran git texinfo bison libncurses-dev bc flex libssl-dev make
    

Hello World

HelloWorld.s:

.global _start // Provide program starting address
_start:
    mov     x0, #1      /* 1 = StdOut */
    ldr       x1, =helloworld /* string to print */
    mov     x2, #13     /* length of our string */
    mov     x8, #64     /* linux write() system call */
    svc      0           /* call Linux system call */

    // setup parameters to exit the program gracefully
    mov    x0, #0     // return code = 0
    mov    x8, #93    // service call 93
    svc     0           /* call Linux system call */

.data
helloworld:    .ascii  "Hello World!\n"
Build the executable:

$ as -o HelloWorld.o HelloWorld.s
$ ld -o HelloWorld HelloWorld.o
$ ./HelloWorld
Hello World!

About Comments

This is the same as comments in C/C++ code:
  • //double slashes
  • /* and */

Where to Start

The Assembler marks the statement containing _start as the program entry point; then the linker can find it.
Only one file can contain _start.

Assembly Instructions


svc 0 
command that executes software interrupt number 0.
This branches to the interrupt handler in the Linux kernel.

Data

A label “helloworld” followed by an .ascii directive which allocates one or more bytes of memory in the current section, and defines the initial contents of the memory from a string literal.

Calling Linux

This program makes two Linux system calls to do its work:
  • The first is the Linux write to file command (#64).
  • The second is the Linux exit command (#93), which terminates the program.
For any Linux system call,
  • Each system call number is specified by putting its function number in X8.
  • Put the parameters in registers X0–X7 depending on how many parameters are needed.
  • a return code is placed in X0 for checking the execution result
The software interrupt has another benefit of providing a standard mechanism to switch privilege levels.

Reverse Engineering Our Program


$ objdump -s -d HelloWorld.o

HelloWorld.o:     file format elf64-littleaarch64

Contents of section .text:
 0000 200080d2 e1000058 a20180d2 080880d2   ......X........
 0010 010000d4 000080d2 a80b80d2 010000d4  ................
 0020 00000000 00000000                    ........        
Contents of section .data:
 0000 48656c6c 6f20576f 726c6421 0a        Hello World!.   

Disassembly of section .text:

0000000000000000 <_start>:
   0:	d2800020 	mov	x0, #0x1                   	// #1
   4:	580000e1 	ldr	x1, 20 <_start+0x20>
   8:	d28001a2 	mov	x2, #0xd                   	// #13
   c:	d2800808 	mov	x8, #0x40                  	// #64
  10:	d4000001 	svc	#0x0
  14:	d2800000 	mov	x0, #0x0                   	// #0
  18:	d2800ba8 	mov	x8, #0x5d                  	// #93
  1c:	d4000001 	svc	#0x0
	...
Let’s investigate the binary representation of the first MOV instruction which compiled to 0xd2800020:
  • The 1st bit is 1
  • It means to use the 64-bit version of the registers, in this case X0 rather than W0.
  • The 3rd bit is 0
  • It means that this instruction doesn’t set any flags that would affect conditional instructions.
  • The 2nd bit combined with the 4-th to 9-th bits make up the opcode for this MOV instruction.
  • This is move wide immediate, meaning it contains a 16-bit immediate value as the operand.
  • The 10-th and 11-th bits of 0 indicate there is no shift operation involved.
  • The 12-th to 27-th bits are the immediate value which is 1
  • The last 5 bits are the register to load.
  • These are 0 since we are loading register X0.

Chapter 2: Loading and Adding

This chapter introduces the ARM instruction set by going slowly through the MOV and ADD instructions.

Negative Numbers

If negative numbers were represented with a separate sign bit, the CPU would have to look at the sign bits, then decide whether to add or subtract and in which order.

About Two’s Complement

Two’s complement is to change all the 1s to 0s and all the 0s to 1s and then add 1.

-3 can be represented as

  
  ~ (0000 0011) +1 = 1111 1101 = 0xFD
For a 1-byte calculation, the carry out of the top bit is discarded:
  
5 - 3 = 5 + (-3) = 0x05 + 0xFD = 0x102, which truncates to 0x02 = 2

About Gnome Programmer’s Calculator

The Gnome programmer’s calculator can calculate the two’s complement.

About One’s Complement

If we don’t add 1, and just change all the 1s to 0s and vice versa, then this is called one’s complement.

Big vs. Little Endian

Big endian is how we normally deal with numbers: the most significant byte or digits are placed leftmost in the structure (the big end, the low memory address). Known as the "network byte order," the TCP/IP Internet protocol also uses big endian regardless of the hardware at either end.

About Bi-endian

Pros of Little Endian

Even though Linux uses little endian, many protocols like TCP/IP used on the Internet use big endian and so require a transformation when moving data from the computer to the outside world.

Shifting and Rotating

0x30 = 3 * 16 = 3 * 2^4

About Carry Flag

When instructions execute, they can optionally set some flags that contain useful information on what happened. Then other instructions can test these flags and process accordingly.

About the Barrel Shifter

Basics of Shifting and Rotating

  • Logical shift left
  • The last bit shifted out ends up in the carry flag.
  • Logical shift right
  • the last bit shifted out ends up in the carry flag.
  • Arithmetic shift right
  • If we want to preserve the sign bit, use arithmetic shift right. Here a 1 comes in from the left, if the number is negative, and a 0 if it is positive.
  • Rotate right

Loading Registers

Instruction Aliases

MOV isn’t an ARM Assembly instruction; it’s an alias.
The Assembler finds a real ARM instruction to do the job.
For ex.,

ADD X0, XZR, X1
This instruction adds the contents of register X1 to the zero register and puts the result in X0.

If you use objdump, it might show the same alias you used, another alternate alias, or the real instruction. There is a “-M no-aliases” option for objdump where you can see the true underlying instruction.

MOV/MOVK/MOVN

There are several forms of the MOV instruction:
  • MOV(Register to Register)
  • For example:
    
    MOV X1, X2
    		
    This copies register X2 into register X1.
  • MOVK(move keep)
  • This loads the 16-bit immediate operand into one of four positions in the register without disturbing the other 48 bits.
    For ex., to load register X2 with the 64-bit hex value 0x1234FEDC4F5D6E3A
    
    MOV     X2, #0x6E3A
    MOVK   X2, #0x4F5D, LSL #16
    MOVK   X2, #0xFEDC, LSL #32
    MOVK   X2, #0x1234, LSL #48   
       
    The example above adds a shift operator to the second operand.

About Operand2

All the ARM’s data processing instructions have the option of taking a flexible Operand2 as one of their parameters.
There are three formats for Operand2:
  • A register and a shift
  • You can specify a register and a shift.
    For ex.,
    
    MOV   X1, X2, LSL #1    // Logical shift left
    MOV   X1, X2, LSR #1    // Logical shift right
    MOV   X1, X2, ASR #1    // Arithmetic shift right
    MOV   X1, X2, ROR #1   // Rotate right
        
    To make the code a little more readable, the Assembler provides mnemonics (aliases) for these to generate the same byte code,
    
    LSL   X1, X2, #1// Logical shift left
    LSR   X1, X2, #1// Logical shift right
    ASR   X1, X2, #1// Arithmetic shift right
    ROR   X1, X2, #1// Rotate right  
      
  • A register and an extension operation
  • The extension operations let us extract a byte, halfword, or word from the second register.
    • uxtb
    • Unsigned extend byte
    • uxth
    • Unsigned extend halfword
    • uxtw
    • Unsigned extend word
    • sxtb
    • Sign-extend byte
    • sxth
    • Sign-extend halfword
    • sxtw
    • Sign-extend word
    The extension operators aren’t available for the MOV instruction.
  • A small number and a shift
  • The other form of operand2 consists of a small number and an optional shift amount.
    
      // Too big for #imm16
         MOV    X1, #0xAB000000
         
    will be translated by the Assembler to
    
    MOV   x1, #0xAB00, LSL #16
      

MOVN(Move Not)

It works just like MOV, except it reverses all the 1s and 0s as it loads the register.
It applies a logical NOT operation to each bit in the word you are loading into the register.
Its main usage:
  • To calculate the one’s complement
  • Multiply by -1.
  • The negative of a number is the two’s complement of the number, or the one’s complement plus one.

MOV Examples

The following example illustrates the MOV instructions.
This program doesn’t do anything besides move various numbers into registers.
movexamps.s,

// Examples of the MOV instruction.
//
.global _start  // Provide program starting address

// Load X2 with 0x1234FEDC4F5D6E3A first using MOV and MOVK
_start:
    mov x2, #0x6E3A
    MOVK X2, #0x4F5D, LSL #16
    MOVK X2, #0xFEDC, LSL #32
    MOVK X2, #0x1234, LSL #48
    // Just move W2 into W1
    MOV W1, W2
    // Now lets see all the shift versions of MOV
    MOV X1,X2,LSL #1  // Logical shift left
    MOV X1, X2, LSR #1 // Logical shift right
    MOV X1, X2, ASR #1 // Arithmetic shift right
    // Repeat the above shifts using mnemonics.
    LSL X1,X2,#1  // Logical shift left
    LSR X1,X2,#1  // Logical shift right
    ASR X1,X2,#1  //Arithmetic shift right
    ROR X1,X2,#1  // Rotate right

    // Example that works with 8 bit immediate and shift
    MOV X1, #0xAB000000  // Too big for #imm16
    // Example that can't be represented and results in an error
    // Uncomment the instruction if you want to see the error
    //   MOV   X1, #0xABCDEF11  // Too big for #imm16 and can't be represented.

    // Example of MOVN
    MOVN W1, #45

    // Example of a MOV that the Assembler will change to MOVN
    MOV W1, #0xFFFFFFFE  // (-2)

    // Setup the parameters to exit the program
    // and then call Linux to do it.
    MOV X0, #0  // Use 0 return code
    MOV X8, #93  // Service command code 93 terminates this program
    SVC 0  // Call linux to terminate    
    
We can see the true ARM 64-bit instructions that are produced by the Assembler by objdump:

$ objdump -s -d -M no-aliases movexamps.o

movexamps.o:     file format elf64-littleaarch64

Contents of section .text:
 0000 42c78dd2 a2eba9f2 82dbdff2 8246e2f2  B............F..
 0010 e103022a e10702aa e10742aa e10782aa  ...*......B.....
 0020 41f87fd3 41fc41d3 41fc4193 4104c293  A...A.A.A.A.A...
 0030 0160b5d2 a1058012 21008012 000080d2  .`......!.......
 0040 a80b80d2 010000d4                    ........        

Disassembly of section .text:

0000000000000000 <_start>:
   0:	d28dc742 	movz	x2, #0x6e3a
   4:	f2a9eba2 	movk	x2, #0x4f5d, lsl #16
   8:	f2dfdb82 	movk	x2, #0xfedc, lsl #32
   c:	f2e24682 	movk	x2, #0x1234, lsl #48
  10:	2a0203e1 	orr	w1, wzr, w2
  14:	aa0207e1 	orr	x1, xzr, x2, lsl #1
  18:	aa4207e1 	orr	x1, xzr, x2, lsr #1
  1c:	aa8207e1 	orr	x1, xzr, x2, asr #1
  20:	d37ff841 	ubfm	x1, x2, #63, #62
  24:	d341fc41 	ubfm	x1, x2, #1, #63
  28:	9341fc41 	sbfm	x1, x2, #1, #63
  2c:	93c20441 	extr	x1, x2, x2, #1
  30:	d2b56001 	movz	x1, #0xab00, lsl #16
  34:	128005a1 	movn	w1, #0x2d
  38:	12800021 	movn	w1, #0x1
  3c:	d2800000 	movz	x0, #0x0
  40:	d2800ba8 	movz	x8, #0x5d
  44:	d4000001 	svc	#0x0
    
We can see the shift instructions were converted into UBFM, SBFM, and EXTR instructions.

ADD/ADC

These instructions all add their second and third parameters and put the result in their first parameter register destination (Rd):

ADD{S} Xd, Xs, Operand2
ADC{S} Xd, Xs, Operand2
The registers Rd and source register (Rs) can be the same.
Examples,

// the immediate value can be 12-bits, so 0-4095
// X2 = X1 + 4000
   ADD   X2, X1, #4000
// the shift on an immediate can be 0 or 12
// X2 = X1 + 0x20000
   ADD   X2, X1, #0x20, LSL 12
// simple addition of two registers
// X2 = X1 + X0
   ADD   X2, X1, X0
// addition of a register with a shifted register
// X2 = X1 + (X0 * 4)
   ADD   X2, X1, X0, LSL 2
// With register extension options
// X2 = X1 + signed extended byte(X0)
   ADD   X2, X1, X0, SXTB
// X2 = X1 + zero extended hal  
To print out a number, we must first convert the number to an ASCII string.
There is a trick: we can get one number out of our program via the program’s return code.

/* This is a comment */
.global _start /* '_start' is our entry point and must be global */

_start:          /* Program entry point */
    mov w0, #2 /* Put a 2 inside the register w0 */
    // Setup the parameters to exit the program and then call Linux to do it.
    // W0 is the return code
    MOV X8, #93  // Service command code 93
    SVC 0  // Call linux to terminate
To see the return code after execution:

$ echo $?
2

Add with Carry

We can combine multiple ADD instructions to add arbitrarily large integers. The key to this is the carry flag.
When an addition overflows, it sets the carry flag.
The ARM processor adds 64 bits at a time, so we only need the carry flag if we are dealing with numbers larger than what will fit into 64 bits.
The flags are only updated on request: if we want an instruction to alter them, we place an “S” on the end of the opcode, and the Assembler sets the S bit (bit 29) when it builds the binary version of the instruction.

This example will add two 128-bit integers,

  • registers X2 and X3 for the first 128-bit number
  • registers X4 and X5 for the second 128-bit number
  • X0 and X1 for the result.

ADDS  X1, X3, X5  // Lower order 64-bits
ADC   X0, X2, X4  // Higher order 64-bits
  • ADDS adds the lower order 64 bits and sets the carry flag
  • ADC adds the higher-order 64 bits, plus the carry flag

SUB/SBC


SUB{S} Xd, Xs, Operand2
SBC{S} Xd, Xs, Operand2
The carry flag is used to indicate when a borrow is necessary.
SUBS will clear the carry flag if the result is negative and set it if positive; SBC then subtracts one if the carry flag is clear.

Chapter 3: Tooling Up

GNU Make

Rebuilding a File

A Rule for Building .s Files


%.o : %.s
	as $< -o $@
HelloWorld: HelloWorld.o
	ld -o HelloWorld HelloWorld.o
  • %.s
  • is like a wildcard meaning any .s file.
  • $<
  • is a symbol for the source file.
  • $@
  • is a symbol for the output file.

Defining Variables


TARGET = HelloWorld
OBJS = $(TARGET).o

GDB


sudo apt-get install gdb

Preparing to Debug

To add debug information to our program, we must Assemble it with the -g flag.
Use a Makefile variable to control the debug flag,

ifdef DEBUG
DEBUGFLGS = -g
else
DEBUGFLGS =
endif

Beginning GDB

Commands:
  • gdb executable
  • run
  • runs to completion
  • list
  • lists ten lines.
  • disassemble _start
  • shows the actual code produced by the Assembler with no comments.
  • b _start
  • To set a breakpoint. We can specify a line number, or a symbol for our breakpoint
  • s
  • step through the program
  • i r
  • see the values of the registers
  • c
  • continue to the next breakpoint
  • i b
  • see information about all breakpoints
  • delete 1
  • delete a breakpoint with the delete command, specifying the breakpoint number to delete.
  • x /Nfu addr
  • display content of memory in different formats.
    • N
    • the number of units to be displayed
    • f
    • the display format, commonly used:
      • t
      • binary
      • x
      • hexadecimal
      • d
      • decimal
      • i
      • instruction
      • s
      • string
    • u
    • unit size.
      • b
      • bytes
      • h
      • halfwords (16 bits)
      • w
      • words (32 bits)
      • g
      • giant words (64 bits)
  • q
  • quit gdb

(gdb) x /4ubft _start
0x400078 <_start>:    01000010   11000111   10001101   11010010
(gdb) x /4ubfi _start
   0x400078 <_start>:   mov    x2, #0x6e3a          // #28218
=> 0x40007c <_start+4>: movk    x2, #0x4f5d, lsl #16
   0x400080 <_start+8>: movk    x2, #0xfedc, lsl #32
   0x400084 <_start+12>: movk    x2, #0x1234, lsl #48
(gdb) x /4ubfx _start
0x400078 <_start>:      0x42    0xc7    0x8d    0xd2
(gdb) x /4ubfd _start
0x400078 <_start>:      66      -57     -115    -46

Cross-Compiling

Get all the necessary GNU and Linux tools to compile for ARM,

sudo apt-get install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu
These tools are installed under /usr/aarch64-linux-gnu/, outside the default search path, so they are not picked up by default on an Intel-based host machine.
To use the cross-platform tools, add this path in our makefile:

TOOLPATH = /usr/aarch64-linux-gnu/bin
HelloWorld: HelloWorld.o
	$(TOOLPATH)/ld -o HelloWorld HelloWorld.o
HelloWorld.o: HelloWorld.s
	$(TOOLPATH)/as -o HelloWorld.o HelloWorld.s
It can be faster to do your builds on a more powerful laptop or desktop than on the target.
The workflow is to build the program on a full development (native) system and then transfer the program to the target processor using a USB cable, serial cable, or via Ethernet.

Emulation

There are quite a few different emulators available with Ubuntu Linux running on an Intel CPU.
To play around with Arm assembly without an Arm board, the QEMU user mode emulation is more than sufficient.
  • Executing ARM64 binaries (C to Binary)
  • Set up QEMU user-mode emulation on your x86_64 Linux host system.
    
    $ sudo apt install qemu-user qemu-user-static gcc-aarch64-linux-gnu binutils-aarch64-linux-gnu binutils-aarch64-linux-gnu-dbg build-essential
    		
    Create a file containing a simple C program for testing,
    
    #include <stdio.h>
    
    int main(void) { 
        return printf("Hello, I'm executing ARM64 instructions!\n"); 
    }
    		
    To compile the code as a static executable,
    
    $ aarch64-linux-gnu-gcc -static -o hello64 hello.c
    $ file hello64
    hello64: ELF 64-bit LSB executable, ARM aarch64, version 1 (GNU/Linux), statically linked, BuildID[sha1]=f6e13f22124754ff411cd4c40011b3da72388684, for GNU/Linux 3.7.0, not stripped
    		
    Thanks to qemu-user-static, statically linked aarch64 binary can be run on our x86_64 host directly,
    
    $ ./hello64
    Hello, I'm executing ARM64 instructions!
    		
    To execute a dynamically linked Arm executable on our x86_64 host, the package that makes this possible is qemu-user.
    To compile the code as a dynamically linked executable, compile the C code without the -static flag.
    
    $ aarch64-linux-gnu-gcc -o hello64dyn hello.c
    $ file ./hello64dyn
    ./hello64dyn: ELF 64-bit LSB shared object, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, BuildID[sha1]=8d5a19d29c460ef70c98912db056e5e1ca9e9607, for GNU/Linux 3.7.0, not stripped
    		
    Then, we need to use qemu-aarch64 and supply the aarch64 libraries via the -L flag.
    
    $ qemu-aarch64 -L /usr/aarch64-linux-gnu ./hello64dyn
    Hello, I'm executing ARM64 instructions!        
            
  • Executing ARM32 binaries (C to Binary)
  • The same procedure applies to Arm 32-bit binaries, but we need to install different toolchains for Arm32 (in addition to the previously installed qemu-user packages).
    
    $ sudo apt install gcc-arm-linux-gnueabihf binutils-arm-linux-gnueabihf binutils-arm-linux-gnueabihf-dbg
        

Android NDK

Apple XCode

Source Control and Build Servers

Git

Jenkins

Chapter 4: Controlling Program Flow

Unconditional Branch

An unconditional branch to a label:

B label
The label is encoded as a 26-bit signed offset, counted in 4-byte instructions, from the address of the current instruction (the PC).
This allows a jump of up to 128 megabytes in either direction.
An endless loop:


_start:   MOV X1, #1
           B _start
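
The 128-megabyte range follows from the encoding: a 26-bit signed field counts instructions, and each instruction is 4 bytes. A minimal C sketch of that arithmetic (the function name branch_offset_bytes is made up for illustration):

```c
#include <stdint.h>

/* Sign-extend the 26-bit immediate field of a B instruction and scale it
 * by 4 (every A64 instruction is 4 bytes) to get the byte offset. */
int64_t branch_offset_bytes(uint32_t imm26)
{
    int64_t off = (int64_t)(imm26 & 0x03FFFFFFu);  /* keep the low 26 bits */
    if (off & 0x02000000)                          /* bit 25 is the sign bit */
        off -= 0x04000000;                         /* subtract 2^26 */
    return off * 4;
}
```

The largest backward offset is 2^25 instructions, i.e. 2^27 bytes = 128 MB; the largest forward offset is one instruction short of that.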

About Condition Flags

The condition flags are
  • Negative
  • N is set if the result, read as a signed value, is negative, and cleared if the result is positive or 0.
  • Zero
  • Z is set if the result is 0; this usually denotes an equal result from a comparison.
    If the result is nonzero, this flag is cleared.
  • Carry
  • For addition-type operations, this flag is set if the result produces an unsigned overflow (a carry out of the most significant bit).
    For subtraction-type operations, this flag is set if the result does not require a borrow.
    Also, it’s used in shifting to hold the last bit that is shifted out.
  • Overflow
  • For addition and subtraction, this flag is set if a signed overflow occurred.
    Overflow occurs if the result is greater than or equal to 2^31, or less than -2^31, for a 32-bit operation (2^63 and -2^63 for a 64-bit operation).
These flags are stored in the NZCV system register.
These flags are only set if you append an “S” to the end of the instruction’s opcode, otherwise the flags will remain unmodified.
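
To make the four definitions concrete, here is a small C sketch that computes the flags a 32-bit ADDS would produce. The struct nzcv type and the function name adds32_flags are hypothetical helpers, not part of any Arm header.

```c
#include <stdint.h>

/* Hypothetical helper type: one field per condition flag, each 0 or 1. */
struct nzcv { unsigned n, z, c, v; };

/* Flags that a 32-bit ADDS would produce for a + b. */
struct nzcv adds32_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;                       /* wraps modulo 2^32 */
    struct nzcv f;
    f.n = r >> 31;                            /* sign bit of the result */
    f.z = (r == 0);
    f.c = (r < a);                            /* unsigned carry out of bit 31 */
    f.v = ((~(a ^ b) & (a ^ r)) >> 31) & 1u;  /* operands share a sign, result differs */
    return f;
}
```

For example, 0x7FFFFFFF + 1 sets N and V (signed overflow, no carry), while 0xFFFFFFFF + 1 sets Z and C (unsigned wrap to zero, no signed overflow).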

Branch on Condition

To branch only if certain condition flags are set or clear:

B.{condition} label
where {condition} is a condition code such as EQ (equal), NE (not equal), or GE, LT, GT, and LE (signed comparisons).
For ex.,

B.EQ _start
will branch to _start if the Z flag is set.
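
As a sketch of how a condition code maps onto the flags, the C function below evaluates the six codes used in these notes. The name cond_holds and the string-based interface are made up for illustration; the unsigned codes (which look at C) are left out.

```c
#include <string.h>

/* Returns 1 if the branch B.<cond> would be taken for the given flags.
 * Only the six codes used in these notes are handled; anything else returns 0. */
int cond_holds(const char *cond, int n, int z, int c, int v)
{
    (void)c;  /* the carry flag matters only for the unsigned codes, omitted here */
    if (strcmp(cond, "EQ") == 0) return z;
    if (strcmp(cond, "NE") == 0) return !z;
    if (strcmp(cond, "GE") == 0) return n == v;
    if (strcmp(cond, "LT") == 0) return n != v;
    if (strcmp(cond, "GT") == 0) return !z && n == v;
    if (strcmp(cond, "LE") == 0) return z || n != v;
    return 0;
}
```

After comparing 3 with 5 (flags N=1, Z=0, C=0, V=0), LT and LE hold while GE, GT, and EQ do not.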

About the CMP Instruction


CMP Xn, Operand2
This instruction compares the contents of register Xn with Operand2.
This instruction is equivalent to

SUBS XZR, Xn, Operand2
The status flags will be updated accordingly. For example, to do a branch only if register W4 is 45,

CMP W4, #45
B.EQ _start

Loops

Loops can be constructed with branch and comparison instructions.

FOR Loops


FOR I = 1 to 10
     ... some statements...
The above can be implemented:

      MOV W2, #1     // W2 holds I
loop: // body of the loop goes here.
      // Most of the logic is at the end
      ADD W2, W2, #1 // I = I + 1
      CMP W2, #10
      B.LE loop      // IF I <= 10 goto loop
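
The same shape can be written in C with a goto, which makes it easier to see that the test sits at the bottom and the body always runs at least once. This is only a sketch; the summing body and the name sum_1_to_10 are invented for the example.

```c
/* Sums 1..10 using the same shape as the assembly: body first, test at the bottom. */
int sum_1_to_10(void)
{
    int sum = 0;
    int i = 1;          /* MOV W2, #1 */
loop:
    sum += i;           /* body of the loop */
    i = i + 1;          /* ADD W2, W2, #1 */
    if (i <= 10)        /* CMP W2, #10 ; B.LE loop */
        goto loop;
    return sum;
}
```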

While Loop


// WHILE X < 5
//      ... other statements ....
// END WHILE
// W4 is X and has been initialized
loop: CMP  W4, #5
      B.GE loopdone
      // ... other statements in the loop body ...
      B    loop
loopdone: // program continues

If/Then/Else

For ex,

IF W5 < 10 THEN
     .... if statements ...
ELSE
     ... else statements ...
END IF
Implement:

     CMP W5, #10
     B.GE elseclause
     ... if statements ...
     B endif
elseclause:
     ... else statements ...
endif:     // continue on after the if/then/else ...

Logical Operators

The ARM’s logical operators manipulate the bits in the registers.

AND{S}  Xd, Xs, Operand2
EOR{S}  Xd, Xs, Operand2
ORR{S}  Xd, Xs, Operand2
BIC{S}   Xd, Xs, Operand2

AND

AND performs a bitwise logical and operation between each bit in Xs and Operand2, putting the result in Xd.
For ex., if we only want the high-order byte of a register

     AND   W6, W6, #0xFF000000
     // shift the byte down to the
     // low order position.
     LSR   W6, W6, #24

EOR

EOR performs a bitwise exclusive or operation between each bit in Xs and Operand2, putting the result in Xd.

ORR

ORR performs a bitwise logical or operation between each bit in Xs and Operand2, putting the result in Xd.
For ex., set the low-order byte of X6 to all 1 bits (0xFF) while leaving the seven other bytes unaffected.

ORR   X6, X6, #0xFF

BIC

BIC (bit clear) performs Xs AND NOT Operand2.
The reason this is called bit clear is that
  • if the bit in Operand2 is 1, then the resulting bit will be 0.
  • For ex., This clears the low-order byte of X6, while leaving the other seven bytes unaffected
    
    BIC   X6, X6, #0xFF    
        
  • if the bit in Operand2 is 0, then the corresponding bit in Xs will be put in the result Xd.
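
For comparison, the four logical instructions correspond to the C bitwise operators &, |, ^, and & ~. A minimal sketch mirroring the examples above (the function names are made up):

```c
#include <stdint.h>

/* AND then LSR: isolate the high-order byte of a 32-bit value. */
uint32_t high_byte(uint32_t w)        { return (w & 0xFF000000u) >> 24; }

/* ORR: set the low-order byte to all 1 bits. */
uint64_t set_low_byte(uint64_t x)     { return x | 0xFFu; }

/* BIC: clear every bit that is 1 in the mask (x AND NOT mask). */
uint64_t clear_low_byte(uint64_t x)   { return x & ~(uint64_t)0xFF; }

/* EOR: flip every bit that is 1 in the mask. */
uint64_t toggle_low_byte(uint64_t x)  { return x ^ 0xFFu; }
```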

Design Patterns

If you adopt a few standard design patterns for how to perform loops and other programming constructs, it will make reading your programs much easier.

Converting Integers to ASCII

Pseudo-code to print a register:

outstr = memory where we want the string + 17
// (string is of the form 0x123456789ABCDEF0 and we want
// the last character)
FOR W5 = 16 TO 1 STEP -1
      digit = X4 AND 0xf
      IF digit < 10 THEN
           asciichar = digit + '0'
      ELSE
           asciichar = digit + 'A' - 10
      END IF
      *outstr = asciichar
      outstr = outstr - 1
NEXT W5
printdword.s:

//
// Assembler program to print a register in hex
// to stdout.
//
// X0-X2 - parameters to linux function services
// X1 - is also address of byte we are writing
// X4 - register to print
// W5 - loop index
// W6 - current character
// X8 - linux function number
//
.global _start      // Provide program starting address
_start: MOV   X4, #0x6E3A
        MOVK  X4, #0x4F5D, LSL #16
        MOVK  X4, #0xFEDC, LSL #32
        MOVK  X4, #0x1234, LSL #48

        LDR   X1, =hexstr    // start of string
        ADD   X1, X1, #17    // start at least sig digit
// The loop is FOR W5 = 16 TO 1 STEP -1
        MOV   W5, #16        // 16 digits to print
loop:   AND   W6, W4, #0xf   // mask off least sig digit
// If W6 >= 10 then goto letter
        CMP   W6, #10        // is 0-9 or A-F
        B.GE  letter
// Else it's a number so convert to an ASCII digit
        ADD   W6, W6, #'0'
        B     cont           // goto end if
letter: // handle the digits A to F
        ADD   W6, W6, #('A'-10)
cont:   // end if
        STRB  W6, [X1]       // store ascii digit
        SUB   X1, X1, #1     // decrement address for next digit
        LSR   X4, X4, #4     // shift off the digit
        // next W5
        SUBS  W5, W5, #1     // step W5 by -1
        B.NE  loop           // another for loop if not done
// Set up the parameters to print our hex number
// and then call Linux to do it.
        MOV   X0, #1         // 1 = StdOut
        LDR   X1, =hexstr    // string to print
        MOV   X2, #19        // length of our string
        MOV   X8, #64        // linux write system call
        SVC   0              // Call linux to output the string
// Set up the parameters to exit the program
// and then call Linux to do it.
        MOV   X0, #0         // Use 0 return code
        MOV   X8, #93        // Service code 93 terminates
        SVC   0              // Call linux to terminate
.data
hexstr: .ascii  "0x123456789ABCDEFG\n"
Compile and execute the program:

$ as  printdword.s -o printdword.o
$ ld -o printdword printdword.o
$ ./printdword
0x1234FEDC4F5D6E3A
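
For reference, the same digit-by-digit conversion in C. This is a sketch of the algorithm only; the name to_hex64 and the 19-byte buffer convention are assumptions made for the example.

```c
#include <stdint.h>

/* Writes "0x" plus 16 hex digits and a terminating NUL into out (19 bytes). */
void to_hex64(uint64_t value, char out[19])
{
    out[0] = '0';
    out[1] = 'x';
    out[18] = '\0';
    for (int i = 17; i >= 2; i--) {           /* fill from the least significant digit */
        unsigned digit = (unsigned)(value & 0xF);
        out[i] = (char)(digit < 10 ? '0' + digit : 'A' + (digit - 10));
        value >>= 4;                          /* shift off the digit just written */
    }
}
```

As in the assembly version, the buffer is filled from its last digit position (offset 17) backward, so the most significant digit lands right after the "0x" prefix.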

Using Expressions in Immediate Constants


ADD   W6, W6, #('A'-10)

Storing a Register to Memory


STRB W6, [X1]
The store byte (STRB) instruction saves the low-order byte of the first register into the memory location contained in X1.
The syntax [X1] is to make clear that we are using memory indirection, and not just putting the byte into register X1.

Why Not Print in Decimal

Performance of Branch Instructions

If you put a lot of branches in your code, you suffer a performance penalty.

More Comparison Instructions

Summary

Chapter 5: Thanks for the Memories

  • how to define data in memory
  • how to load memory into registers for processing
  • how to write the results back to memory
Memory addresses are 64 bits wide, while instructions are 32 bits wide.

Defining Memory Contents

The GNU Assembler contains several directives to help you define memory in a .data section of your program.
Some sample memory directives:

label:
       .byte 74, 0112, 0b00101010, 0x4A, 0X4a, 'J', 'H' + 2
       .word 0x1234ABCD, -1434
       .quad 0x123456789ABCDEF0
       .ascii      "Hello World\n"
The .byte statement defines 1 or more bytes of memory.
The list of memory definition Assembler directives,

Aligning Data

These data directives put the data in memory contiguously byte by byte.
We can instruct the Assembler to align the next piece of data with an .align directive.
For ex.,

.data
     .byte    0x3F
     .align   2
     .word   0x12345678
Because the first item is only 1 byte, the word that follows it would not be aligned on its own.
The “.align 2” directive makes it word aligned: on Arm targets the GNU Assembler reads the operand as a power of two, so “.align 2” means a 4-byte boundary, whereas “.align 4” would pad to a 16-byte boundary.
This will result in three wasted bytes.
ARM Assembly instructions must be word aligned.
Usually the Assembler will give you an error when alignment is required, and throwing in an “.align 2” directive is a quick fix.
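
The number of wasted bytes can be worked out from the current offset and the alignment in bytes. A small C sketch (padding_for is a hypothetical helper, and the alignment is assumed to be a power of two):

```c
#include <stddef.h>

/* Bytes of padding needed to move offset up to the next multiple of
 * alignment, which must be a power of two. */
size_t padding_for(size_t offset, size_t alignment)
{
    return (alignment - (offset & (alignment - 1))) & (alignment - 1);
}
```

With one byte already emitted, a 4-byte boundary needs 3 bytes of padding and a 16-byte boundary needs 15.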

Loading a Register with an Address

PC Relative Addressing

Addresses can be represented as a register-relative or PC-relative expression.
  • A register-relative expression evaluates to a named register combined with a numeric expression.
  • A PC-relative expression is written in source code as the PC or a label combined with a numeric expression.
  • For PC relative addressing, it really becomes addressing relative to the current instruction.
    It can be expressed in the form:
    
    [PC, #number]
      
    The assembler calculates the required offset from the label and the address of the current instruction.
    It is recommended to write PC-relative expressions using labels rather than PC because the value of PC depends on the instruction set.
    
            LDR     r4,=data+4*n    ; n is an assembly-time variable
            ; code
            MOV     pc,lr
    data    DCD     value_0
            ; n-1 DCD directives
            DCD     value_n         ; data+4*n points here
            ; more DCD directives  
            
    A simpler ex.,
    
    LDR   X1, =helloworld        
            
    to load the address of our helloworld string into X1.
    The Assembler knows the value of the program counter at this point, so it can provide an offset to the correct memory address.

Loading Data from Memory

The simple form of LDR to load data given an address is

LDR{type}   Xt, [Xa]
where type is one of the types:
  • B
  • Unsigned byte
  • SB
  • signed byte
  • H
  • Unsigned halfword (16 bits)
  • SH
  • signed halfword (16 bits)
  • SW
  • signed word
The typical usage is to load an address into a register and then use that address to load the data we want,

// load the address of mynumber into X1
      LDR   X1, =mynumber
// load the 64-bit value stored at mynumber into X2
      LDR   X2,[X1]
      
.data
mynumber:   .QUAD 0x123456789ABCDEF0
This loads 0x123456789ABCDEF0 into X2.

Note the square bracket syntax represents indirect memory access.
This means load the data stored at the address pointed to by X1, not move the contents of X1 into X2.
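
The difference between the unsigned and signed load types is how the upper bits of the 64-bit register are filled. The C sketch below imitates each variant; the load_* names are invented, and memcpy is used so the functions work for any address.

```c
#include <stdint.h>
#include <string.h>

/* Each function mirrors one load type: the narrow value is read from memory
 * and then zero-extended (unsigned) or sign-extended (signed) to 64 bits. */
uint64_t load_b(const void *p)  { uint8_t  v; memcpy(&v, p, sizeof v); return v; } /* B  */
int64_t  load_sb(const void *p) { int8_t   v; memcpy(&v, p, sizeof v); return v; } /* SB */
uint64_t load_h(const void *p)  { uint16_t v; memcpy(&v, p, sizeof v); return v; } /* H  */
int64_t  load_sh(const void *p) { int16_t  v; memcpy(&v, p, sizeof v); return v; } /* SH */
int64_t  load_sw(const void *p) { int32_t  v; memcpy(&v, p, sizeof v); return v; } /* SW */
```

A byte holding 0xF0 loads as 240 with the B type but as -16 with the SB type.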

Indexing Through Memory

The ARM instruction set gives us support for the array indexing operation.
Suppose we have an array of 10 words (4 bytes each) defined:

arr1:   .FILL   10, 4, 0

      LDR    X1, =arr1                   // load the array’s address
      // Load the first element
      LDR    W2, [X1]
      // Load element 3
      // The elements count from 0, so 2 is
      // the third one. Each word is 4 bytes,
      // so we need to multiply by 4
      LDR    W2, [X1, #(2 * 4)]
Using a register as an offset

// The 3rd element is still number 2
      MOV   X3, #(2 * 4)
// Add the offset in X3 to X1 to get our element.
      LDR   W2, [X1, X3]
If X1 points to the end of the array, we can index in reverse with negative offsets

LDR   W2, [X1, #-(2 * 4)]
MOV   X3, #(-2 * 4)
LDR   W2, [X1, X3]
Post-Indexed Addressing:

// Load X1 with the memory pointed to by X2
// Then do X2 = X2 + 2
   LDR   X1, [X2], #2
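
In C, the same addressing modes show up as array indexing and pointer post-increment. A minimal sketch (both function names are made up for illustration):

```c
#include <stdint.h>

/* LDR W2, [X1, #(2 * 4)]: byte offset 8 selects element 2 of a word array. */
int32_t third_element(const int32_t *arr)
{
    return arr[2];              /* the compiler scales the index by sizeof(int32_t) */
}

/* Post-indexed load of a byte: read through the pointer, then advance it by 1,
 * as LDRB Wt, [Xn], #1 does. */
uint8_t load_and_advance(const uint8_t **pp)
{
    return *(*pp)++;
}
```

Note that in assembly the offset is always in bytes, so the multiplication by the element size is written out explicitly, whereas C hides it in the pointer type.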

An Example Converting to Upper-Case

Pseudo-code:

i= 0
DO
    char = inStr[i]
    IF char >= 'a' AND char <= 'z' THEN
          char = char - ('a' - 'A')
    END IF
    outStr[i] = char
    i=i+ 1
UNTIL char == 0
PRINT outStr
In this ex., NULL-terminated strings are used; the input string is not changed, and a new output string holding the upper-case version of the input is generated.
upper.s:

//
// X0-X2 - parameters to Linux function services
// X3 - address of output string
// X4 - address of input string
// W5 - current character being processed
// X8 - linux function number
//
.global _start // Provide program starting address to linker
_start: LDR   X4, =instr      // start of input string
        LDR   X3, =outstr     // address of output string
// The loop is until the byte pointed to by X4 is zero
loop:   LDRB  W5, [X4], #1    // load character and incr pointer
// If W5 > 'z' then goto cont
        CMP   W5, #'z'        // is letter > 'z'?
        B.GT  cont
// Else if W5 < 'a' then goto end if
        CMP   W5, #'a'
        B.LT  cont            // goto end if
// if we got here then the letter is lower case, so convert it.
        SUB   W5, W5, #('a'-'A')
cont:   // end if
        STRB  W5, [X3], #1    // store character to output str
        CMP   W5, #0          // stop on hitting a null character
        B.NE  loop            // loop if character isn't null
// Set up the parameters to print our string
// and then call Linux to do it.
        MOV   X0, #1          // 1 = StdOut
        LDR   X1, =outstr     // string to print
        SUB   X2, X3, X1      // get the len by sub'ing the pointers
        MOV   X8, #64         // Linux write system call
        SVC   0               // Call Linux to output the string
// Set up the parameters to exit the program
// and then call Linux to do it.
        MOV   X0, #0          // Use 0 return code
        MOV   X8, #93         // Service code 93 terminates
        SVC   0               // Call Linux to terminate the program
.data
instr:  .asciz "This is our Test String that we will convert.\n"
outstr: .fill 255, 1, 0
Compile and run the program,

$ as   upper.s -o upper.o
$ ld -o upper upper.o
$ ./upper
THIS IS OUR TEST STRING THAT WE WILL CONVERT.
LDR and STR just load and save; they don’t have functionality to examine what they are loading or saving, so they can’t set the condition flags, hence the need for the CMP instruction in the UNTIL part of the loop to test for NULL.
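
The pseudo-code translates almost line for line into C. The sketch below assumes the caller supplies an output buffer at least as large as the input string including its terminator; the name to_upper_copy is made up.

```c
#include <stddef.h>

/* Copies in to out, upper-casing ASCII letters; returns the number of bytes
 * written, including the terminating NUL, as the assembly version does. */
size_t to_upper_copy(const char *in, char *out)
{
    size_t i = 0;
    char ch;
    do {
        ch = in[i];
        if (ch >= 'a' && ch <= 'z')
            ch = (char)(ch - ('a' - 'A'));
        out[i] = ch;
        i++;
    } while (ch != 0);          /* UNTIL char == 0 */
    return i;
}
```

The returned count plays the role of the SUB X2, X3, X1 pointer subtraction used to get the length for the write call.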

Storing a Register

The STR instruction is a mirror of the LDR instruction.

Double Registers

There are register-pair versions of the LDR and STR instructions: LDP and STP, which load and store two registers at once.
For example, the following loads the address of a 128-bit quantity (the address is still 64 bits), then loads the 128 bits into X2 and X3, and finally stores X2 and X3 back into myoctaword:

      LDR   X1, =myoctaword
      LDP   X2, X3, [X1]
      STP   X2, X3, [X1]
.data
myoctaword: .OCTA 0x12345678876543211234567887654321
These instructions are used extensively when we need to save registers to the stack and later restore them.

Summary

Chapter 6: Functions and the Stack

Stacks on Linux

Branch with Link

Nesting Function Calls

Function Parameters and Return Values

Managing the Registers

Summary of the Function Call Algorithm

Upper-Case Revisited

Stack Frames

Stack Frame Example

Macros

Include Directive

Macro Definition

Labels

Why Macros

Macros to Improve Code

Summary

Chapter 7: Linux Operating System Services

So Many Services

Calling Convention

Linux System Call Numbers

Return Codes

Structures

Wrappers

Converting a File to Upper-Case

Building .S Files

Opening a File

Error Checking

Looping

Summary

Chapter 8: Programming GPIO Pins

We can program the GPIO pins in two ways:
  • by using the Linux device driver
  • by accessing the GPIO controller’s registers directly

GPIO Overview

On the Raspberry Pi, pins 3, 5, 7–8, 10–13, 15, 16, 18, 19, 21–24, and 26 are programmable general-purpose pins.

In Linux, Everything Is a File

Flashing LEDs

Moving Closer to the Metal

Virtual Memory

In Devices, Everything Is Memory

Registers in Bits

GPIO Function Select Registers

GPIO Output Set and Clear Registers

More Flashing LEDs

Root Access

Table Driven

Setting Pin Direction

Setting and Clearing Pins

Summary

Chapter 9: Interacting with C and Python

Calling C Routines

Printing Debug Information

Adding with Carry Revisited

Calling Assembly Routines from C

Packaging Our Code

Static Library

Shared Library

Embedding Assembly Code Inside C Code

Calling Assembly from Python

Summary

Chapter 10: Interfacing with Kotlin and Swift

Chapter 11: Multiply, Divide, and Accumulate

Chapter 12: Floating-Point Operations

Chapter 13: Neon Coprocessor

Chapter 14: Optimizing Code

Chapter 15: Reading and Understanding Code

Chapter 16: Hacking Code

Appendix A: The ARM Instruction Set

Appendix B: Binary Formats

Appendix C: Assembler Directives

Appendix D: ASCII Character Set


ARM (32-bits) assembler in Raspberry Pi

1 Introduction

2 Registers and basic arithmetic

3 Memory, addresses. Load and store.

4 GDB

5 Branches

6 Control structures

7 Indexing modes

8 Arrays and structures and more indexing modes.

9 Functions (I)

10 Functions (II). The stack

11 Predication

12 Loops and the status register

13 Floating point numbers

14 Matrix multiply

15 Integer division

16 Switch control structure

17 Passing data to functions

18 Local data and the frame pointer

19 The operating system

20 Indirect calls

21 Subword data

22 The Thumb instruction set

23 Nested functions

24 Trampolines

25 Integer SIMD

26 A primer about linking

27 Dynamic linking


Introduction to Computer Organization: ARM Assembly Language Using the Raspberry Pi

Robert G. Plantz

Chapter 1 Introduction

This book begins with the fundamental high-level language concepts and “looks under the hood” to see how they are implemented at the assembly language level.

There are many challenging opportunities in programming embedded systems, and much of the work in this area demands at least an understanding of the ISA (instruction set architecture).

1.1 Efficient Use of This Book

1.2 Computer Subsystems

The von Neumann architecture: both the program instructions and data are stored in a memory unit that is separate from the processing unit.
We will focus on how the program and data are stored in memory and how the CPU executes instructions.

1.3 How the Subsystems Interact

The buses shown here are logical groupings of the signals that must pass between the three subsystems.
For example, the PCI bus standard uses the same physical pathway for the address and the data, but at different times.
Control signals indicate whether there is an address or data on the lines at any given time.

If the CPU is instructed to store data in memory, it places the data on the data bus, places the location in memory where the data is to be stored on the address bus, and places a “write” signal on the control bus. The memory subsystem responds by copying the data on the data bus into the specified memory location.

1.4 Setting Up Your Raspberry Pi

Install the binutils-doc package to get full documentation for the GNU assembler, as.

Chapter 2 Data Storage Formats

2.1 Bits and Groups of Bits

2.2 Exercises

2.3 Mathematical Equivalence of Binary and Decimal

2.4 Exercises

2.5 Unsigned Decimal to Binary Conversion

2.6 Exercises

2.7 Memory

2.8 Exercises

2.9 Using C Programs to Explore Data Formats

2.10 Programming Exercises

2.11 Examining Memory With a Debugger


/* intAndFloat.c
 * Using printf to display an integer and a float.
 * 2017-09-29: Bob Plantz
 */
#include <stdio.h>

int main(void)
{
  int anInt = 19088743;
  float aFloat = 19088.743;

  printf("The integer is %d and the float is %f\n", anInt, aFloat);

  return 0;
}
Build the example, then run gdb:

$ gcc -g -Wall -o intAndFloat intAndFloat.c

$ gdb ./intAndFloat
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
...

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./intAndFloat...
(gdb) 

gdb has a large number of commands.
The few here will be sufficient to get you started:
  • li LineNumber
  • List ten lines of the source code, centered at the line number specified by LineNumber.
    
    (gdb) li
    1	/* intAndFloat.c
    2	 * Using printf to display an integer and a float.
    3	 * 2017-09-29: Bob Plantz
    4	 */
    5	#include <stdio.h>
    6	
    7	int main(void)
    8	{
    9	  int anInt = 19088743;
    10	  float aFloat = 19088.743;     
    (gdb) 
    11	
    12	  printf("The integer is %d and the float is %f\n", anInt, aFloat);
    13	
    14	  return 0;
    15	}
    16
            
    Simply pushing the return key will repeat the previous command, and li is smart enough to display the next (up to) ten lines.
  • br source-filename:line-number
  • Set a breakpoint at the specified line-number in the source file, source-filename.
    Control will return to gdb when the line number is encountered.
    
    (gdb) br 12
    Breakpoint 1 at 0x798: file intAndFloat.c, line 12.
    	
    Here a breakpoint is set at line 12.
    Execution will pause before the statement is executed.
  • r
  • Begin execution of a program that has been loaded under control of gdb.
    
    (gdb) r
    Starting program: /home/pi/intAndFloat 
    
    Breakpoint 1, main () at intAndFloat.c:12
    12	  printf("The integer is %d and the float is %f\n", anInt, aFloat);    
        
    The run command causes the program to start execution from the beginning.
  • print Expression
  • Evaluate Expression and print the value.
    
    (gdb) print anInt
    $1 = 19088743
    (gdb) print aFloat
    $2 = 19088.7422
    (gdb) printf "anInt = %i and aFloat = %f\n", anInt, aFloat
    anInt = 19088743 and aFloat = 19088.742188
      	
  • help command
  • Help on how to use command.
    
    (gdb) help x
    Examine memory: x/FMT ADDRESS.
    ADDRESS is an expression for the memory address to examine.
    FMT is a repeat count followed by a format letter and a size letter.
    Format letters are o(octal), x(hex), d(decimal), u(unsigned decimal),
      t(binary), f(float), a(address), i(instruction), c(char), s(string)
      and z(hex, zero padded on the left).
    Size letters are b(byte), h(halfword), w(word), g(giant, 8 bytes).
    The specified number of objects of the specified size are printed
    according to the format.  If a negative number is specified, memory is
    examined backward from the address.
    
    Defaults for format and size letters are those previously used.
    Default count is 1.  Default address is following last thing printed
    with this command or "print".
    
  • x/FMT MemoryAddress
  • Display (examine) memory starting at MemoryAddress; FMT gives the repeat count, the format letter, and the size letter. To get the actual memory addresses of variables:
    
    (gdb) print &anInt
    $3 = (int *) 0x7ffffff3dc
    (gdb) print &aFloat
    $4 = (float *) 0x7ffffff3d8
    
    (gdb) x/1dw 0x7ffffff3dc
    0x7ffffff3dc:	19088743
    (gdb) x/1fw 0x7ffffff3d8
    0x7ffffff3d8:	19088.7422
    (gdb) x/1xw 0x7ffffff3dc
    0x7ffffff3dc:	0x01234567
    (gdb) x/4xb 0x7ffffff3dc
    0x7ffffff3dc:	0x67	0x45	0x23	0x01
      
  • cont
  • Continue program execution from the current location.
  • i r
  • Show the contents of the registers (“info registers”).
  • printf "format", var1, var2,…
  • Display the values of var1, var2,….
    The "format" string follows the same rules as the printf in the C Standard Library.
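
The x/4xb output above lists the bytes of anInt least significant first, which is what a little-endian machine stores. The same observation can be made from C; this sketch assumes a little-endian host, and bytes_in_memory is a made-up helper name.

```c
#include <stdint.h>
#include <string.h>

/* Copies the four bytes of value, in memory order, into bytes[]. */
void bytes_in_memory(int32_t value, uint8_t bytes[4])
{
    memcpy(bytes, &value, 4);
}
```

For 19088743 (0x01234567) on a little-endian host, bytes[0] through bytes[3] come out as 0x67, 0x45, 0x23, 0x01, matching the gdb display.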

2.12 Programming Exercise

2.13 Storing Characters

2.14 Programming Exercise

2.15 Low-level Character Handling

2.16 Programming Exercises

2.17 Accessing the GPIO in C

Chapter 3 Computer Arithmetic

3.1 Addition and Subtraction

3.2 Exercises

3.3 Arithmetic Errors—Unsigned Integers

Use four-bit values to simplify the discussion.
Consider the addition of the two unsigned integers 2 and 4, followed by 4 + 14 and 4 - 14:

  0010      0100      0100
+ 0100    + 1110    - 1110
------    ------    ------
  0110      0010      0110

Carry=0   Carry=1   Carry=1
These four-bit arithmetic examples generalize to any size arithmetic performed by the computer.
When adding or subtracting two unsigned integers, the result is arithmetically correct if and only if the carry condition flag (C) is set to zero.
The C flag in the CPSR register is always set to the appropriate value, 0 or 1, each time an addition or subtraction is performed by the CPU.
In particular, the CPU will not ignore the C flag when there is no carry; it will actively set it to zero.
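
The two four-bit additions can be reproduced in C by masking to four bits and treating bit 4 of the sum as the carry. A minimal sketch (add4 is a hypothetical helper):

```c
/* Adds two 4-bit unsigned values; returns the low four bits of the sum and
 * stores the carry out of bit 3 in *carry. */
unsigned add4(unsigned a, unsigned b, unsigned *carry)
{
    unsigned sum = (a & 0xFu) + (b & 0xFu);
    *carry = (sum >> 4) & 1u;
    return sum & 0xFu;
}
```

Adding 2 and 4 gives 6 with no carry; adding 4 and 14 gives 2 with a carry, so the four-bit result is not the arithmetically correct 18.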

3.4 Signed Integers

3.5 Exercises

3.6 Arithmetic Errors—Signed Integers

The number of bits used to represent a value is determined at the time a program is written.
The flags register, CPSR, provides a bit, the overflow condition flag, V, for detecting whether the sum of two n-bit, signed numbers stored in the two's complement code has exceeded the range allocated for it.

  1            <-- penultimate carry
  0001 0101
+ 0110 1111
-----------
  1000 0100
  Carry=0
The V flag is equal to the exclusive or of carry and penultimate carry:

V = C  ^ penultimate carry
where ‘^’ is the exclusive or operator.
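
The rule can be checked in C for 8-bit values by computing the carry out of bit 7 and the carry into bit 7 separately. This is a sketch; the function name add8_flags and its output parameters are made up for illustration.

```c
#include <stdint.h>

/* Adds two 8-bit values and reports the carry out of bit 7 (C), the carry
 * into bit 7 (the penultimate carry), and V as their exclusive or. */
void add8_flags(uint8_t a, uint8_t b,
                unsigned *c, unsigned *penultimate, unsigned *v)
{
    unsigned sum  = (unsigned)a + b;               /* up to 9 significant bits */
    unsigned low7 = (a & 0x7Fu) + (b & 0x7Fu);     /* add without the sign bits */
    *c = (sum >> 8) & 1u;
    *penultimate = (low7 >> 7) & 1u;
    *v = *c ^ *penultimate;
}
```

For the example above (0001 0101 + 0110 1111), C is 0 and the penultimate carry is 1, so V is 1: two positive numbers produced a result with the sign bit set.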

The CPU does not consider integers as either signed or unsigned.

  • If your algorithm treats the result as unsigned
  • the carry condition flag (C) is zero if and only if the result is within the n-bit range; V is irrelevant.
  • If your algorithm treats the result as signed
  • the overflow condition flag (V) is zero if and only if the result is within the n-bit range; C is irrelevant.
Both C and V are set according to the rules of binary arithmetic by each arithmetic operation.
After each addition or subtraction operation the program should check the state of C for unsigned integers or V for signed integers and at least indicate when the sum is in error.

3.7 Exercises

Chapter 4 Basic Data Types

4.1 C/C++ Basic Data Types

4.2 Hexadecimal to Integer Conversion

4.3 Programming Exercise

4.4 Bitwise Logical Operations

4.5 Programming Exercise

4.6 Other Codes

Chapter 5 Boolean Algebra

5.1 Boolean Algebra Operations

5.2 Exercises

5.3 Canonical (Standard) Forms

5.4 Exercise

5.5 Boolean Function Minimization

Chapter 6 Logic Gates

6.1 Crash Course in Electronics

6.2 CMOS Transistors

6.3 NAND and NOR Gates

6.4 Exercise

Chapter 7 Logic Circuits

7.1 Combinational Logic Circuits

7.2 Programmable Logic Devices

7.3 Sequential Logic Circuits

7.4 Designing Sequential Circuits

7.5 Memory Organization

Chapter 8 Central Processing Unit

ARM CPUs used in different Raspberry Pi models.
The 64-bit ARM processor in the Raspberry Pi 3 B can be run in either AARCH32 (32-bit) or AARCH64 (64-bit) state.

8.1 Overview

CPU block diagram. The CPU communicates with the Memory and I/O subsystems via the Address, Data, and Control buses.
  • Program Counter
  • contains the address of the next instruction to be executed. (Also called an Instruction Pointer.)
  • L1 Cache Memory
  • Very fast memory on the CPU chip.
    Many modern CPUs use two L1 cache memories organized in a Harvard architecture—one for instructions, the other for data. (See Section 1.2.) Its use is generally transparent to an applications programmer.
  • Instruction Register
  • Contains the instruction that is currently being executed.
  • Control Unit
  • Controls the activities of all the units in the CPU.
  • Register
  • A named group of several bytes of memory within the CPU.
  • Arithmetic Logic Unit (ALU)
  • Bus Interface
  • The means for the CPU to communicate with the rest of the computer system—memory and I/O devices.
    It contains circuitry to place addresses on the address bus, read and write data on the data bus, and read and write signals on the control bus.
    The Bus Interface on many CPUs interfaces with external bus control units that in turn interface with memory and with different types of I/O buses, e.g., Serial ATA, PCI-E, USB, etc.
  • Condition Flags
  • Bits in a status register that show results of many operations performed by the ALU.

8.2 CPU Registers

A portion of the memory in the CPU is organized into registers. Machine instructions access CPU registers by their addresses.

The registers are in the CPU, and the assembler has predefined names for them.
Applications programmers have access to 16 integer registers in the AARCH32 (32-bit) state, r0–r15.
The names of the registers and their usage in AARCH32 state are summarized in the following table.

  
Register    Register
Name        Number      Usage
---------------------------------------
r0–r10      0–10        General Purpose
r11 or fp   11          Frame Pointer
r12 or ip   12          Intraprocess scratch
r13 or sp   13          Stack Pointer
r14 or lr   14          Link Register
r15 or pc   15          Program Counter

In AARCH64 (64-bit) state applications programmers have access to 31 general-purpose integer registers.

  
Full 64-bit         Low 32-bit      Register
Register Name       Register Name   Number    Usage
-------------------------------------------------------------
r0–r30 or x0–x30    w0–w30          0–30      General Purpose
sp                  wsp             31        Stack Pointer
xzr                 wzr             virtual   Zero Register
Using wn, where n = 0, 1, …, 30, refers to the low-order 32-bit portion of the register.

If an instruction reads these 32 bits from the register, bits 63–32 are ignored, and if an instruction writes to the 32 bits, bits 63–32 are set to zero.
Many instructions can access one byte in a register, which consists of the bits 7–0 in the specified register. And accessing two bytes at a time works on bits 15–0 in the specified register. This is specified in the instruction, not in the register name.
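
The read and write rules for the wn names can be modeled in C with integer casts. This is only a model of the rule stated above; write_w and read_w are made-up names.

```c
#include <stdint.h>

/* Writing wn: the low 32 bits are replaced and bits 63-32 become zero. */
uint64_t write_w(uint64_t old_x, uint32_t new_w)
{
    (void)old_x;                 /* nothing of the old 64-bit value survives */
    return (uint64_t)new_w;
}

/* Reading wn: bits 63-32 are ignored. */
uint32_t read_w(uint64_t x)
{
    return (uint32_t)x;
}
```

So writing 0x12345678 through the 32-bit name of a register that held all 1 bits leaves 0x0000000012345678 in the full 64-bit register.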

8.3 CPU Interaction with Memory

To store one byte, 0xcd, at location 0x7efff174, the control unit
  1. places 0x7efff174 on the address bus
  2. places 0xcd on the data bus, and then
  3. places a “write” signal on the control bus.

8.4 Program Execution in the CPU

The CPU is programmed via the instruction register — whose bit pattern determines what the CPU will do.
Once that action has been completed, the bit pattern in the instruction register can be changed, and the CPU will perform the operation specified by this next bit pattern.

Most modern CPUs use an instruction queue.
Several instructions are waiting in the queue, ready to be executed.
Since instructions are simply bit patterns, they can be stored in memory.
The instruction pointer register always has the memory address of (points to) the next instruction to be executed.
In order for the control unit to execute this instruction, it is copied into the instruction register.

The scenario is:

  1. A sequence of instructions is stored in memory
  2. The memory address where the first instruction is located is copied to the program counter
  3. The CPU sends the address in the program counter to memory via the address bus.
  4. Memory responds by sending a copy of the state of the bits at that memory location on the data bus, which the CPU then copies into its instruction register.
  5. The instruction pointer is automatically incremented to contain the address of the next instruction in memory.
  6. The CPU executes the instruction in the instruction register.
  7. Go to step 3.
Steps 3, 4, and 5 are called an instruction fetch.
Steps 3–7 make up a cycle, the instruction execution cycle.
The wfi (“wait for interrupt”) instruction places the CPU in an idle state, where it remains until an I/O device sends an interrupt signal to the CPU.
For now, it is enough to understand that the wfi instruction stops the instruction execution cycle.
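
The numbered steps can be sketched as a loop in C. The two-opcode machine below is invented purely for illustration and has nothing to do with the real ARM instruction set; OP_HALT stands in for an instruction that stops the cycle.

```c
#include <stdint.h>

/* Hypothetical two-opcode machine, invented for illustration only:
 * OP_ADD adds its operand to the accumulator, OP_HALT stops the cycle. */
enum { OP_HALT = 0, OP_ADD = 1 };

struct instr { uint8_t opcode; int32_t operand; };

int32_t run(const struct instr *memory)
{
    int32_t acc = 0;
    unsigned pc = 0;                     /* step 2: address of the first instruction */
    for (;;) {
        struct instr ir = memory[pc];    /* steps 3-4: fetch into the instruction register */
        pc = pc + 1;                     /* step 5: point at the next instruction */
        if (ir.opcode == OP_HALT)        /* step 6: execute */
            return acc;
        if (ir.opcode == OP_ADD)
            acc += ir.operand;
    }                                    /* step 7: go back to the fetch */
}
```

Note that the program counter is incremented before the instruction executes, which is why it always points to the next instruction.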

The instructions for a program are stored in a file.
When you indicate to the operating system that you wish to execute a program, the operating system locates a region of memory large enough to hold the instructions in the program, and then copies them from the file to memory.

8.5 Using gdb to View the CPU Registers

We will use the following program to illustrate the use of gdb to view the contents of the CPU registers.

/* gdbExample1.c
 * Subtracts one from user integer.
 * Demonstrate use of gdb to examine registers, etc.
 * 2017-09-29: Bob Plantz
 */

#include <stdio.h>

int main(void)
{
  register int wye;
  int *ptr;
  int ex;

  ptr = &ex;
  ex = 305441741;
  wye = -1;
  printf("Enter an integer: ");
  scanf("%i", ptr);
  wye += *ptr;
  printf("The result is %i\n", wye);

  return 0;
}
Compile the program for gdb debugging:

$ gcc -g -O0 -Wall -o gdbExample1 gdbExample1.c
  • The “-g” option tells the compiler to include debugger information in the executable program.
  • The “-Wall” option causes the compiler to warn you about many constructions that might be a programming error.
The program uses the register storage class modifier to request that the compiler use a CPU register for the wye variable.

Load the program into gdb:

$ gdb ./gdbExample1
    
Some additional commands that will be useful in this section:
  • li (list): lists ten lines of source code centered around the specified line number.
    (gdb) li 11
    6	
    7	#include <stdio.h>
    8	
    9	int main(void)
    10	{
    11	  register int wye;
    12	  int *ptr;
    13	  int ex;
    14	
    15	  ptr = &ex;    
        
  • br (break) and run: set a breakpoint at line 18, then run the program.
    (gdb) br 18
    Breakpoint 1 at 0x10478: file gdbExample1.c, line 18.
    (gdb) run
    Starting program: /home/pi/gdbExample1 
    
    Breakpoint 1, main () at gdbExample1.c:18
    18	  printf("Enter an integer: ");
        
    When line 18 is reached, the program is paused before the statement is executed.
  • Use the print command to view the value of a variable and its address:
    (gdb) print ex
    $1 = 305441741
    (gdb) print &ex
    $2 = (int *) 0x7efff430
    	
  • The help command will provide very brief instructions on using a command.
    (gdb) help x
    Examine memory: x/FMT ADDRESS.
    ADDRESS is an expression for the memory address to examine.
    FMT is a repeat count followed by a format letter and a size letter.
    Format letters are o(octal), x(hex), d(decimal), u(unsigned decimal),
      t(binary), f(float), a(address), i(instruction), c(char), s(string)
      and z(hex, zero padded on the left).
    Size letters are b(byte), h(halfword), w(word), g(giant, 8 bytes).
    The specified number of objects of the specified size are printed
    according to the format.  If a negative number is specified, memory is
    examined backward from the address.
    
    Defaults for format and size letters are those previously used.
    Default count is 1.  Default address is following last thing printed
    with this command or "print".
        
  • Examine memory in different formats:
    (gdb) x/1dw 0x7efff430
    0x7efff430:	305441741
    (gdb) x/1xw 0x7efff430
    0x7efff430:	0x1234abcd
    (gdb) x/4xb 0x7efff430
    0x7efff430:	0xcd	0xab	0x34	0x12
      	
    Note:
    • 0xcd is stored in the byte at address 0x7efff430
    • 0xab is stored in the byte at address 0x7efff431
    • 0x34 is stored in the byte at address 0x7efff432
    • 0x12 is stored in the byte at address 0x7efff433
    This is because the values are stored in little-endian order: the least significant byte is at the lowest address.
  • Examine variables:
    (gdb) print ptr
    $2 = (int *)  0x7efff430
    (gdb) print &ptr
    $3 = (int **) 0x7efff504    
        
    The ptr variable is located at address 0x7efff504 and its content is 0x7efff430, the address of the variable ex.
    It is important that you can distinguish between a memory address and the value that is stored there, which can be another memory address.
  • i r (info registers) displays the current contents of the CPU registers.
    This program requests the compiler to allocate a register for the wye variable.
    
    (gdb) print wye
    $4 = -1
    (gdb) print &wye
    Address requested for identifier "wye" which is in register $r4    
        
    Registers are located in the CPU and do not have memory addresses.
    The i r (info registers) command lists the integer registers and their contents:
    
    (gdb) i r
    r0             0x1                 1
    r1             0x7efff674          2130703988
    r2             0x7efff67c          2130703996
    r3             0x1234abcd          305441741
    r4             0xffffffff          4294967295
    r5             0x0                 0
    r6             0x10368             66408
    r7             0x0                 0
    r8             0x0                 0
    r9             0x0                 0
    r10            0x76fff000          1996484608
    r11            0x7efff514          2130703636
    r12            0x7efff528          2130703656
    sp             0x7efff500          0x7efff500
    lr             0x76e6abe0          1994828768
    pc             0x10478             0x10478 <main+32>
    cpsr           0x60000010          1610612752
    fpscr          0x0                 0    
        
    • The first column is the name of the register.
    • The second shows the current bit pattern in the register, in hexadecimal. Notice that leading zeros are not displayed.
    • The third column shows the register contents as a 32-bit unsigned decimal number; for sp and pc it shows the address in hexadecimal instead.
    The content of r4, 0xffffffff, is the bit pattern for -1, the value stored in the wye variable.
  • n
  • Execute current source code statement of a program that has been running; if it's a call to a function, the entire function is executed.
  • s
  • Execute current source code statement of a program that has been running; if it's a call to a function, step into the function.
  • si
  • Execute current (machine) instruction of a program that has been running; if it's a call to a function, step into the function.

8.6 Programming Exercises

Chapter 9 Programming in Assembly Language

9.1 Program Organization


/* doNothingProg1.c
 * The minimum components of a C program.
 * 2017-09-29: Bob Plantz
 */

int main(void)
{
  return 0;
}
Use the -S command line option to look at the assembly language that the compiler produces:

$ gcc -S -O0 doNothingProg1.c
  • -S
  • causes the compiler to create the .s file, which contains the assembly language equivalent of the source code.
  • -O0
  • tells the compiler not to do any optimization. For instructional purposes, we want to see every step of the assembly language. (This is upper-case “oh” followed by the numeral zero.)
The gcc-generated assembly code is not easy to read:

        .arch armv6
        .eabi_attribute 28, 1
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 2
        .eabi_attribute 30, 6
        .eabi_attribute 34, 1
        .eabi_attribute 18, 4
        .file   "doNothingProg1.c"
        .text
        .align  2
        .global main
        .arch armv6
        .syntax unified
        .arm
        .fpu vfp
        .type   main, %function
main:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 1, uses_anonymous_args = 0
        @ link register save eliminated.
        str     fp, [sp, #-4]!
        add     fp, sp, #0
        mov     r3, #0
        mov     r0, r3
        add     sp, fp, #0
        @ sp needed
        ldr     fp, [sp], #4
        bx      lr
        .size   main, .-main
        .ident  "GCC: (Raspbian 10.2.1-6+rpi1) 10.2.1 20210110"
        .section        .note.GNU-stack,"",%progbits
Use this programmer's version for investigation:

@ doNothingProg2.s
@ Minimum components of a C program, in assembly language.
@ 2017-09-29: Bob Plantz 

@ Define my Raspberry Pi
        .cpu    cortex-a53
        .fpu    neon-fp-armv8
        .syntax unified         @ modern syntax

@ Program code
        .text
        .align  2
        .global main
        .type   main, %function
main:
        str     fp, [sp, -4]!   @ save caller frame pointer
        add     fp, sp, 0       @ establish our frame pointer

        mov     r3, 0           @ return 0;
        mov     r0, r3          @ return values go in r0

        sub     sp, fp, 0       @ restore stack pointer
        ldr     fp, [sp], 4     @ restore caller's frame pointer
        bx      lr              @ back to caller
The assembly language is line-oriented. That is, there is only one assembly language statement on each line, and none of the statements spans more than one line.
The following assembly language statement is equivalent to the machine language instruction 0xe3a03000:

mov r3, 0
Next, notice that the pattern of each assembly line falls into one of three categories:
  • comment
  • The ‘@’ character any place on a line designates the rest of the line as a comment.
  • blank lines
  • used for readability
  • statements
  • each of the assembly language lines is organized into four possible fields:
    
    label:    operation    operand(s)    @ comment    
        
    • label
    • gives a symbolic name to any line in the program. The memory location can then be referred to by this symbolic name.
    • operation
    • There are 2 types of operations:
      • An assembly language mnemonic
      • An assembler directive or pseudo op begins with the period (‘.’)
    • operand
    • comment
    The assembler requires at least one space or tab character to separate the fields.
The rules for identifiers are very similar to those for C/C++.
Identifiers are called symbol names, and case is significant.
  • Compiler-generated labels begin with the ‘.’ character
  • many system related names begin with the ‘_’ character.

Assembler Directives

Assembler directives are directions to the assembler to take some action or change a setting.
Assembler directives do not represent instructions, and are not translated into machine code.

Assembler directives begin with a period (“.”), and each directive is written on its own line, separate from any other assembler directive or instruction.
There are 4 main assembler directives:

  • .text
  • The .text directive tells the assembler that the information that follows is program text (assembly instructions), and the translated machine code is to be written to the text segment of memory.
    When a source code file is translated into machine code, an object file is produced.
    The object file format used is Executable and Linking Format (ELF).
    Programs that store information in ELF files store it in sections. The ELF standard specifies many different types of sections, each depending on the type of information stored in it.
    The .text directive specifies that when the following assembly language statements are translated into machine instructions, they should be stored in a text section in the object file. Text sections are used to store program instructions in machine code format.
  • .data
  • The .data directive tells the assembler that the information that follows is program data. The information following a .data directive will be data values, and will be stored in the data segment.
  • .label
  • A label is an address in memory corresponding to either an instruction or data value. It is just a convenience so the programmer can reference an address by a name. In the GNU assembler a label is written as a name followed by a colon, as in main:, rather than with a directive.
  • .number
  • The number directive tells the assembler to set aside 2 bytes of memory for a data value, and to initialize the memory to the given value. It will often be used with a label so that the 2-byte value can be referred to by name. In the GNU assembler, the .hword directive emits a 2-byte value.
GNU/Linux divides memory into different segments for specific purposes when a program is loaded from the disk. The four general categories are:
  • Text Segment
  • Where program instructions and constant data are stored.
    The operating system prevents a program from changing anything stored in the text segment, treating it as read-only memory during program execution. Also called code segment.
  • Data Segment
  • Where global variables and static local variables are stored.
    Both read-only and read-write data segments can occur in a program. It remains in place for the duration of program execution.
  • Stack Segment
  • Where automatic local variables and the data that links functions are stored.
    It is read-write memory that is allocated and deallocated dynamically as the program executes.
  • Heap Segment
  • The pool of memory available when a C program calls the malloc function (or C++ calls new).
    It is read-write memory that is allocated and deallocated by the program.

The operating system needs to view an ELF file as a set of segments. One of the functions of the ld program is to group ELF sections together into segments so that they can be loaded into memory.
When the operating system loads the program into memory, it uses the segment view of the ELF file. Thus, for example, the contents of all the text sections will be loaded into the text segment of the program process.
The readelf program is also useful for learning about ELF files.

The AArch32 target selection directives specify code generation parameters for AArch32 targets.
The following three directives identify the characteristics of the ARM processor this code will run on:


.cpu     cortex-a53
.fpu     neon-fp-armv8
.syntax unified         @ modern syntax
There are many variations of the ARM architecture, and the assembler needs to know which one this code is intended for. The appropriate values for each directive for the various Raspberry Pi models are given below:
Raspberry Pi                 .cpu            .fpu
Pi Zero, Pi 1 A+, Pi 1 B+    arm1176jzf-s    vfp
Pi 2 B                       cortex-a7       neon-vfpv4
Pi 3 B                       cortex-a53      neon-fp-armv8

The first assembler directive in the text segment has one operand, 2:


.align  2
For the ARM, this tells the assembler to ensure that the lowest two bits of the starting address of the generated code are zero.
That is, the addressing is adjusted, incremented if necessary, to be a multiple of four.
Each machine instruction is four bytes long, so this ensures proper alignment of the instructions in memory.

The .global directive makes the name globally known, so that code outside this file can refer to it.


.global  main  
  
When a program is executed, the operating system does some preliminary set up of system resources. It then starts program execution by calling a function named “main,” so the name must be global in scope.

The following directive declares the label, main, as the name of a function in the program:


.type   main, %function  
The .file directive simply identifies the original C source code file:

.file   "doNothingProg1.c"
The .size directive gives the number of bytes in the code, and the .ident directive lists the version of the compiler that produced this assembly language.

These directives are used to describe the characteristics of the statements that follow.
They are not translated into actual machine instructions, and none of them occupy any memory in the finished program.

9.2 First Assembly Language Instructions

To see the details of each instruction, you need to read the ARM manuals:
  • ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition for 32-bit
  • Architecture Reference Manual ARMv8, for ARMv8-A architecture profile for 64-bit
The ARM actually provides a second instruction set called “Thumb.” It allows for either 16-bit or 32-bit instructions.
I will use ‘%’ to add my comments.

9.2.1 Some Notation

The syntax that ARM uses for their assembly language is called Unified Assembler Language (UAL).
The assembler, as, recognizes the UAL syntax if you use the assembler directives to identify the ARM model correctly.
The version of gcc running on Raspbian at the time of writing (August 2016) uses pre-UAL syntax. The differences are minor.
For example, the compiler-generated assembly language uses a ‘#’ character to prefix each literal value:

        str     fp, [sp, #-4]!
But the UAL syntax specifies that the ‘#’ character is optional.
The ‘#’ character for immediate values will not be used in my examples in this book.
Using the UAL syntax when writing your own assembly language programs will become very important when we get to the floating-point instructions.

9.2.2 Condition Codes

Most AArch32 ARM instructions have an option that allows you to specify that the instruction will be executed only if a specific setting of the condition flags exists.
These settings are expressed by adding a mnemonic Condition Code to the instruction mnemonic.
Mnemonic suffixes for conditional execution of instructions. Meaning depends on whether the values are integers or floats:
The cond column shows the machine code.

9.2.3 Shift Options

Many ARM instructions include an option to shift one of the data values during the operation that the instruction performs.
Mnemonic codes for adding shifts to instructions. The ‘#’ is optional.
As an example of how the shifting syntax is used,

mov     r0, 12              @ store 12 in r0
mov     r1, 60              @ store 60 in r1
add     r2, r0, r1, lsl 2   @ r2 = r0 + (r1 << 2); the shifted operand is 240, r1 itself is unchanged
would store 252 in r2.

To let the amount of the shift be under program control, place the shift amount in a register:


mov     r0, 12
mov     r1, 60
mov     r3, 2
add     r2, r0, r1, lsl r3

9.2.4 First Instructions

Even though the program does nothing, it uses six instructions.
  • MOV
  • Copies (moves) a value into a register. Format:
    
    MOV{S}{<c>}   <Rd>, #<const>           % immediate
    MOV{S}{<c>}   <Rd>, <Rm>               % register    
        
    • S
    • If ‘S’ is present, the condition flags are updated according to the value being moved.
      If absent, the condition flags are not changed.
    • c
    • <c> is the condition code
    • Rd
    • specifies the destination register
    • Rm
    • the source register
    • const
    • an integer in the range -257 ≤ const ≤ +256
  • MVN
  • Copies (moves) the complement (bitwise NOT) of a value into a register. Format:
    
    MVN{S}{<c>}   <Rd>, #<const>           % immediate
    MVN{S}{<c>}   <Rd>, <Rm>{, <shift>}    % register
    MVN{S}{<c>}   <Rd>, <Rm>, <type> <Rs>  % register-shifted register
            
  • ADD
  • Adds two integers. Format:
    
    ADD{S}{<c>}  {<Rd>,} <Rn>, #<const>           % immediate
    ADD{S}{<c>}  {<Rd>,} <Rn>, <Rm>{, <shift>}    % register
    ADD{S}{<c>}  {<Rd>,} <Rn>, <Rm>, <type> <Rs>  % register-shifted register
        
  • SUB
  • Subtracts two integers. Format:
    
    SUB{S}{<c>}   {<Rd>,} <Rn>, #<const>           % immediate
    SUB{S}{<c>}   {<Rd>,} <Rn>, <Rm>{, <shift>}    % register
    SUB{S}{<c>}   {<Rd>,} <Rn>, <Rm>, <type> <Rs>  % register-shifted register 
        
  • BX
  • Branches to another location in the program. The address of that location is in a register. Format:
    
    BX{<c>}    <Rm>
        
    The value in the Rm register is moved to the pc, thus causing program execution to branch to that location.
    The value in Rm does not change.
  • LDR
  • Loads a word from memory into a register. Format:
    
    LDR<c>  <Rt>, <label>                  % Label
    LDR<c>  <Rt>, [<Rn>{, #+/-<imm>}]      % Offset
    LDR<c>  <Rt>, [<Rn>, #+/-<imm>]!       % Pre-indexed
    LDR<c>  <Rt>, [<Rn>], #+/-<imm>        % Post-indexed    
        
    • <Rt> is the destination register, and <Rn> is the base register
    • <label> is a labeled memory address
    The memory address to load the word from is determined in the following way:
    • label form
    • the address corresponding to the <label>
    • offset form
    • the signed integer, <imm>, is added to the value in the base register, <Rn>, and the value at this address is loaded into <Rt>; the base register is not changed.
    • Pre-indexed form
    • the signed integer is added to the value in the base register, <Rn>, the base register is updated to the new address, and then the value at this new address is loaded into <Rt>.
    • Post-indexed form
    • the value in the base register, <Rn>, is used as an address, and the value at that address is loaded into <Rt>. Then the signed integer is added to the value in the base register.
  • STR
  • Stores a word from a register into memory. Format:
    
    STR<c>  <Rt>, <label>                  % Label
    STR<c>  <Rt>, [<Rn>{, #+/-<imm>}]      % Offset
    STR<c>  <Rt>, [<Rn>, #+/-<imm>]!       % Pre-indexed
    STR<c>  <Rt>, [<Rn>], #+/-<imm>        % Post-indexed
        
    • <Rt> is the source register, and <Rn> is the base register.
    • <label> is a labeled memory address.

9.2.5 Code Walkthrough

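Each function, while it is executing, has a frame that represents the region of memory the function uses.
The system variable that points to the address where the current function's local variables begin is called the frame pointer.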

A call stack is composed of stack frames.
The stack frame at the top of the stack is for the currently executing routine, which can access information within its frame (such as parameters or local variables).
The stack frame usually includes at least the following items (in push order):

  • the arguments (parameter values) passed to the routine (if any);
  • the return address back to the routine's caller;
  • space for the local variables of the routine (if any).
When a subroutine starts running, the frame pointer and the stack pointer contain the same address.
While the subroutine is active, the frame pointer points at the top of the stack (stacks grow downward).
  1. 
    str     fp, [sp, -4]!   @ save caller frame pointer
    	
      This instruction
    • first determines a memory address by subtracting 4 from the address in the sp register and updates the sp register to this new address.
    • It then stores the address in the fp register in memory at this new address.
    The ‘!’ character following the [sp, -4] construct causes the value in the sp register to be modified by the numerical value (-4), so the value in sp is 4 less after this instruction is executed.
    Each function in the program has its own area of the stack, known as a Stack Frame.
    The function keeps track of where its frame is by maintaining its memory address in the fp register.

9.3 Creating a Program in Assembly Language

9.4 Programming Exercises

9.5 Assemblers and Linkers

Chapter 10 Structure of the main Function

10.1 Passing Arguments in Registers

10.2 The Stack

10.3 Stack Management In a Function

10.4 Programming Exercise

10.5 Local Variables on the Stack

10.6 Programming Exercise

10.7 Data Storage in Memory
