January 12, 2019

Memory Alignment Issue


Alignment fundamentals

When a computer reads from or writes to a memory address, it does so in word-sized chunks (for example, 4-byte (32-bit) chunks on the MPC8360). We'll call the size in which a processor accesses memory its memory access granularity.

Data alignment means placing data at a memory address that is a multiple of the processor's memory access granularity, which improves performance because of the way the CPU handles memory. Many CPUs access aligned addresses most efficiently, and some cannot access unaligned addresses at all.

To illustrate how a processor's memory access granularity affects memory access, let's compare the following tasks:

read 4 bytes from address 0 into the processor's register
read 4 bytes from address 1 into the same register

On a processor with 4-byte memory access granularity:

reading 4 bytes from the aligned address 0 needs a single memory access
reading 4 bytes from the unaligned address 1 into the same register needs two memory accesses, whose results must then be shifted and merged
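
The difference can be sketched in C. This is only a model, assuming little-endian byte numbering and a memory that can be read one aligned 4-byte word at a time; word_mem and read_u32 are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: memory that can only be read one aligned
 * 4-byte word at a time. Byte n of this memory holds the value 0xnn
 * repeated (0x00, 0x11, 0x22, ...), little-endian. */
static const uint32_t word_mem[2] = { 0x33221100u, 0x77665544u };

/* Reads 4 bytes starting at byte address addr (0..4) and reports in *reads
 * how many aligned word reads were needed. */
static uint32_t read_u32(size_t addr, unsigned *reads)
{
    size_t   idx   = addr / 4;                 /* word holding the first byte */
    unsigned shift = (unsigned)(addr % 4) * 8; /* bit offset inside that word */
    uint32_t lo;

    assert(addr <= 4);      /* keep the second read inside word_mem */
    lo = word_mem[idx];
    *reads = 1;
    if (shift == 0)
        return lo;          /* aligned: one read is enough */

    /* unaligned: fetch the next word too, then shift and merge */
    *reads = 2;
    return (lo >> shift) | (word_mem[idx + 1] << (32 - shift));
}
```

Reading at address 0 returns the first word after one read, while reading at address 1 touches both words before the four requested bytes can be assembled.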

The following shows an example of some memory addresses and their alignment on different architectures.

Linux: UNALIGNED MEMORY ACCESSES

Linux runs on a wide variety of architectures which have varying behavior when it comes to memory access.

The definition of an unaligned access

Unaligned memory accesses occur when you try to read N bytes of data starting from an address that is not evenly divisible by N (i.e. addr % N != 0).
For example, reading 4 bytes of data from address 0x10004 is fine, but reading 4 bytes of data from address 0x10005 would be an unaligned memory access.
The context here is at the machine code level: certain instructions read or write a number of bytes to or from memory (e.g. movb, movw, movl in x86 assembly).

Natural alignment

When accessing N bytes of memory, the base memory address must be evenly divisible by N, i.e.


 addr % N == 0
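
As a small sketch, the rule translates directly into a helper function. The name is_naturally_aligned is hypothetical and not part of any kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when an n-byte access starting at addr is naturally aligned. */
static bool is_naturally_aligned(uintptr_t addr, size_t n)
{
    assert(n != 0);         /* an access size of zero makes no sense here */
    return addr % n == 0;
}
```

A pointer p can be checked with is_naturally_aligned((uintptr_t)p, sizeof *p).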

Why unaligned access is bad

The effects of an unaligned memory access vary from architecture to architecture. A summary of the common scenarios is presented below:

Some architectures are able to perform unaligned memory accesses transparently, but there is usually a significant performance cost.
Some architectures raise processor exceptions when unaligned accesses happen. The exception handler is able to correct the unaligned access, at significant cost to performance.
Some architectures raise processor exceptions when unaligned accesses happen, but the exceptions do not contain enough information for the unaligned access to be corrected.
Some architectures are not capable of unaligned memory access, but will silently perform a different memory access to the one that was requested, resulting in a subtle code bug that is hard to detect!

It should be obvious from the above that if your code causes unaligned memory accesses to happen, your code will not work correctly on certain platforms and will cause performance problems on others.

Code that does not cause unaligned access

For example, take the following structure::


 struct foo {
  u16 field1;
  u32 field2;
  u8 field3;
 };

Assume an instance of this structure resides in memory starting at address 0x10000. You might expect field2 to be located 2 bytes into the structure, i.e. at address 0x10002, but that address is not evenly divisible by 4.
Fortunately, the compiler understands the alignment constraints, so in the above case it inserts 2 bytes of padding between field1 and field2.
Therefore, for standard structure types you can always rely on the compiler to pad structures so that accesses to fields are suitably aligned.
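
The padding can be observed from user space with offsetof and sizeof. The sketch below assumes the fixed-width types from <stdint.h> as stand-ins for the kernel's u16/u32/u8, and print_foo_layout is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Same field order as the example above, with <stdint.h> types
 * standing in for the kernel's u16/u32/u8 (an assumption). */
struct foo {
    uint16_t field1;
    uint32_t field2;
    uint8_t  field3;
};

/* The compiler must place field2 at an offset suitable for its type. */
static_assert(offsetof(struct foo, field2) % _Alignof(uint32_t) == 0,
              "field2 is suitably aligned within the structure");

/* Prints where the compiler actually placed each field. */
static void print_foo_layout(void)
{
    printf("field1 at %zu, field2 at %zu, field3 at %zu, size %zu\n",
           offsetof(struct foo, field1), offsetof(struct foo, field2),
           offsetof(struct foo, field3), sizeof(struct foo));
}
```

On ABIs where uint32_t is 4-byte aligned (x86-64 Linux, for instance), field2 lands at offset 4 and the structure occupies 12 bytes: 2 bytes of padding after field1 and 3 more after field3.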

Similarly, you can also rely on the compiler to align variables and function parameters to a naturally aligned scheme, based on the size of the type of the variable.

At this point, it should be clear that accessing a single byte (u8 or char) will never cause an unaligned access, because all memory addresses are evenly divisible by one.

The optimal layout of the above example is::


 struct foo {
  u32 field2;
  u16 field1;
  u8 field3;
 };

For a natural alignment scheme, the compiler would only have to add a single byte of padding at the end of the structure.
Padding can also be suppressed entirely by tagging the structure with __attribute__((packed)). In that case the compiler is aware of the alignment constraints and will generate extra instructions to perform the memory access in a way that does not cause unaligned access.
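
A short sketch comparing the two layouts, assuming GCC or Clang (the packed attribute is a compiler extension) and <stdint.h> types in place of the kernel's. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields ordered by decreasing size: only tail padding is needed. */
struct foo_reordered {
    uint32_t field2;
    uint16_t field1;
    uint8_t  field3;
};

/* Original field order with padding suppressed (GCC/Clang extension). */
struct foo_packed {
    uint16_t field1;
    uint32_t field2;   /* sits at offset 2, so it is not naturally aligned */
    uint8_t  field3;
} __attribute__((packed));

static_assert(sizeof(struct foo_packed) == 7, "packed struct has no padding");

/* Accessing the member by name lets the compiler choose a safe access
 * sequence, since it knows field2 may be unaligned. */
static uint32_t read_packed_field2(const struct foo_packed *p)
{
    assert(p != NULL);
    return p->field2;
}
```

Note that taking the address of field2 in the packed structure and dereferencing it through a plain uint32_t pointer would discard that knowledge again.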

Code that causes unaligned access

The following function, taken from include/linux/etherdevice.h, is an optimized routine to compare two Ethernet MAC addresses (48 bits, 6 bytes) for equality::


 bool ether_addr_equal(const u8 *addr1, const u8 *addr2)
 {
 #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
  u32 fold = ((*(const u32 *)addr1) ^ (*(const u32 *)addr2)) |
      ((*(const u16 *)(addr1 + 4)) ^ (*(const u16 *)(addr2 + 4)));

  return fold == 0;
 #else
  const u16 *a = (const u16 *)addr1;
  const u16 *b = (const u16 *)addr2;
  return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) == 0;
 #endif
 }

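When CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set, the hardware handles unaligned accesses efficiently and there is no issue with this code. When it is not set, the hardware cannot access memory on arbitrary boundaries, and the reference to a[0] causes 2 bytes (16 bits) to be read from memory starting at address addr1. If addr1 is an odd address such as 0x10003, that read is an unaligned access, so the function only works reliably when both addresses are 16-bit aligned.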
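
When the alignment of the two addresses cannot be guaranteed, a byte-wise comparison avoids the problem at some cost in speed. The following is a user-space sketch only; mac_equal_any_alignment is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compares two 6-byte MAC addresses without assuming any alignment,
 * because memcmp() operates on bytes. */
static bool mac_equal_any_alignment(const uint8_t *addr1, const uint8_t *addr2)
{
    assert(addr1 != NULL && addr2 != NULL);
    return memcmp(addr1, addr2, 6) == 0;
}
```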

Here is another example of some code that could cause unaligned accesses::


 void myfunc(u8 *data, u32 value)
 {
  [...]
  *((u32 *) data) = cpu_to_le32(value);
  [...]
 }

This code will cause unaligned accesses every time the data pointer parameter points to an address that is not evenly divisible by 4.
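
One alignment-safe way to write such a store, sketched here for user space, is to emit the value one byte at a time. store_le32 is a hypothetical helper; it produces little-endian byte order directly, so no cpu_to_le32() conversion is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stores value at data in little-endian byte order using only single-byte
 * writes, so data may have any alignment. */
static void store_le32(uint8_t *data, uint32_t value)
{
    assert(data != NULL);
    data[0] = (uint8_t)(value);
    data[1] = (uint8_t)(value >> 8);
    data[2] = (uint8_t)(value >> 16);
    data[3] = (uint8_t)(value >> 24);
}
```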

Avoiding unaligned accesses

The easiest way to avoid unaligned access is to use the get_unaligned() and put_unaligned() macros provided by the <asm/unaligned.h> header file, which is architecture-dependent.

Read from an address:


 u32 value = get_unaligned((u32 *) mem_addr);

Write to an address:


 put_unaligned(value, (u32 *) mem_addr);

These macros work for memory accesses of any length (not just 32 bits as in the examples above). Be aware that when compared to standard access of aligned memory, using these macros to access unaligned memory can be costly in terms of performance.
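
These macros are kernel-only. In user-space code a similar effect is commonly obtained with memcpy(), which makes no alignment assumptions; compilers often reduce such a fixed-size copy to a single load or store on architectures that permit it. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Loads a 32-bit value in host byte order from an address of any alignment. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Stores a 32-bit value in host byte order to an address of any alignment. */
static void store_u32_unaligned(void *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}
```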

How does the ARM Compiler support unaligned accesses?

Older ARM processors require data loads and stores to be to/from architecturally aligned addresses. This means:


LDRB/STRB          - address must be byte aligned
LDRH/STRH          - address must be 2-byte aligned 
LDR/STR            - address must be 4-byte aligned

Load/store instructions that act on multiple registers, for example LDM, are considered as working with multiple word quantities, so these instructions also require 4-byte aligned addresses. An unaligned load is one where the address does not match the architectural alignment.

On older processors, such as ARM9 family based processors, an unaligned load had to be synthesized in software, typically by doing a series of small accesses and combining the results.

The ARMv6 architecture introduced the first hardware support for unaligned accesses. ARM11 and Cortex-A/R processors can deal with unaligned accesses in hardware, removing the need for software routines.

Support for unaligned accesses is limited to a sub-set of load/store instructions:


LDRB/LDRSB/STRB
LDRH/LDRSH/STRH
LDR/STR

Instructions which do NOT support unaligned accesses include:


LDM/STM
LDRD/STRD

Further, unaligned accesses are only allowed to regions marked as Normal memory type, and alignment checking must be disabled by clearing the SCTLR.A bit in the system control coprocessor (on ARMv6, the SCTLR.U bit must additionally be set to select the unaligned access model). Attempts to perform unaligned accesses when not allowed will cause an alignment fault (data abort).

How hardware supports unaligned accesses

In many cases a processor cannot generate an unaligned access on its interfaces to the memory system. This applies to caches, TCMs and the system bus. In these situations, the processor will generate a series of accesses, to implement the unaligned access. This is similar to the software routines used for previous processors.

For example:


    LDR r0, [0x8001]

Most modern ARM processors have 64-bit or 128-bit interfaces. In the above example a processor would typically read the 64-bit or 128-bit block that contains bytes 0x8001, 0x8002, 0x8003 and 0x8004, discarding the other bytes.

Another example:


    LDR r0, [0x81FE]

The four bytes of this load (0x81FE to 0x8201) cross address 0x8200, which is both a 64-bit and a 128-bit boundary. So with either interface width, the processor would have to perform two reads.

In both of these examples it is possible to see that unaligned accesses require more work by the hardware. While more efficient than the software routines required by previous processors, it is still less efficient than aligned accesses.
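
The number of interface reads can be estimated with a little arithmetic. The sketch below is a simplified model that ignores caching and any merging the memory system might do; bus_reads is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Number of aligned reads of `width` bytes needed to fetch `size` bytes
 * starting at `addr`. width must be a power of two. */
static unsigned bus_reads(uint32_t addr, uint32_t size, uint32_t width)
{
    assert(size > 0);
    assert(width > 0 && (width & (width - 1)) == 0);
    assert(addr <= UINT32_MAX - (size - 1));    /* no address wrap-around */

    uint32_t first = addr / width;              /* block holding the first byte */
    uint32_t last  = (addr + size - 1) / width; /* block holding the last byte */
    return (unsigned)(last - first + 1);
}
```

With this model, bus_reads(0x8001, 4, 8) and bus_reads(0x8001, 4, 16) both give 1, while a 4-byte load starting at 0x800E straddles a multiple of 16 and gives 2 for either width.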

Pointer alignment in C

When compiling C, variables are by default architecturally aligned. A global of type int (or uint32_t) will be 4-byte aligned in memory. Similarly, a pointer of type int* is expected to contain a 4-byte aligned address.

Where this is not the case (or may not be the case) the variable or pointer MUST be marked with the __packed keyword. This is a warning to the compiler that this variable, structure or pointer is potentially unaligned. Technically, it reduces the expected alignment of the pointer to 1 byte. It is also possible to set the alignment of all pointers to 1 by using the compiler command line switch --pointer_alignment=1, in which case the compiler will treat all accesses through pointers as though they may be unaligned.

Compiler assumptions

When compiling for an ARMv6 or ARMv7-A/R processor, the ARM Compiler will assume that it can use unaligned accesses.

The --no_unaligned_access flag tells the compiler not to knowingly generate unaligned accesses. What is the significance of knowingly?

As mentioned above, a pointer should contain an address with correct alignment for the type.

uint32_t* requires 4-byte alignment
uint16_t* requires 2-byte alignment
uint8_t* requires 1-byte alignment

For structures, the alignment is that of the most aligned member.
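
These requirements can be checked at compile time with C11's alignof. The sketch below is illustrative only: struct sample is a hypothetical type, and the equality holds on common ABIs rather than being guaranteed by the C standard.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure mixing members of different alignment. */
struct sample {
    uint8_t  a;
    uint32_t b;   /* the most strictly aligned member */
    uint16_t c;
};

/* On common ABIs the structure takes the alignment of that member. */
static_assert(alignof(struct sample) == alignof(uint32_t),
              "struct alignment follows its most aligned member");

/* Returns the alignment the compiler requires for struct sample. */
static size_t sample_alignment(void)
{
    return alignof(struct sample);
}
```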

The compiler will generate code on the assumption that a pointer is correctly aligned. It does not add code to perform run-time checks. A pointer may contain an incorrectly aligned address for a number of reasons. A common cause is casting:

uint8_t tmp;
uint32_t* pMyPointer = (uint32_t*)(&tmp);

This code takes the address of a uint8_t variable, then casts that address to a uint32_t pointer. The compiler will still assume that pMyPointer is correctly aligned for a uint32_t pointer. The compiler might then unknowingly generate code that results in an unaligned access.

This can be avoided using the __packed qualifier:

__packed uint32_t* pMyPointer = (__packed uint32_t*)(&tmp);

Code Generation

When unaligned accesses are permitted, the compiler will continue to use instructions that support unaligned accesses (for example LDR and STR) for accesses through __packed pointers. However it will not use instructions such as LDM which do not support unaligned accesses.

When unaligned accesses are not permitted, either because the code is being built for an ARMv4 or ARMv5 processor, or because --no_unaligned_access is specified, the compiler will access __packed data by performing a number of aligned accesses. Usually, this is done by calling a library function such as __aeabi_uread4().

Device Memory

Address regions that are used to access peripherals rather than memory should be marked as Device memory. Depending upon the processor, this may be configured in the Memory Protection Unit (MPU) or the Memory Management Unit (MMU). Unaligned accesses are not permitted to these regions even when unaligned access support is enabled. If an unaligned access is attempted, the processor will take an abort.

The compiler does not have any information on which address ranges are device memory, and it is therefore the responsibility of the person writing the code to ensure that accesses to devices are aligned. In practice, this usually is the case simply because peripheral registers are at aligned addresses. It is also usual to access peripheral registers through volatile variables or pointers, which restricts the compiler to accessing the data with the size of access specified where possible. For further information on the restrictions imposed on volatile types, please see section 7.1.5 of the Procedure Call Standard for the ARM Architecture.

It is also necessary to avoid using C library functions such as memcpy() to access Device memory, as there is no guarantee of the type of accesses these functions will use. If it is necessary to copy a buffer of memory to a Device memory, you should provide a suitable copying routine and call this instead of memcpy().
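
Such a routine might look like the sketch below, which assumes the device accepts 32-bit word writes and that both buffers are 4-byte aligned. copy_to_device32 is a hypothetical name; the volatile qualifier is what asks the compiler to keep one access of the declared size per word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies `words` 32-bit words to a device region using one aligned
 * 32-bit store per word. Both pointers must be 4-byte aligned. */
static void copy_to_device32(volatile uint32_t *dst, const uint32_t *src,
                             size_t words)
{
    assert((uintptr_t)dst % sizeof(uint32_t) == 0);
    assert((uintptr_t)src % sizeof(uint32_t) == 0);

    for (size_t i = 0; i < words; i++)
        dst[i] = src[i];
}
```

A device that only tolerates byte or halfword accesses would need the same loop written with the matching element type.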

Performance

If code frequently accesses unaligned data, there may be a performance advantage in enabling unaligned accesses. However, the extent of this advantage will be dependent on many factors. Even though this support allows a single instruction to access unaligned data, this will often require multiple bus accesses to occur. Therefore the bus transactions performed by an unaligned access may be similar to those performed by the multiple instructions used when unaligned access support is disabled. The code without unaligned access support will have to perform various shift and logical operations, but on a multi-issue processor the execution time of these may be hidden by executing them in parallel with the memory accesses. There will also be a function call overhead when functions such as __aeabi_uread4() are used, though the impact of these may be reduced by branch prediction.
