Understanding alignment in STM32F405xG ld linker script

swinchen · 2014-01-10

Posted on January 10, 2014 at 17:10

Hello. I am looking at one of the GNU linker scripts for the ChibiOS project:

https://github.com/mabl/ChibiOS/blob/master/os/ports/GCC/ARMCMx/STM32F4xx/ld/STM32F405xG.ld

I understand this script for the most part, but I am very confused by the alignment of the sections. Namely:

1. Why are the startup and .text output sections, and the input sections inside them, aligned to 16-byte boundaries? I asked in the ChibiOS forum but the response did not make a great deal of sense... I was told that a single flash read is 128 bits (16 bytes), so these sections are aligned to avoid reading across one of these boundaries. Wouldn't that depend on the code being executed?
2. Given the above question... why are the constructors and destructors aligned to 4-byte boundaries? It seems as though they are executable code, much like the .text section.
3. In terms of the stacks: I understand from reading the AAPCS: "The stack must also conform to the following constraint at a public interface: SP mod 8 = 0. The stack must be double-word aligned." This may explain why the stack is aligned as it is by the linker... but what happens if the code before a function call pushes a 32-bit value onto the stack? It seems as though it would no longer be 64-bit aligned.
4. Any particular reason for the 4-byte alignment of input sections in the .data and .bss sections? I am assuming the compiler handles alignment in each input section.
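The arithmetic behind these questions can be sketched in a few lines of C. This is only an illustration, not code from ChibiOS: `align_up` mirrors the round-up that the GNU ld manual gives for `ALIGN(n)`, and `sp_after_push` and `sp_is_aapcs_aligned` are hypothetical helper names that model a full-descending stack where each pushed register moves SP down by 4 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two).
 * This is the same computation the linker performs for ALIGN(align). */
uint32_t align_up(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1u)) == 0);
    return (addr + align - 1u) & ~(align - 1u);
}

/* Hypothetical model: SP after pushing `words` 32-bit registers
 * onto a full-descending stack. */
uint32_t sp_after_push(uint32_t sp, uint32_t words)
{
    return sp - 4u * words;
}

/* AAPCS constraint at a public interface: SP mod 8 == 0. */
int sp_is_aapcs_aligned(uint32_t sp)
{
    return (sp & 7u) == 0;
}
```

For example, `align_up(0x08000001, 16)` gives `0x08000010`, while an address that is already on a 16-byte boundary is left unchanged. Starting from an 8-byte-aligned SP, pushing one word breaks the AAPCS constraint and pushing two restores it. The constraint only has to hold at public interfaces, which is why compilers commonly save an even number of registers before calling out.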

Thanks for the help!

Sam

#stm32f405-linker-scatter-gnu-ld

Tesla DeLorean · 2014-01-13

Posted on January 14, 2014 at 02:54

Implementation details are hard to come by; one could use the core's cycle counter to make reasonable guesses about how the ART functions.

The design of the ART isn't going to have an instruction-spanning issue: the 32-bit words the core reads will always fall on a 32-bit boundary, and the core will assemble the instruction stream internally, prefetching as it goes. One easy optimization for the flash would be to queue up the next read; this may well not even get into the ART cache, but it could be ready if required. If it does get into the ART cache, it is going to evict something else.

For something that's strapped on outside the core, it does operate quite efficiently. As I recall, the tests I did suggested it is faster than executing out of RAM, but less predictable.


chen · 2014-01-14

Posted on January 14, 2014 at 11:18

Hi

"How does the core decode the combination of 16- and 32-bit instructions?"

Technically, ARM Cortex-M4 cores are RISC processors, which classically implies that all instructions are the same width.

HOWEVER, ARM, to allow compatibility between their different variants, has sub-instruction sets, often referred to as 'Thumb'. There is a Thumb-2 as well.

The original Thumb instruction set is only 16 bits wide; Thumb-2 adds 32-bit encodings alongside the 16-bit ones.

Most tool chains (e.g. IAR and Atollic) let you set an option to use the native ARM or the Thumb instruction set, although Cortex-M cores only execute Thumb.

"What happens if it tries to load a 16-bit and half of a 32-bit instruction?"

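With Thumb-2 the core can mix instruction widths: the code is a stream of 16-bit halfwords in which some instructions occupy one halfword and others occupy two, so a 32-bit instruction only needs 2-byte alignment and the core has to cope with one that straddles a fetch (I rarely look in that much detail, though).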
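As a concrete illustration of how a decoder tells the two widths apart in the Thumb-2 encoding used by ARMv7-M cores such as the Cortex-M4: the top five bits of the first halfword decide the size. The values 0b11101, 0b11110 and 0b11111 mark the first half of a 32-bit instruction; anything else is a complete 16-bit instruction. The sketch below is a host-side model with hypothetical function names, not anything taken from the core or from ChibiOS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the Thumb instruction whose first halfword is hw1. */
unsigned thumb_insn_size(uint16_t hw1)
{
    unsigned top5 = (unsigned)hw1 >> 11;
    return (top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu) ? 4u : 2u;
}

/* Count the instructions in a stream of n halfwords.
 * A 32-bit instruction truncated at the end of the stream is still counted. */
size_t thumb_count_insns(const uint16_t *hw, size_t n)
{
    assert(hw != NULL);
    size_t i = 0, count = 0;
    while (i < n) {
        i += thumb_insn_size(hw[i]) / 2u;  /* advance by 1 or 2 halfwords */
        count++;
    }
    return count;
}
```

For instance, the halfword stream `0xB510, 0xF000, 0xF800, 0xBD10` (a `push`, a two-halfword `bl`, and a `pop`) decodes as three instructions, with the 32-bit `bl` starting on a 2-byte rather than a 4-byte boundary.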

Sadly, this further complicates the issue for you, but you did ask about 16-bit-wide instructions.

As Clive1 points out, the cache is filled one line (128 bits) at a time.

This may (or may not) help with execution speed. It depends on what is going to happen next.

ChibiOS, by forcing the start of code sections onto a 128-bit boundary, helps them work with the STM32 flash architecture. This does not guarantee it will be faster than another OS; it just gives it a better chance.
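One rough way to see what "a better chance" means is to count how many 128-bit flash lines a block of code touches for a given start address. The sketch below assumes a 16-byte line as discussed in this thread; `FLASH_LINE_BYTES` and `flash_lines_touched` are illustrative names, and the model says nothing about actual ART timing or cache behaviour.

```c
#include <assert.h>
#include <stdint.h>

#define FLASH_LINE_BYTES 16u  /* assumed 128-bit flash read width */

/* Number of flash lines covered by `size` bytes of code starting at `start`. */
uint32_t flash_lines_touched(uint32_t start, uint32_t size)
{
    assert(size > 0);
    uint32_t first = start / FLASH_LINE_BYTES;
    uint32_t last  = (start + size - 1u) / FLASH_LINE_BYTES;
    return last - first + 1u;
}
```

A 16-byte routine placed at `0x08000000` fits in one line, but the same routine at `0x08000008` needs two. A 20-byte routine needs two lines when aligned and three when it starts at `0x0800000E`. Aligning the start never increases the line count, which is the most the linker script can promise.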