Understanding alignment in STM32F405xG ld linker script

swinchen · ‎2014-01-10

Posted on January 10, 2014 at 17:10

Hello. I am looking at https://github.com/mabl/ChibiOS/blob/master/os/ports/GCC/ARMCMx/STM32F4xx/ld/STM32F405xG.ld, one of the GNU linker scripts for the ChibiOS project. I understand this script for the most part, but I am very confused by the alignment of the sections. Namely:

Why are output and input sections of the startup and .text output sections aligned to 16 byte boundaries? I asked in the ChibiOS forum but the response did not make a great deal of sense... I was told that a single read is 128 bits (16 bytes) so these sections are aligned to avoid reading across one of these boundaries. Wouldn't that depend on the code being executed?
Given the above question... why are the constructors and destructors aligned to 4 byte boundaries? It seems as though they are executable code, much like the .text section.
In terms of the stacks: I understand from reading the AAPCS: ''The stack must also conform to the following constraint at a public interface: SP mod 8 = 0. The stack must be double-word aligned.'' which may explain why the stack is aligned as it is by the linker... but what happens if the code before a function call pushes a 32-bit value onto the stack? It seems as though it would no longer be 64-bit aligned.
Any particular reason for the 4 byte alignment of input sections in the .data and .bss sections? I am assuming the compiler handles alignment in each input section.

Thanks for the help!

Sam

#stm32f405-linker-scatter-gnu-ld

Tesla DeLorean · ‎2014-01-10

Posted on January 10, 2014 at 17:42

The reason to align a section is to permit things within that section to be placed up to that granularity. Multiple objects may feed into a single section.

All data and variables should be placeable at 32-bit boundaries. Data paths and memories are optimized and decoded on that premise.

For flash, the key thing to understand is the line width of the memory. Flash is slow (figure 35 ns whether you read 1 byte or a dozen), so if you read a value that spans two lines it will be doubly slow. The flash line width in the F4 is 128 bits, and the cache (ART) sitting in front of it is also 128 bits wide. Subroutines and branch targets tend to work better if aligned: the initial hit occurs if there is a cache miss, but the next 7 16-bit instructions are fetched for free as part of the optimized prefetch path.
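To make the arithmetic concrete, here is a minimal sketch in C of the two calculations involved: rounding an address up to a boundary (which is what `ALIGN(16)` does to the location counter in a GNU ld script) and checking whether a single access crosses a 128-bit flash line. The function names and the `FLASH_LINE_BYTES` constant are made up for illustration; only the 16-byte line width comes from the post above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLASH_LINE_BYTES 16u  /* 128-bit flash line, as described above */

/* Round addr up to the next multiple of align (a power of two).
   This mirrors the effect of ALIGN(align) on the ld location counter. */
uint32_t align_up(uint32_t addr, uint32_t align)
{
    assert(align != 0u && (align & (align - 1u)) == 0u);
    return (addr + align - 1u) & ~(align - 1u);
}

/* True if an access of `size` bytes starting at `addr` touches two
   different flash lines, i.e. it crosses a 16-byte boundary. */
bool spans_two_lines(uint32_t addr, uint32_t size)
{
    assert(size > 0u);
    return (addr / FLASH_LINE_BYTES) != ((addr + size - 1u) / FLASH_LINE_BYTES);
}
```

For example, a 4-byte read at 0x0800001e covers 0x1e to 0x21 and so needs two lines, while the same read at 0x08000016 stays inside one.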


chen · ‎2014-01-10

Posted on January 10, 2014 at 17:59

Hi

Yes, very strange. Technically - the linker is part of the Compiler and Linker suite (ie gcc in this case). The OS (ChibiOS) is something that just runs like any other program as far as the compiler/linker is concerned.

''Any particular reason for the 4 byte alignment of input sections in the .data and .bss sections?''

ARM Cortex-M cores are 32-bit, so word-sized data should be at least 4-byte aligned (Thumb instructions themselves only need halfword alignment).

''The stack must be double-word aligned.'' which may explain why the stack is aligned as it is by the linker... but what happens if the code before a function call pushes a 32-bit value onto the stack? It seems as though it would no longer be 64-bit aligned.''

As far as stack alignment goes, my understanding is that it MUST be even. The stack is not only used for return addresses; it is also used for local variables and for passing parameters into and out of functions.

So what happens when you create a char as a local variable? The compiler and the processor must make sure there is a dummy byte inserted into the stack. The stack can be any mix of 2-byte/4-byte/8-byte words (as anybody who has had to hand-unwind the stack knows).

If you are foolish enough to pass bulk data as an input parameter, this gets pushed onto the stack.

Exception here is strings - compiler knows to push a pointer onto stack.

The .text output section normally holds ALL the executable code.

'' I asked in the ChibiOS forum but the response did not make a great deal of sense... I was told that a single read is 128 bits (16 bytes) so these sections are aligned to avoid reading across one of these boundaries.''

Well if that is what the ChibiOS does (reads 16bytes at a time) - then it makes sense to align to 16byte boundaries.

''why are the constructors and destructors aligned to 4 byte boundaries? It seems as though they are executable code, much like the .text section.''

Conceptually, objects are code and data bound together. However, the processor has no way to enforce this concept. It is up to the compiler to do this.

Yes, Constructors and Destructors are just functions and hence they must be 4byte aligned.

Constructors and Destructors are just functions hence they go in the .text section.
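One point that may help here, offered as a sketch and not as a statement about the ChibiOS script itself: with GCC, the constructor function body is ordinary code, but what the toolchain typically collects into the `.init_array` section is a table of pointers to such functions. A table of 32-bit pointers only needs 4-byte alignment, which would be consistent with the 4-byte alignment asked about. The names `ctor_ran`, `early_init`, `init_fn` and `require_ctor_ran` below are hypothetical.

```c
#include <assert.h>

/* Set by the constructor below; hypothetical name, for illustration only. */
int ctor_ran = 0;

/* GCC/Clang extension: call this function before main() starts.
   The body is normal code; the startup table only holds a pointer to it. */
__attribute__((constructor))
void early_init(void)
{
    ctor_ran = 1;
}

/* Each entry in such a startup table is a plain function pointer.
   On a 32-bit Cortex-M that is 4 bytes; on a 64-bit host it is 8. */
typedef void (*init_fn)(void);

/* Helper that fails loudly if the constructor did not run before the caller. */
void require_ctor_ran(void)
{
    assert(ctor_ran == 1);
}
```

Calling `require_ctor_ran()` from `main()` on a hosted GCC or Clang build should pass, because the runtime walks the pointer table before `main()` is entered.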

.bss is one of the sections for variable data. There is one for initialised data (.data) and one for un-initialised data that is zeroed at startup (.bss, containing statically-allocated variables).

OK, having just read Clive1's post - some of it makes more sense to me now.

os_kopernika · ‎2014-01-10

Posted on January 10, 2014 at 23:39

''Any particular reason for the 4 byte alignment of input sections in the .data and .bss sections?''

The data bus is 32-bit wide and such alignment speeds up the access to multibyte variables (no crossing boundaries)

''I am assuming the compiler handles alignment in each input section.''

I do not think so. The compiler has no knowledge about how to place those unless you explicitly use some attribute (though that is passed to the linker anyway).

''but what happens if the code before a function call pushes a 32-bit value onto the stack?''

Architecture Reference Manual: ''When the processor pushes the PSR value to the stack it uses bit[9] of the stacked PSR value to indicate whether it realigned the stack.''
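The quoted behaviour applies to exception entry. A rough model of it in C, assuming the basic 8-word frame (no floating-point context) and that stack realignment (STKALIGN) is enabled; the type and function names are invented for this sketch:

```c
#include <assert.h>
#include <stdint.h>

#define BASIC_FRAME_BYTES 0x20u      /* 8 words: r0-r3, r12, lr, pc, xPSR */
#define XPSR_REALIGN_BIT  (1u << 9)  /* bit[9] of the stacked PSR */

typedef struct {
    uint32_t frame_ptr;  /* address of the stacked frame */
    uint32_t xpsr_flag;  /* bit 9 of the stacked xPSR, set if SP was realigned */
} stacked_frame;

/* Model of exception entry: if SP is only 4-byte aligned, drop 4 extra
   bytes so the frame is 8-byte aligned, and record that in bit 9. */
stacked_frame exception_entry(uint32_t sp)
{
    assert((sp & 3u) == 0u);            /* SP is always word aligned */
    uint32_t realign = (sp >> 2) & 1u;  /* 1 when SP mod 8 == 4 */
    stacked_frame f;
    f.frame_ptr = (sp - BASIC_FRAME_BYTES) & ~(realign << 2);
    f.xpsr_flag = realign ? XPSR_REALIGN_BIT : 0u;
    return f;
}

/* Model of exception return: pop the frame and undo the realignment. */
uint32_t exception_return_sp(stacked_frame f)
{
    uint32_t extra = (f.xpsr_flag & XPSR_REALIGN_BIT) ? 4u : 0u;
    return f.frame_ptr + BASIC_FRAME_BYTES + extra;
}
```

So an interrupt arriving while SP is 0x20001004 still gets an 8-byte aligned frame, and the original SP is restored on return. For ordinary function calls the compiler is responsible for keeping SP 8-byte aligned at the call, typically by padding its own stack frame.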

swinchen · ‎2014-01-10

Posted on January 11, 2014 at 04:09

clive1: Where did you get that fancy graphic? I am confused by the block of 128-bit values. Thumb and Thumb-2 instructions are either 16 or 32 bits, so do they represent between 4 and 8 instructions? From the Architecture Manual: ''While instruction fetches are always halfword-aligned, some load and store instructions support unaligned addresses.'' (So aligning to 16 bytes shouldn't be needed?) If I knew what the 128 bits represented I think I would have a much better understanding. It looks like I need to read more about the ART as well.

What I meant when I said ''I am assuming the compiler handles alignment in each input section.'' is that the compiler must pack the data in a way that an input section does not violate the alignment (perhaps this is something the linker does as well?). How is an 8-bit value stored in flash? My guess would be that it would be stored in a 32-bit slot.

Also I found this linker script from TrueSTUDIO:

https://github.com/jeremyherbert/stm32-templates/blob/master/stm32f4-discovery/stm32_flash.ld

It aligns the .text section to 4 bytes. Is that any less valid than the way ChibiOS aligns the instructions?

Thanks all for the responses.

Edit: Between 4 and 8 instructions, not 2 and 4. Doh!

os_kopernika · ‎2014-01-11

Posted on January 11, 2014 at 20:14

''compiler must pack the data in a way''

OK, let's suppose some object file (let it be main.o) contains one .data section (-fno-data-sections) filled with various variables, among them structs, unions, uint8_t, uint64_t etc. Now, I understand the question is how the (gcc?) compiler manages that?

For structs, gcc places all the internal data one after another with padding in between, unless you explicitly request -fpack-struct or other switches (sizeof(somestruct) returns the true size, with padding).
For other variables - I think it aligns data to the size of the variable (1 byte for uint8_t and 8 bytes for uint64_t ??). Not sure about that.

When you enable the -fdata-sections switch, each variable is placed in its own section. But the (ld) linker does not care about variables, only sections (which amounts to the same thing with that switch), and here the linker-script rules apply.
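A small C sketch of the padding and alignment behaviour discussed above. The struct and the helper names are hypothetical; exact offsets depend on the ABI, so the checks only test properties that should hold on any conforming compiler. As far as I know, the compiler also records each section's required alignment in the object file, and the linker honours it in addition to whatever the script requests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct mixing byte-sized and word-sized members. */
struct sample {
    uint8_t  flag;
    uint32_t count;  /* compiler inserts padding before this member */
    uint8_t  tag;    /* and tail padding after this one */
};

/* Number of padding bytes the compiler placed between flag and count. */
size_t padding_before_count(void)
{
    return offsetof(struct sample, count) - sizeof(uint8_t);
}

/* Sanity checks on the layout rules; aborts if any of them is violated. */
void check_layout(void)
{
    /* Each member sits at a multiple of its own alignment. */
    assert(offsetof(struct sample, count) % _Alignof(uint32_t) == 0);
    /* The struct's size is a multiple of its alignment, so arrays work. */
    assert(sizeof(struct sample) % _Alignof(struct sample) == 0);
    /* The struct is at least as strictly aligned as its widest member. */
    assert(_Alignof(struct sample) >= _Alignof(uint32_t));
}
```

On the usual ARM EABI and x86 ABIs this struct comes out at 12 bytes with `count` at offset 4, but that is worth confirming with `sizeof` and `offsetof` on the actual target.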

Tesla DeLorean · ‎2014-01-11

Posted on January 11, 2014 at 22:39

@samuel - It's from one of the STM32F2/4 brochures. I'll dig one up shortly.

Remember this is external to the core, as the core designs don't provide any implementation for caching or TCM. The ART can provide prefetch data at basically zero cycles, compared to say SRAM where it will still take a cycle to fetch. So from flash, while you might have a 5-cycle hit for the first 16-bit word, the next 7 can be fed at 0 cost.
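As a back-of-the-envelope illustration of why the start address matters, here is a deliberately crude cost model in C. It assumes every new 128-bit line costs a fixed miss penalty and everything else in the line is free, and it ignores prefetch overlap and cache hits entirely. The 5-cycle figure is just the number from the post; the function names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 16u  /* 128-bit flash line */

/* How many distinct flash lines a block of code occupies. */
uint32_t lines_touched(uint32_t start, uint32_t size_bytes)
{
    assert(size_bytes > 0u);
    uint32_t first = start / LINE_BYTES;
    uint32_t last  = (start + size_bytes - 1u) / LINE_BYTES;
    return last - first + 1u;
}

/* Worst-case cycles to fetch the block once with a cold cache,
   charging miss_cycles for each line and nothing for the rest. */
uint32_t cold_fetch_cycles(uint32_t start, uint32_t size_bytes,
                           uint32_t miss_cycles)
{
    return lines_touched(start, size_bytes) * miss_cycles;
}
```

Under this model a 40-byte function starting on a line boundary at 0x08000010 touches 3 lines, while the same function starting at 0x0800001e touches 4, so alignment saves one line fill on a cold run.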


Tesla DeLorean · ‎2014-01-11

Posted on January 12, 2014 at 01:11

I've seen coverage in a couple of places, this seems to be one of the better ones from the F4 Seminars. See pg 33

https://drive.google.com/file/d/0B7OY5pub_GfILVJERmZ6TktEQkE/edit?usp=sharing


chen · ‎2014-01-13

Posted on January 13, 2014 at 11:41

Hi

''Also I found this linker script from TrueSTUDIO:

https://github.com/jeremyherbert/stm32-templates/blob/master/stm32f4-discovery/stm32_flash.ld

It aligns the .text section to 4 bytes. Is that any less valid than the way ChibiOS aligns the instructions?''

OK Atollic TrueStudio uses gcc as the compiler/linker.

Yes, it aligns to 4-byte boundaries (due to the 32-bit ARM architecture).

''Is that any less valid than the way ChibiOS aligns the instructions?''

No, both ways are OK but in fact from what Clive1 said about the Flash and cache - it sounds like ChibiOS is trying to be clever and optimise Flash access for the STM architecture.

What it is doing is forcing a 128-bit read (hence taking advantage of the Flash width and cache): every time it reads an instruction, a number of instructions (between 4 and 8) are all read into the cache.

Note: this does not guarantee ChibiOS runs faster than other RTOSes, but it helps.

swinchen · ‎2014-01-13

Posted on January 14, 2014 at 02:26

Thanks all, it is starting to come together. In the example below, which is an unoptimized function that adds two floats, I highlighted the first 128 bits of the function (addresses 0x8000010 through 0x800001f). Any idea how the ART handles only having half of an instruction stored in the cache? It is hard to tell from the presentation, but perhaps it reads ahead and does not wait until a cache miss?

08000010 <add>:
 8000010: b480       push {r7}
 8000012: b083       sub sp, #12
 8000014: af00       add r7, sp, #0
 8000016: ed87 0a01  vstr s0, [r7, #4]
 800001a: edc7 0a00  vstr s1, [r7]
 800001e: ed97 7a01  vldr s14, [r7, #4]
 8000022: edd7 7a00  vldr s15, [r7]
 8000026: ee77 7a27  vadd.f32 s15, s14, s15
 800002a: eeb0 0a67  vmov.f32 s0, s15
 800002e: 370c       adds r7, #12
 8000030: 46bd       mov sp, r7
 8000032: f85d 7b04  ldr.w r7, [sp], #4
 8000036: 4770       bx lr

This example brings up one more question (I can probably find it in the architecture reference manual): How does the core decode the combination of 16- and 32-bit instructions? The I-code bus is 32 bits, so does the core load two 16-bit instructions at once? What happens if it tries to load a 16-bit instruction and half of a 32-bit instruction?
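On the decoding side, one thing that can be stated with some confidence is how the length of a Thumb-2 instruction is determined: if the top five bits of the first halfword are 0b11101, 0b11110 or 0b11111, a second halfword follows; otherwise the instruction is 16 bits. The C sketch below applies that rule to the halfwords from the listing above and reports the first 32-bit instruction that straddles a 128-bit line. It says nothing about how the core or the ART actually buffers fetches; the function and array names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if hw1 is the first halfword of a 32-bit Thumb-2 instruction. */
bool thumb_is_32bit(uint16_t hw1)
{
    uint16_t top5 = (uint16_t)(hw1 >> 11);
    return top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu;
}

/* Walk a stream of halfwords starting at `base` and return the address of
   the first 32-bit instruction whose two halves fall in different 16-byte
   lines, or 0 if there is none. */
uint32_t first_straddler(uint32_t base, const uint16_t *hw, size_t n)
{
    assert(hw != NULL);
    size_t i = 0;
    while (i < n) {
        uint32_t addr = base + (uint32_t)(2u * i);
        bool wide = thumb_is_32bit(hw[i]);
        if (wide && (addr % 16u) == 14u) {
            return addr;  /* first half ends the line, second starts the next */
        }
        i += wide ? 2u : 1u;
    }
    return 0u;
}

/* Halfwords of add(), copied from the disassembly listing. */
const uint16_t add_code[] = {
    0xb480, 0xb083, 0xaf00, 0xed87, 0x0a01, 0xedc7, 0x0a00,
    0xed97, 0x7a01, 0xedd7, 0x7a00, 0xee77, 0x7a27, 0xeeb0,
    0x0a67, 0x370c, 0x46bd, 0xf85d, 0x7b04, 0x4770
};
```

Run against `add_code` with a base of 0x08000010, this picks out the `vldr` at 0x0800001e, whose first halfword is the last one in the first line.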

I will start looking through the reference manual.

Thanks for the help, the ART makes a lot more sense to me now! This is much more complex than the AVRs and PICs I am used to!