ARM32F Unaligned Data Support

pic_micro · ‎2014-11-25

Posted on November 25, 2014 at 18:55

Dear All,

As we can see in the attached picture, ARM supports both aligned and unaligned data access.

Please advise what mechanism ARM uses to identify two, three or four operands at a single memory address.

Tesla DeLorean · ‎2014-11-25

Posted on November 25, 2014 at 18:59

> Please advise what mechanism ARM uses to identify two, three or four operands at a single memory address.

A different memory address?

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

stm322399 · ‎2014-11-25

Posted on November 25, 2014 at 19:09

This slide is garbage. What dumb processor is not able to read a byte from a location that is not word-aligned?

The teacher mixes up unaligned reads (reading a word from a location that is not word-aligned) with variable allocation.

Tesla DeLorean · ‎2014-11-25

Posted on November 25, 2014 at 19:11

A 32-bit long takes 4 bytes. If the first has an aligned address of 0x20001000, the second is at 0x20001004.
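The same arithmetic can be checked in C. This is only a small sketch: `element_address` and `u32_stride` are made-up helper names, and the base address is just the example value from above.

```c
#include <stddef.h>
#include <stdint.h>

/* Two consecutive 32-bit values; array elements are always adjacent. */
static uint32_t values[2];

/* Hypothetical helper: distance in bytes between the two elements. */
size_t u32_stride(void)
{
    return (size_t)((uintptr_t)&values[1] - (uintptr_t)&values[0]);
}

/* Hypothetical helper: address of the index-th 32-bit value after base. */
uint32_t element_address(uint32_t base, uint32_t index)
{
    return base + index * (uint32_t)sizeof(uint32_t);
}
```

With a base of 0x20001000, index 1 gives 0x20001004, so each value has its own address and nothing shares a location.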

carl2399 · ‎2014-11-25

Posted on November 25, 2014 at 22:21

This slide doesn't capture the essence of what is meant by unaligned data support in the CM3 (or at least STM32)!

A long word (32 bits) can be placed at any byte offset within a word, i.e. at an address where addr % 4 is 0, 1, 2 or 3. The only time that long words and words are required to be aligned to their respective boundaries is when using the DMA functions and when using the load/store multiple registers instructions (possibly also the push and pop instructions). The only downside of unaligned data is that it takes multiple memory accesses to read or write it, but for the most part the overhead is scarcely noticeable.
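For illustration, a portable way to read or write a 32-bit value at any byte address in C is to go through `memcpy`. This is a sketch, not something taken from the slide: the function names are made up, and whether the compiler turns it into a single unaligned load on a Cortex-M3 depends on the toolchain and its options.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from any byte address, aligned or not. */
uint32_t read_u32_any(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* no alignment requirement on p */
    return v;
}

/* Write a 32-bit value to any byte address, aligned or not. */
void write_u32_any(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Writing a value at `buf + 1` and reading it back from the same address returns the same value regardless of how `buf` itself is aligned.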

The support for unaligned data was actually the cause of a bug in an older version of the GCC C++ library function for implicit memcpy, as it was using the load/store multiple registers instructions to "optimise" the memcpy - which didn't work at all well when the items being copied were not aligned.
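The kind of guard that avoids this problem can be sketched as follows. This is not the GCC library code, which is not shown in this thread; `pick_copy_strategy` and the enum are hypothetical names used only to show the alignment test that has to come before any word-wise (load/store multiple) copy.

```c
#include <stddef.h>
#include <stdint.h>

enum copy_strategy { COPY_BYTES, COPY_WORDS };

/* Choose word-wise copying only when both pointers are word aligned
 * and there is at least one whole word to copy; otherwise fall back
 * to a byte-by-byte copy, which has no alignment requirement. */
enum copy_strategy pick_copy_strategy(const void *dst, const void *src, size_t n)
{
    uintptr_t both = (uintptr_t)dst | (uintptr_t)src;

    if ((both & 3u) == 0 && n >= sizeof(uint32_t)) {
        return COPY_WORDS;
    }
    return COPY_BYTES;
}
```

OR-ing the two addresses tests both low-bit pairs at once: if either pointer is off a 4-byte boundary, the result has a nonzero bit in its bottom two positions.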

pic_micro · ‎2014-11-26

Posted on November 26, 2014 at 17:37

This makes the idea a bit clearer to me, thanks.

Is the same procedure used for Flash memory as well? I mean, since the PC (Program Counter) is incremented by 4.

Tesla DeLorean · ‎2014-11-26

Posted on November 26, 2014 at 17:50

Thumb instructions are mostly 16-bit, so PC += 2. The STM32 cannot run 32-bit ARM instructions.

waclawek.jan · ‎2014-11-26

Posted on November 26, 2014 at 18:14

> Is the same procedure used for Flash memory as well?

> I mean, since the PC (Program Counter) is incremented by 4.

Program fetch is independent from FLASH - FLASH can be read/written as data memory and program can run from outside FLASH.

Cortex-M runs exclusively in Thumb mode, i.e. the instruction unit is 16 bits wide (a halfword; there are one-halfword and two-halfword instructions) and instructions have to be aligned to a 16-bit boundary (i.e. instruction address LSB = 0).

As a matter of fact, instructions *are* fetched in 32-bit chunks through the I-bus, but 32-bit alignment is not required; however, if a two-halfword instruction is the target of a branch, there may be some speed penalty as two fetches may be needed. This is why some compilers with certain settings tend to align functions to 32 bits. For details, see the Prefetch Unit description in the Cortex-M3 Technical Reference Manual.
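To make the two points above concrete, here is a small sketch in C. The function names are made up for illustration. The length rule is the Thumb-2 encoding rule: a first halfword whose top five bits are 0b11101, 0b11110 or 0b11111 starts a two-halfword instruction. The second function only does the address arithmetic for 32-bit fetches; it does not model the actual prefetch unit.

```c
#include <stdint.h>

/* Size in bytes of the Thumb instruction whose first halfword is hw. */
unsigned thumb_insn_size(uint16_t hw)
{
    unsigned top5 = (unsigned)hw >> 11;   /* bits [15:11] */

    return (top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu) ? 4u : 2u;
}

/* Nonzero when an instruction of `size` bytes at `addr` spans two
 * aligned 32-bit words, i.e. it cannot arrive in a single word fetch. */
int spans_two_words(uint32_t addr, unsigned size)
{
    return (addr / 4u) != ((addr + size - 1u) / 4u);
}
```

For example, 0xBF00 (NOP) decodes as a 2-byte instruction and 0xF000 (the first half of a BL) as a 4-byte one. A 4-byte instruction at an address ending in 2 spans two words, while the same instruction at an address ending in 0 does not.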

JW

Tesla DeLorean · ‎2014-11-26

Posted on November 26, 2014 at 18:21

Then you have flash lines on the F2/F4 being 128-bit wide, with the ART barrel shifting cached data into the prefetch port.

waclawek.jan · ‎2014-11-26

Posted on November 26, 2014 at 22:43

> Then you have flash lines on the F2/F4 being 128-bit wide, with the ART barrel shifting cached data into the prefetch port.

Indeed.

So there might be an interesting speed penalty for rare (uncached) branches landing near the end of the line.
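As a rough illustration of "near the end of the line", the sketch below computes how many bytes of a 128-bit (16-byte) flash line remain from a branch target onward. The helper name and the constant are assumptions for this example only; it says nothing about how the ART actually schedules its fetches.

```c
#include <stdint.h>

/* Assumed flash line width: 128 bits = 16 bytes, as quoted above. */
#define FLASH_LINE_BYTES 16u

/* Hypothetical helper: bytes from `target` to the end of its flash line. */
unsigned bytes_left_in_flash_line(uint32_t target)
{
    return FLASH_LINE_BYTES - (target & (FLASH_LINE_BYTES - 1u));
}
```

A target at 0x0800000E leaves only 2 bytes in its line, so anything beyond a single 16-bit instruction already needs the next line, whereas a target at 0x08000000 has the full 16 bytes available.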

There also might be a benefit from rare (uncached) short jumps being replaced by long "ite" (if-then-else) instruction sequences - both by benefiting from the prefetches and from avoiding unnecessary cache fill. It's quite hard to estimate the relative cost of these. I wonder whether the costly commercial compilers account for these effects.

It's also a pity that the ST designers did not study the existing jumpcache designs - e.g. the 100MHz SiLabs '51 jumpcache has the possibility to lock cache lines - a minor (in transistor count) but potentially significant enhancement...

JW