Performance varies notably when the code alignment changes by 2 bytes?

KMaen · ‎2021-04-25

I am running Thumb-2 code on an STM32F750N8.

I am seeing a non-negligible variation in performance depending on whether I insert one NOP right before a tight loop, which shifts the address of every instruction in the loop by one halfword. No other part of the code is touched.

I am not sure why this is happening. Do Thumb-2 instructions (especially branches) run faster or slower depending on whether they are halfword- or word-aligned? Or is there some other explanation?

Below is my very simple code, which just loops and does nothing.

    0x80001d6: bf00 nop
    0x80001d8: 3c01 subs r4, #1
    0x80001da: d1fc bne.n 80001d6
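For anyone checking the listing by hand, here is a minimal sketch of how the branch target follows from the opcode, assuming the 16-bit Thumb conditional branch encoding (1101 cond imm8). The function name thumb_bcond_target is made up for illustration and is not from any toolchain.

```c
#include <stdint.h>

/* Target address of a 16-bit Thumb conditional branch.
 * Assumed encoding: bits [15:12] = 1101, [11:8] = cond, [7:0] = imm8.
 * The offset is sign-extended imm8 times 2, added to PC, and PC reads
 * as the address of the branch instruction plus 4.
 * thumb_bcond_target is a hypothetical helper name. */
uint32_t thumb_bcond_target(uint32_t insn_addr, uint16_t opcode)
{
    int32_t imm8 = (int32_t)(opcode & 0xFFu);
    if (imm8 & 0x80) {
        imm8 -= 0x100;              /* sign-extend the 8-bit field */
    }
    /* Unsigned wrap-around makes the negative offset subtract. */
    return insn_addr + 4u + (uint32_t)(imm8 * 2);
}
```

With the opcode d1fc at 0x80001da this gives 0x80001d6, so the nop is part of the loop body and is executed on every iteration.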

This very simple three-instruction loop takes about 100 ms for 10,000,000 iterations.

However, when I add a nop at the beginning (so that the addresses move by 2 bytes to 0x80001d8, 0x80001da, 0x80001dc), the execution time drops significantly, to 75 ms.
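To put the two timings on a per-iteration basis, here is a small sketch of the arithmetic. The core clock is not stated in this post, so core_hz is a parameter, and the 216 MHz mentioned afterwards is only an assumed example value. Both function names are made up for illustration.

```c
#include <stdint.h>

/* Average time per loop iteration in nanoseconds. */
double ns_per_iteration(double total_ms, uint64_t iterations)
{
    return total_ms * 1.0e6 / (double)iterations;
}

/* Average core cycles per loop iteration.
 * core_hz is an assumption supplied by the caller; the post does not
 * say what clock the part is running at. */
double cycles_per_iteration(double total_ms, uint64_t iterations,
                            double core_hz)
{
    return total_ms * core_hz / 1000.0 / (double)iterations;
}
```

For 10,000,000 iterations, 100 ms works out to 10 ns per iteration and 75 ms to 7.5 ns. If the core were clocked at 216 MHz (an assumption), that would be roughly 2.16 and 1.62 cycles per iteration, so the difference is about half a cycle per pass through the loop.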

I have tried disabling the I-cache and D-cache, and turning off the flash prefetcher and ART accelerator, but the effect was still there.

Is there any possible explanation for this? What I thought was:

1. Are instructions or branch targets that are only halfword-aligned (not word-aligned) slower?

2. Can this somehow be related to dual-issue?

3. Can this be because I am crossing some sort of a page/bank boundary?

4. Can this be vendor-specific, or is this something about the ARM architecture?

I have searched a lot, but have not seen any relevant info.

Any help will be appreciated.

Thank you,

waclawek.jan · ‎2021-04-25

ARM does not publish detailed instruction timing for the Cortex-M7. Generally, Cortex-M processors fetch instructions in words, so I would presume that when the branch target is not word-aligned, the internal logic effectively inserts a NOP in place of the first halfword of the first fetched word. But that is only speculation.

So, IMO, 1. branch, 2. no, 3. no, 4. no, Cortex-M specific.

JW

Ozone · ‎2021-04-26

In addition, Cortex-M instructions are mostly 16 bits wide, with occasional 32-bit instructions.

Fetches from flash are supposedly 128 bits wide (on ST MCUs), to hide the wait states of the slower flash.

The caches of the F7 add another layer of complexity.
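One way to reason about the fetch widths mentioned in the replies is to count how many aligned fetch windows the 6-byte loop body touches at each address. The sketch below is only a counting aid under assumed window sizes: 4 bytes for word fetches, 16 bytes for a 128-bit flash line, and 8 bytes as an extra hypothetical width that nobody in the thread has confirmed. The function name is made up for illustration, and the result says nothing definite about what the core actually does.

```c
#include <stdint.h>

/* Number of aligned fetch windows touched by code occupying
 * [start, start + length_bytes). window_bytes is a hypothetical fetch
 * width and must be non-zero; length_bytes must be non-zero. */
uint32_t fetch_windows_spanned(uint32_t start, uint32_t length_bytes,
                               uint32_t window_bytes)
{
    uint32_t first = start / window_bytes;
    uint32_t last  = (start + length_bytes - 1u) / window_bytes;
    return last - first + 1u;
}
```

For the loop at 0x80001d6 and the shifted loop at 0x80001d8, the count is 2 in both cases with 4-byte windows and 1 in both cases with 16-byte windows. Only the assumed 8-byte window separates them: the original loop straddles two windows, the shifted one fits in a single window. That would be consistent with the measurement, but it is not proof of the cause.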