
memset() execution slower at some addresses

DApo.1
Associate II

Hello,

After some investigation, I found that memset() behaves differently when executed from different places in flash. Data and instruction caches are off! The micro used is an STM32H743XI.

The function is called with the following arguments: memset(dummy, 0, 64)

Its execution time is ~5 us when the function is placed at:

..., 0x8040c34, 0x8040c54, 0x8040c74, ...

Its execution time is ~1 us when the function is placed at:

..., 0x8040c3c, 0x8040c44, 0x8040c4c, 0x8040c5c, 0x8040c64, 0x8040c6c, ...

Any ideas?

Thanks

13 REPLIES
DApo.1
Associate II

You caught me again :). I had to drop the timer frequency to 1 MHz to avoid overflow in the slow measurement for the 64000-byte case.

Here are the corrected measurements:

bytes (B)   slow (us)   fast (us)

64          5.11        0.99

640         45.4        5.22

6400        449         49

64000       4481        480

The good news is that instruction cache is solving the issue.
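For reference, on a Cortex-M7 part the instruction cache can be turned on early in main() with the standard CMSIS helper; the surrounding init code is assumed and omitted here:

```c
#include "stm32h7xx.h"   /* device header; pulls in core_cm7.h */

int main(void) {
    SCB_EnableICache();  /* CMSIS: invalidates, then enables the I-cache */
    /* SCB_EnableDCache() as well if wanted; mind DMA coherency then */
    /* ... rest of the application ... */
}
```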

pavlo1r
Associate II

There's nothing particularly special about the memset() function; similar behavior can be observed with any code. In my case, I was working with the STM32F767 microcontroller. I measured the runtime of a for-loop that writes to a variable in SRAM1 using a free-running timer. By systematically moving the code to different locations in flash memory, I discovered a distinct pattern: 11 addresses were fast, while 6 addresses were slow. The slowdown on the AXI bus (0x0800'0000) was nearly six times, and on the ITCM bus (0x0020'0000) it was three times. Throughout all tests, ICACHE, DCACHE, ART, and PREFETCH were disabled. Interestingly, even slight modifications to the code within the for-loop altered this pattern.

Pavel A.
Evangelist III

Interesting is that the addresses where execution is slower are +0x20 apart from each other

0x20 is exactly the FLASH "word" size: 256 bits = 32 bytes = 0x20, as Jan has mentioned.

What is more interesting to me... can interrupt latency be affected in the same way?

The good news is that instruction cache is solving the issue.

Indeed good news for tight looping code, but not for latency of (rarely occurring) interrupts?

As @pavlo1r tested on a 'F767, this is not specific to the bus matrix of 'H7.

Both the STM32F767 and STM32H743xI implement the Cortex-M7 core. The issue of code position-dependent performance is not necessarily tied to a specific bus matrix, but rather to the behavior of the Cortex-M7's prefetch unit. This unit can be partially configured—for example, the branch target address cache (BTAC) used for branch prediction can be disabled.

While the instruction cache (ICache) generally improves overall performance, it may not directly solve the issue. In fact, predictability can sometimes decrease when the ICache is enabled. I still believe that the same code—particularly tight loops—can exhibit different performance characteristics depending on its memory location, even with the ICache active. In any case, tight loops should be avoided when possible due to these sensitivity issues.

The offset of +0x20 also sounds like the size of the prefetch queue. Even if the code is already present in the ICache, a poor branch prediction can flush the queue, requiring it to be refilled from the ICache.

If an interrupt occurs, the prefetch queue is typically invalidated, and there's only a slim chance that the target code is already in the ICache. This leads to increased interrupt entry latency.