2024-07-19 12:15 PM
Hello All,
I needed a very fast vector-addition routine for the H7, so I wrote it in assembly. However, I discovered that very small changes in the code cause huge differences in execution time. Both of the following functions add a 16-bit vector to a 32-bit sum vector. The vectors are 32-bit word-aligned. The sum vector is in DTCM and the raw-data vector is in external SRAM. The first routine adds sequential data to the sum; the second adds every fourth point of a 4x larger vector to a same-size sum. So both process the same amount of data, but the first is 3 times faster. Does anyone know what would cause such a large difference in execution time for these nearly identical functions?
Thanks
Dan
The 3x faster one:
loop:
LDRH r3, [r0, r2, lsl #1] // load 16-bit raw data into r3
LDR r4, [r1, r2, lsl #2] // load 32-bit sum into r4
ADD r4, r3 // add raw data to sum
STR r4, [r1, r2, lsl #2] // store new sum
SUBS r2, #1 // next data point
BPL loop
The slower one:
loop:
LDRH r3, [r0, r2, lsl #1] // load 16-bit raw OTDR data into r3 (every 4th halfword, since r2 steps by 4)
LDR r4, [r1, r2] // load 32-bit sum into r4 (r2 used directly as the byte offset)
ADD r4, r3 // add raw data to sum
STR r4, [r1, r2] // store new sum
SUBS r2, #4 // next data point (index steps back by 4)
BPL loop
2024-07-19 12:33 PM
Where do you execute that loop from (which memory, what alignment)?
Cortex-M7 is not your friendly cycle-precise core, but an overcomplicated caching, mildly superscalar beast.
JW
2024-07-19 12:49 PM
The second reads 4x the total memory, which makes cache misses 4x more often. Memory gets loaded one cache line at a time; on the Cortex-M7 a line is 32 bytes. In the first loop, you have 16 of the 16-bit data points per cache line. In the second, the 8-byte stride leaves you only 4.
What you're seeing is that keeping data within cache makes a huge difference in terms of speed.
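A quick back-of-envelope, assuming the 32-byte line size (a host-side sketch; n = 1024 is just an illustrative count):

#include <stdio.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u /* Cortex-M7 D-cache line size */

/* Cache lines touched when reading n 16-bit points at a given
   element stride (contiguous span). */
static unsigned lines_touched(unsigned n, unsigned stride_elems)
{
    unsigned span_bytes = n * stride_elems * (unsigned)sizeof(uint16_t);
    return (span_bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

int main(void)
{
    unsigned n = 1024; /* points processed by both loops */
    printf("sequential: %u line fills\n", lines_touched(n, 1)); /* 64 */
    printf("stride 4:   %u line fills\n", lines_touched(n, 4)); /* 256, i.e. 4x */
    return 0;
}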
2024-07-19 01:05 PM
Hi,
just for info: did you write it in plain C, with the optimizer set to -Ofast?
How does that compare to your asm?
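I mean something like this, roughly (a sketch - the names, types, and element count are assumptions):

#include <stdint.h>

/* Sequential version: sum[i] += raw[i], counting down like the asm. */
void add_seq(const uint16_t *raw, uint32_t *sum, int n)
{
    for (int i = n - 1; i >= 0; i--)
        sum[i] += raw[i];
}

/* Strided version: every 4th point of a 4x larger raw vector
   (raw must hold at least 4*n elements). */
void add_stride4(const uint16_t *raw, uint32_t *sum, int n)
{
    for (int i = n - 1; i >= 0; i--)
        sum[i] += raw[4 * i];
}

With -Ofast the compiler may well vectorize the sequential one, so it would be interesting to see the numbers against your asm.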
2024-07-19 01:38 PM
I am using ICache, but I'm not sure about the instruction alignment.
Dan
2024-07-19 03:17 PM
The larger vector is in external SRAM and DCache is enabled, so your answer makes sense. But the SRAM runs at the CPU clock with 1 wait state, and the read is only one of the six instructions in the loop, so I don't see how it alone would make the loop three times slower overall.
Thanks,
Dan
2024-07-19 03:18 PM
I will give this a try with the slower code and see if the compiler knows some magic.
Dan
2024-07-19 08:20 PM
My five cents: DCache involved vs. not (data in DTCM). Maybe ICache as well (the first iteration of the loop is slower).
Put the code in ITCM and the data in DTCM - this should give the fastest (and most predictable) speed.
If you need the data in SRAM, consider the penalty for a cache miss: the core reads an entire cache line (32 bytes) to satisfy it.
The speed also depends on how DCache is configured for the SRAM region (write-through vs. write-back, write-allocate or not).
Rules of thumb: with caches enabled, the first pass is slow (for both DCache and ICache). Depending on the size of the data (the address range touched), DCache can cost you through cache misses. On writes through DCache, the cache policy (especially write-through) or the need to evict cache lines to make space can make it (randomly) slower.
I think you are observing the effect of the caches (fairly unpredictable, and certainly slower on the first iteration).
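A sketch of what I mean (the section names are assumptions - they depend on your linker script; the cache calls are standard CMSIS):

#include "stm32h7xx.h" /* pulls in core_cm7.h and the CMSIS cache helpers */

/* Hot routine placed in ITCM - ".itcm_text" is a placeholder name;
   use whatever section your linker script actually maps to ITCM. */
__attribute__((section(".itcm_text")))
void vec_add(const uint16_t *raw, uint32_t *sum, int n)
{
    for (int i = n - 1; i >= 0; i--)
        sum[i] += raw[i];
}

/* Sum vector placed in DTCM - again, the section name is a placeholder. */
__attribute__((section(".dtcm_data")))
uint32_t sum_vec[1024];

void cache_setup(void)
{
    SCB_EnableICache();
    SCB_EnableDCache();
    /* If DMA or another bus master writes the raw data in SRAM,
       invalidate that range before reading it through the cache, e.g.:
       SCB_InvalidateDCache_by_Addr((void *)raw_buf, raw_size_bytes); */
}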
2024-07-22 05:58 PM
I will time both solutions with DCache disabled and post the results.
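Roughly like this (a sketch using the DWT cycle counter; vec_add is a hypothetical wrapper around either asm routine):

#include "stm32h7xx.h"

extern void vec_add(const uint16_t *raw, uint32_t *sum, int n); /* hypothetical */

static void cycle_counter_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; /* enable trace/DWT */
    /* DWT->LAR = 0xC5ACCE55; */ /* some M7 parts need this unlock first */
    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk; /* start the cycle counter */
}

static uint32_t time_it(const uint16_t *raw, uint32_t *sum, int n)
{
    uint32_t start = DWT->CYCCNT;
    vec_add(raw, sum, n);
    return DWT->CYCCNT - start; /* elapsed core cycles */
}

void compare_dcache(const uint16_t *raw, uint32_t *sum, int n)
{
    cycle_counter_init();
    uint32_t with_dcache = time_it(raw, sum, n);
    SCB_DisableDCache(); /* CMSIS: cleans, invalidates, then disables */
    uint32_t without_dcache = time_it(raw, sum, n);
    (void)with_dcache; (void)without_dcache; /* inspect in the debugger */
}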
Dan
2024-07-23 03:55 AM
TDK and tjaekel are correct - the entire issue is DCache. When I disabled DCache, both functions ran at the same speed. The slower function actually ran faster with DCache off than with it on! Each miss forces a full 32-byte line fill, so reloading the cache delays bus access for non-sequential reads.
Dan