Weird cycle behavior when running code from SRAM. 4 cycles per instruction instead of 1. What could be the cause?

LukasG · ‎2022-09-08

Board: Nucleo-144

Chip: STM32L4R5ZIT6P

Hey all!

I have a curious problem when executing code from SRAM.

My firmware lies in flash and occasionally jumps to an externally loaded binary in SRAM. As the cycle count of this binary is quite crucial, I used the DWT cycle counter to measure it. While stepping through the code I realized that a simple instruction (like nop or add; load/store ops take more) takes exactly 4 cycles when executed from SRAM, not 1 as you would expect. I find this behavior quite weird and I can't really explain why execution takes longer in SRAM.
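For readers who want to reproduce the measurement, here is a minimal sketch of DWT cycle counting. It is not the code used in this post. The register addresses and bit positions are the architectural ARMv7-M ones; the function names are made up for illustration, and the registers are passed in as pointers only so the logic can also be exercised off-target.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural ARMv7-M debug registers (not STM32-specific). */
#define DEMCR_ADDR          0xE000EDFCu  /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR       0xE0001000u  /* DWT control register */
#define DWT_CYCCNT_ADDR     0xE0001004u  /* DWT cycle counter */
#define DEMCR_TRCENA        (1u << 24)   /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)    /* starts the cycle counter */

/* Hypothetical helper: enable and reset the cycle counter. */
static inline void cyccnt_enable(volatile uint32_t *demcr,
                                 volatile uint32_t *dwt_ctrl,
                                 volatile uint32_t *dwt_cyccnt)
{
    *demcr |= DEMCR_TRCENA;
    *dwt_cyccnt = 0u;
    *dwt_ctrl |= DWT_CTRL_CYCCNTENA;
}

/* Unsigned subtraction stays correct across one counter wraparound. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

typedef void (*code_fn)(void);

/* Hypothetical helper: time one call of `code`. The result includes the
 * call/return and the two counter reads, so measure an empty function
 * the same way and subtract that overhead. */
static inline uint32_t measure_cycles(volatile const uint32_t *dwt_cyccnt,
                                      code_fn code)
{
    assert(code != 0);
    uint32_t start = *dwt_cyccnt;
    code();
    uint32_t end = *dwt_cyccnt;
    return cycles_elapsed(start, end);
}
```

On the target, the pointers would be the fixed addresses, e.g. `cyccnt_enable((volatile uint32_t *)DEMCR_ADDR, (volatile uint32_t *)DWT_CTRL_ADDR, (volatile uint32_t *)DWT_CYCCNT_ADDR);`. Reading the counter in free-running code rather than while single-stepping avoids the debugger's influence on the numbers.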

More details:

I created a section in SRAM1 (from 0x20000000 to 0x20030000) to hold the dynamic binary. This should be fine, as SRAM1 is connected to both the I-bus and the D-bus. The only thing that rang alarm bells was that the reference manual (RM0432, revision 9) states on p. 107 that physical remap should be enabled for maximum performance.

I didn't want to do this, so I moved the section to SRAM2, where (according to the reference manual, same page) execution can be performed with maximum performance. However, this didn't solve my problem.

Furthermore, I thought it could have something to do with my clock configuration, but I couldn't find any problems there. The MCU is running at 1 MHz using the MSI oscillator, configured by CubeMX.

I am running out of ideas on what to try and what causes this inconsistency. Cycle-behavior in flash is how you'd expect it.

Does anybody have any idea on what might cause this sort of cycle behavior?

waclawek.jan · 2022-09-08

- "Stepping through" is a bad indicator of performance: debugging is intrusive in complex and undocumented ways, and single-stepping interferes with real-time features like pipelines, write buffers, etc.
- Reads through the S-port (i.e. above 0x2000'0000) in the CM4 are inherently penalized by one extra cycle.
- There will be collisions on the S-port with data accesses from the processor (constants, variables, data, stack). Both this and the previous item are the reason for the remap; read the STM32F4xx buses and bus matrix portion of Technical Update 1.
- There may be collisions with other accesses to the same memory, either from the processor (data, stack) or from other bus masters.
- There may be some penalty because of arbitration on the bus matrix (this is simply undocumented).
- The RM is not the best source of precise information. In fact, there's no such source; exact timing in the 32-bitters is generally dealt with by handwaving.

JW

LukasG · ‎2022-09-08

Hey @Community member,

Thanks a lot for your quick response and for your answer! I thought the same thing about debugging. That's why I also executed binaries that returned their own cycle count. Sadly, it didn't change anything about this cycle behavior.

The reads through the S-port might explain it. Is there anywhere I can read up on this penalty? This would at least explain 1 of the 3 extra cycles. I also tested running the binaries exclusively from SRAM3, where only the S-bus is available, which interestingly changed the behavior from 4 to 9 cycles per instruction.

I am not sure if your other 2 points apply, as I am consistently getting the same cycle count for various binaries of same length (I tested a couple of thousand binaries with this setup - fixed set of instructions btw.) as I need a fixed length of cycles. Stack is in a different SRAM.

Good to know that the RM isn't that precise!

(EDIT) your edit about the collisions on the S-port most likely explains it. I will check the technical update.

LukasG · ‎2022-09-08

Your two points on the S-port seem to explain my trouble @Community member. Thanks a lot! I guess there is not much I can do about this penalty. :) I suppose I could try the physical remap of SRAM1, as this seemed to increase performance quite a bit.

waclawek.jan · ‎2022-09-08

> this seemed to increase performance quite a bit

Yes, but note that the best solution meticulously splits instruction fetches to I port, program constants (literals) to D port (i.e. program runs from the 0-2000'0000 area) and data to S port (i.e. they are still mapped to above 2000'0000).

Also, benchmarks are... well... always different from the real world.

JW

LukasG · ‎2022-09-08

Awesome! Thanks for taking the time and giving such a thorough answer. I will test this later today or at the latest tomorrow - I will send an update with my findings!

waclawek.jan · ‎2022-09-08

I wouldn't expect differences between SRAM2 and SRAM3...

What the RM says about SRAM2 and performance is misleading. They meant to say that SRAM2 is permanently aliased at both the I/D bus and the S bus, at two different address ranges. ST is not famous for meticulous wording in their manuals.
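To make the aliasing concrete, here is a small sketch of translating an S-bus address in SRAM2 to its I/D-bus alias before jumping to it. The constants are assumptions based on the STM32L4R5 memory map as I recall it (SRAM2 at 0x2003'0000 on the S-bus, 64 KB, aliased at 0x1000'0000); only the end of SRAM1 at 0x20030000 is stated in this thread, so check the rest against RM0432. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED addresses, verify against RM0432 for your part. */
#define SRAM2_SBUS_BASE  0x20030000u  /* SRAM2 as seen through the S-bus */
#define SRAM2_CODE_BASE  0x10000000u  /* same memory through the I/D buses */
#define SRAM2_SIZE       0x00010000u  /* 64 KB */

/* True if addr lies inside the S-bus view of SRAM2. */
static inline int in_sram2_sbus(uint32_t addr)
{
    return addr >= SRAM2_SBUS_BASE && (addr - SRAM2_SBUS_BASE) < SRAM2_SIZE;
}

/* Translate an S-bus SRAM2 address to the I/D-bus alias of the same byte. */
static inline uint32_t sram2_code_alias(uint32_t sbus_addr)
{
    assert(in_sram2_sbus(sbus_addr));
    return sbus_addr - SRAM2_SBUS_BASE + SRAM2_CODE_BASE;
}

/* Cortex-M only executes Thumb code, so a branch target needs bit 0 set. */
static inline uint32_t thumb_entry(uint32_t code_addr)
{
    return code_addr | 1u;
}
```

On the target, a binary copied to `load_addr` in SRAM2 could then be entered with something like `((void (*)(void))(uintptr_t)thumb_entry(sram2_code_alias(load_addr)))();`. This only works if the loaded binary is position-independent or was linked for the alias address.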

Also, maybe try to start benchmarking with something brutally simple, like a linear bunch of NOPs; then maybe try to find some 32-bit "NOP-like" instruction to see difference imposed by pipelining and the 16/32-bit paradigm; and then try to add loops with loop variable strictly in registers (you can try to write a bit of inline asm for this, it's not that complicated as long as you don't want anything sophisticated from it). That would perhaps more fairly indicate the "raw" performance of given memory, without any potential conflicts with data accesses.
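A rough sketch of the first step suggested above, a linear run of NOPs timed with the cycle counter, might look like the following. The `.ramfunc` section name is an assumption and must match whatever SRAM section your linker script defines; the `USE_WIDE_NOP` switch and the function names are likewise made up for illustration. The counter is passed as a pointer (on the target, the DWT cycle counter at 0xE0001004) so the arithmetic can be checked off-target.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit NOP by default; on Thumb-2 targets, define USE_WIDE_NOP to get
 * the 32-bit encoding and compare the two. */
#if defined(__thumb2__) && defined(USE_WIDE_NOP)
#define NOP1  __asm__ volatile ("nop.w")
#else
#define NOP1  __asm__ volatile ("nop")
#endif
#define NOP4   NOP1; NOP1; NOP1; NOP1
#define NOP16  NOP4; NOP4; NOP4; NOP4
#define NOP64  NOP16; NOP16; NOP16; NOP16
#define NOP_COUNT 64u

/* ASSUMPTION: ".ramfunc" is a section your linker script places in SRAM. */
#if defined(__arm__)
#define RAM_CODE __attribute__((section(".ramfunc"), noinline))
#else
#define RAM_CODE
#endif

/* Straight-line code with no data accesses besides the return. */
static RAM_CODE void nop_block(void)
{
    NOP64;
}

/* Time one call of nop_block(); includes call/return overhead. */
static inline uint32_t time_nop_block(volatile const uint32_t *cyccnt)
{
    assert(cyccnt != 0);
    uint32_t start = *cyccnt;
    nop_block();
    uint32_t end = *cyccnt;
    return end - start;  /* wraparound-safe unsigned difference */
}

/* Cycles per NOP, scaled by 100 to avoid floating point. */
static inline uint32_t cycles_per_nop_x100(uint32_t total, uint32_t overhead,
                                           uint32_t count)
{
    assert(count > 0u && total >= overhead);
    return (total - overhead) * 100u / count;
}
```

For example, a total of 260 cycles with 4 cycles of overhead over 64 NOPs gives `cycles_per_nop_x100(260, 4, 64) == 400`, i.e. 4 cycles per instruction. Note that ARM documents that the Cortex-M4 may remove a NOP from the pipeline before it reaches the execute stage, so it is worth cross-checking with another simple register-only instruction.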

JW