STM32H503 code execution performance issue

ArkadiuszRaj · ‎2025-05-27

Hello

For a specific use case I am evaluating a Cortex-M33 (CM33) with HCLK at 250 MHz for maximum code execution performance.

To do so, I have written a routine in assembler and calculated its cycle count using the ARM reference manual. The code only converts data from one buffer into another.

Then I placed this code in SRAM1, enabled the I-Cache, configured RCC, and tried to measure the execution time.

Let's say the cycle count calculated by hand is about 3000.

The rest of the project is based on code generated by CubeMX, into which I inject my bare-metal code.

1) Using DWT->CYCCNT, I see no difference between execution from FLASH and from SRAM1: the result is about 7900 cycles.
2) Using TIM6, started before my function and stopped after it, so that TIM6->CNT holds the time in 4 ns units, i.e. one count per HCLK cycle. Running from SRAM gives the same result.
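For anyone reproducing the two measurements above, here is a minimal sketch of the cycle-counter approach. It is not the code from the original post. The helper names and the MEASURE_ON_TARGET switch are made up for illustration. The register addresses are the architectural DEMCR and DWT locations, written out so the sketch does not depend on a particular CMSIS header.

```c
#include <stdint.h>

/* Elapsed cycles between two 32-bit counter snapshots.
 * Unsigned subtraction stays correct across a single counter wrap. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Convert a cycle count to nanoseconds for a core clock given in Hz. */
static inline uint64_t cycles_to_ns(uint32_t cycles, uint32_t hclk_hz)
{
    return ((uint64_t)cycles * 1000000000ull) / hclk_hz;
}

#ifdef MEASURE_ON_TARGET  /* hypothetical switch: define only in the firmware build */
#define DEMCR       (*(volatile uint32_t *)0xE000EDFCu)  /* Debug Exception and Monitor Control */
#define DWT_CTRL    (*(volatile uint32_t *)0xE0001000u)  /* DWT control register */
#define DWT_CYCCNT  (*(volatile uint32_t *)0xE0001004u)  /* DWT cycle counter */

/* Enable trace (TRCENA, bit 24), clear the counter, start it (CYCCNTENA, bit 0). */
static inline void cyccnt_start(void)
{
    DEMCR     |= (1u << 24);
    DWT_CYCCNT = 0u;
    DWT_CTRL  |= 1u;
}

/* Run fn once and return the number of core cycles it took,
 * including the call and return overhead. */
static inline uint32_t measure_cycles(void (*fn)(void))
{
    uint32_t start = DWT_CYCCNT;
    fn();
    return cycles_elapsed(start, DWT_CYCCNT);
}
#endif
```

With the numbers from this post, cycles_to_ns(7900, 250000000) gives 31600 ns, while the hand-calculated 3000 cycles would be 12000 ns.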

Example of the "issue":

LDR r0, =label

During step debugging, TIM6->CNT is incremented by 6 units for this instruction.

Same for

TST r0, #1

TIM6->CNT changes by 6 units, thus 6 cycles: 6x slower than it should execute.
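The arithmetic behind these figures can be sketched as follows. The function names are hypothetical. The sketch assumes the timer clock equals HCLK with no prescaler, which is what a 4 ns tick at 250 MHz implies. One likely caveat: a single-stepped instruction probably also pays for leaving and re-entering debug state, so a per-step count is better read as an upper bound than as the instruction's pipeline cost.

```c
#include <stdint.h>

/* Convert timer ticks to core cycles; both clocks are given in Hz. */
static inline uint64_t ticks_to_core_cycles(uint32_t ticks,
                                            uint32_t timer_hz,
                                            uint32_t hclk_hz)
{
    return ((uint64_t)ticks * hclk_hz) / timer_hz;
}

/* Ratio measured/expected, scaled by 1000 so it stays in integer math. */
static inline uint32_t slowdown_x1000(uint32_t measured, uint32_t expected)
{
    return (uint32_t)(((uint64_t)measured * 1000u) / expected);
}
```

For the whole routine, slowdown_x1000(7900, 3000) returns 2633, i.e. about 2.6x the hand-calculated count. The 6x figure applies only to the single-stepped instructions.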

My question is:

Is the STM32H5 value line designed so that the peripherals can run at 250 MHz HCLK but the CPU core cannot?

Or am I not initializing the CPU properly, so that at 250 MHz HCLK I get code performance similar to a G0 at 48 MHz?

I feel I am missing something…

AScha.3 · ‎2025-06-02

>Also does this matter if all my test code is written in pure assembler?

Do you know how these CPUs work (and what the optimizer does)?

So maybe: the fastest code is generated by a compiler with the optimizer enabled, not by writing asm by hand.

Just try it! Set the optimizer to -O2 or so...

btw

The H5 has an M33 core, so read:

https://developer.arm.com/documentation#numberOfResults=48&q=cortex%20m33%20technical%20reference%20manual

https://developer.arm.com/documentation/100230/0100/Introduction/About-the-processor-architecture?lang=en

-> So just think: the H563 description says 250 MHz, 375 MIPS...? Obviously it can execute more than 1 instruction per cycle, and that is what the optimizer does: it arranges the code so that this can happen.

Only if you have the same "skills" will your asm reach the same speed.
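The ratio behind that remark can be sketched as below, using only the two figures quoted above. The function name is hypothetical. The thread does not say whether the 375 figure counts real instructions or is a benchmark-normalized number, so the result is just throughput per MHz as advertised.

```c
#include <stdint.h>

/* Advertised throughput per MHz, scaled by 100 to stay in integer math. */
static inline uint32_t mips_per_mhz_x100(uint32_t mips, uint32_t mhz)
{
    return (mips * 100u) / mhz;
}
```

mips_per_mhz_x100(375, 250) returns 150, i.e. 1.5 per MHz.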
