STM32U545 Execution Speed Slower Than Expected

xiangpu2001 · ‎2025-03-06

Software Environment:

Windows platform
Configured using STM32CubeMX
Compiled with STM32CubeIDE

Hardware Platform:

NUCLEO-U545RE-Q
MCU: STM32U545 (Cortex-M33, ARMv8-M)

Test Objective:
We initially noticed slow SPI performance (CS low to CS high) and decided to test the MCU performance by measuring the simplest instruction sequence that toggles a GPIO pin. We further simplified HAL_GPIO_WritePin() to only write to the registers back to back as shown below:

And the Assembly equivalent:

Issue Description:

Since there are only 3 assembly instructions for each SET / RESET in the above code, we expected the execution time to be about 6 instruction cycles. However, when we measure the actual toggle time, we see about 10 instruction cycles for the resulting pulse period. For example, with SYSCLK and the AHB2 clock running at 16 MHz, we see a period of 680 ns, which represents about 11 instruction cycles. (We are assuming that each assembly instruction takes 1 instruction cycle to execute.)
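
As a quick sanity check on that conversion, here is a minimal sketch in C. The helper name period_ns_to_cycles is made up for illustration and is not part of the HAL or the original test code; it only turns a scope reading into a rounded cycle count, assuming the core clock is exactly the configured value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the HAL or the original post):
 * convert a period measured on a scope, in nanoseconds, into core
 * clock cycles, rounded to the nearest whole cycle.
 * Assumes clock_hz is the exact core clock frequency. */
static uint32_t period_ns_to_cycles(uint32_t period_ns, uint32_t clock_hz)
{
    assert(clock_hz > 0u);

    /* cycles = period_ns * clock_hz / 1e9, using a 64-bit intermediate
     * so the product cannot overflow. Adding 0.5e9 before the divide
     * rounds to nearest instead of truncating. */
    uint64_t scaled = (uint64_t)period_ns * clock_hz;
    return (uint32_t)((scaled + 500000000u) / 1000000000u);
}
```

With the numbers above, 680 ns at 16 MHz is 10.88 cycles, which rounds to 11.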

Details of test:

Clock Configuration:

HCLK: 16 MHz
AHB2 / GPIO Clock: 16 MHz

Test Method:
After initializing the system, we toggle a GPIO pin in code and measure the switching speed.
For this test the GPIO speed was set to SLOW, FAST and VERY FAST. The output waveform did not change appreciably. Also, the GPIO clock was enabled using the default init call __HAL_RCC_GPIOA_CLK_ENABLE();
For this test, no other peripheral or interrupts have been enabled.

Questions:

Is there a better method to test the MCU’s performance than toggling a GPIO pin?
Is there an explanation for why the GPIO toggling takes so much longer to execute?
Are there any additional configurations (other than default config that was used for this example) needed to achieve optimal MCU speed?

Any insights or suggestions would be greatly appreciated!

TDK · ‎2025-03-06

> Since there are only 3 assembly instructions for each SET / RESET in the above code, we expected the execution time to be about 6 instruction cycles.

The concept of 1 cycle per instruction (or even a fixed X cycles per instruction) is not something guaranteed on the M33 or other advanced cores. This is the price you pay for advanced performance.

> Is there a better method to test the MCU’s performance than toggling GPIO pin?

Yes, measure performance over something that matters. For example, whatever your program does. Writing to a SD card, performing ADC postprocessing, drawing to a screen, etc. Toggling a pin at the fastest possible speed using the CPU is not a useful thing to do. If you need a PWM, there are timers that can produce it.

> Is there an explanation for why the GPIO toggling takes so much longer to execute?

Writing to GPIO registers usually involves a bus access which can slow things down. And instructions are not in general 1 cycle each. You can look at the cycle counts per instruction for the M4 core to get an idea of what takes longer.

> Are there any additional configurations (other than default config that was used for this example) needed to achieve optimal MCU speed?

Enable compiler optimizations, use cache when available, use fast RAM such as DTCMRAM, put code which is run frequently into ITCMRAM.

If you feel a post has answered your question, please click "Accept as Solution".

xiangpu2001 · ‎2025-03-06

Thanks TDK,

> Since there are only 3 assembly instructions for each SET / RESET in the above code, we expected the execution time to be about 6 instruction cycles.

> The concept of 1 cycle per instruction (or even a fixed X cycles per instruction) is not something guaranteed on the M33 or other advanced cores. This is the price you pay for advanced performance.

The number of CPU clock cycles required to execute a specific ARM instruction is not fixed and depends on the CPU architecture. However, once the CPU architecture is fixed, the execution cycle of a given instruction also becomes fixed. For example, the instruction LDR R0, [R1, #0x4] running on a Cortex-M33 processor will always take a consistent number of clock cycles—it won’t vary between 1 cycle at one moment and 5 cycles at another. Is that correct?

> Is there an explanation for why the GPIO toggling takes so much longer to execute?

> Writing to GPIO registers usually involves a bus access which can slow things down. And instructions are not in general 1 cycle each. You can look at the cycle counts per instruction for the M4 core to get an idea of what takes longer.

I have enabled the DWT feature on the STM32U545, allowing me to monitor the CPU clock cycles required for each instruction. However, I am confused about the execution time of the basic ARM instruction MOV R0, #0x1. I observed that it takes 6 CPU clock cycles. Is this expected behavior, or could there be an issue with my code configuration?

> Are there any additional configurations (other than default config that was used for this example) needed to achieve optimal MCU speed?

> Enable compiler optimizations, use cache when available, use fast RAM such as DTCMRAM, put code which is run frequently into ITCMRAM.

I have enabled the ICache, while the DCache remains disabled since it primarily affects external memory. However, I couldn't find any information about fast RAM, such as DTCM RAM or ITCM RAM, in the STM32U545 datasheet. Could you provide more details on this?

Thank you very much.

TDK · ‎2025-03-06

> I have enabled the DWT feature on the STM32U545, allowing me to monitor the CPU clock cycles required for each instruction. However, I am confused about the execution time of the basic ARM instruction MOV R0, #0x1. I observed that it takes 6 CPU clock cycles. Is this expected behavior, or could there be an issue with my code configuration?

It will usually be the same provided the context around it is the same. But no, it will not always be the same. In particular, cache misses, bus latency and instruction pipelining can affect how long a particular instruction takes. Your example is simple, but if instead you do something that loads from a memory address, that instruction can be shifted around a bit and doesn't even necessarily execute in the same order as it appears in the code.

I can't speak to how specifically the DWT interacts with timing individual instructions. Don't know enough about how the DWT works.

> I have enabled the ICache, while the DCache remains disabled since it primarily affects external memory. However, I couldn't find any information about fast RAM, such as DTCM RAM or ITCM RAM, in the STM32U545 datasheet. Could you provide more details on this?

It looks like this chip doesn't have DTCMRAM or ITCMRAM, which are present on a lot of other chip families. Further, it looks like there is only one bus between the CPU and SRAM, so all RAM seems to be equal here.

Interesting that DCACHE1 doesn't affect SRAM on that chip. That is a departure from other families. Looks like SRAM access is usually 0 wait states.

This architecture is somewhat simpler than the M7 core that I typically use.


pavlo1r · ‎2025-04-15

Using the DWT (Data Watchpoint and Trace unit) is a solid approach for cycle counting. To estimate the total number of architecturally executed instructions, you can use the following formula:

ICNT = CNTCYCLES + CNTFOLD - (CNTLSU + CNTEXC + CNTSLEEP + CNTCPI)

This is explained in the Arm®v8-M Architecture Reference Manual.
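
A minimal sketch of that formula in C, for illustration only. The struct and function names are hypothetical, and the fields are plain values that the caller is assumed to have already captured from the DWT counters; the sketch does not read any hardware registers itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the counters named in the formula above.
 * The caller is assumed to have filled these in from the DWT unit. */
typedef struct {
    uint32_t cycles; /* CNTCYCLES */
    uint32_t fold;   /* CNTFOLD   */
    uint32_t lsu;    /* CNTLSU    */
    uint32_t exc;    /* CNTEXC    */
    uint32_t sleep;  /* CNTSLEEP  */
    uint32_t cpi;    /* CNTCPI    */
} dwt_counts_t;

/* ICNT = CNTCYCLES + CNTFOLD - (CNTLSU + CNTEXC + CNTSLEEP + CNTCPI) */
static uint32_t estimate_instruction_count(const dwt_counts_t *c)
{
    assert(c != NULL);
    return c->cycles + c->fold - (c->lsu + c->exc + c->sleep + c->cpi);
}
```

One caveat to check against the reference manual: the event counters other than the cycle counter are narrow (8 bits wide in the Arm DWT description I am aware of), so over a long measurement window they wrap and any overflow has to be accumulated before the values are fed into a function like this.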

Note: Accessing the DWT counters also incurs a small overhead due to the execution of read instructions and interaction with memory-mapped registers. Be sure to account for this cost if you're aiming for highly accurate measurements.
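
One common way to account for that cost is to time two back-to-back counter reads with nothing in between, then subtract that calibration value from every real measurement. The sketch below shows only the arithmetic; the function names are hypothetical, and the start and end values are assumed to be two reads of the 32-bit cycle counter taken by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles between two reads of a free-running 32-bit counter.
 * Unsigned subtraction is modulo 2^32, so the result stays correct
 * if the counter wrapped once between the two reads. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Subtract a previously calibrated read overhead (assumed to be the
 * elapsed count of two back-to-back reads with no code in between).
 * Clamps at zero so measurement noise cannot produce a huge value. */
static uint32_t net_cycles(uint32_t start, uint32_t end, uint32_t overhead)
{
    uint32_t gross = elapsed_cycles(start, end);
    return (gross > overhead) ? (gross - overhead) : 0u;
}
```

When a single instruction is bracketed by two reads, the uncorrected difference may be mostly this read overhead rather than the instruction itself, so it is worth measuring the empty case first.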

alfsch · ‎2025-04-15

> Speed Slower Than Expected?

No.

I also made a simple speed test, same as yours: set/reset pin + loop. On an H563 at 250 MHz with an M33 core, the same core as here.

I measured with a scope:

at the pin, a 4 ns pulse, 1 instruction;

set/reset pin + loop: 16 ns, probably these instruction cycles: 1 to set the pin, 1 to reset the pin, 2 for the while/jump;

so exactly what can be expected.

Just have the I-cache ON and the optimizer set to -O2 (or -Ofast, but you cannot get faster than 1 cycle ;) )

Without the optimizer ON, code tuned for the ARM CPU is not generated, just (nice to debug) one instruction after the other, with no optimized code arrangement to make maximum use of the fast instructions.

And better to have no debug unit action (DWT...) in the loop for a speed test, so just let it run and check with a scope on the pin.

And set the pin speed to high... it's really fast on the pins! Otherwise you cannot see the high-speed signal.

Try it again and tell us how it goes.