STM32U545 Execution Speed Slower Than Expected

xiangpu2001 · 2025-03-06

Software Environment:

Windows platform
Configured using STM32CubeMX
Compiled with STM32CubeIDE

Hardware Platform:

NUCLEO-U545RE-Q
MCU: STM32U545 (Cortex-M33, ARMv8-M)

Test Objective:
We initially noticed slow SPI performance (CS low to CS high) and decided to test the MCU's performance by toggling a GPIO pin with the simplest possible instruction sequence. We further simplified HAL_GPIO_WritePin() so that it only writes to the registers back to back, as shown below:

And the Assembly equivalent:

Issue Description:

Since there are only 3 assembly instructions for each SET / RESET in the above code, we expected the execution time to be about 6 instruction cycles. However, when we measure the actual toggle time, the resulting pulse period is about 10 to 11 instruction cycles. For example, with SYSCLK and the AHB2 clock running at 16 MHz (62.5 ns per cycle), we see a period of 680 ns, which is about 11 cycles. (We are assuming that each assembly instruction takes 1 cycle to execute.)
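
To make the arithmetic explicit, here is a minimal sketch in C that converts a measured time span into core clock cycles. The helper name `period_to_cycles` is made up for illustration and is not part of the HAL; the 16 MHz clock and the 680 ns period are the values quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a measured time span into clock cycles.
 * period_ns: measured duration in nanoseconds
 * clock_hz:  clock frequency in hertz (must be non-zero)
 * Hypothetical helper for illustration only. */
double period_to_cycles(double period_ns, uint32_t clock_hz)
{
    assert(clock_hz > 0u);
    /* cycles = seconds * hertz, with nanoseconds scaled to seconds */
    return period_ns * (double)clock_hz / 1e9;
}
```

Calling `period_to_cycles(680.0, 16000000u)` gives roughly 10.9, which matches the "about 11 cycles" figure and is well above the 6 cycles expected from counting instructions.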

Details of test:

Clock Configuration:

HCLK: 16 MHz
AHB2 / GPIO Clock: 16 MHz

Test Method:
After initializing the system, we toggle a GPIO pin in code and measure the switching speed.
For this test the GPIO speed was set to SLOW, FAST, and VERY FAST in turn; the output waveform did not change appreciably. The GPIO clock was enabled using the default init call __HAL_RCC_GPIOA_CLK_ENABLE();
For this test, no other peripheral or interrupts have been enabled.
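
As an illustration only (this is not the listing from the original post), back-to-back register writes of the kind described might look like the sketch below. It assumes the set goes through a BSRR-style register and the reset through a BRR-style register, and that the pin is pin 5; both are assumptions. `mock_gpio_t`, `TEST_PIN_MASK`, and `pulse_pin` are hypothetical names, and the struct is a host-side stand-in so the code compiles without the device header. On the target, the CMSIS device header supplies the real register block.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for the two GPIO registers used in this sketch.
 * The fields are volatile so the compiler keeps both stores in order. */
typedef struct {
    volatile uint32_t BSRR; /* writing 1 to bit n would set pin n   */
    volatile uint32_t BRR;  /* writing 1 to bit n would reset pin n */
} mock_gpio_t;

#define TEST_PIN_MASK (1u << 5) /* assumed pin number, for illustration */

/* One high pulse: a set write immediately followed by a reset write.
 * Each store typically needs the port address and the mask in registers
 * first, so a single C line can turn into several instructions. */
void pulse_pin(mock_gpio_t *port)
{
    assert(port != 0);
    port->BSRR = TEST_PIN_MASK; /* drive the pin high */
    port->BRR  = TEST_PIN_MASK; /* drive the pin low  */
}
```

On real hardware the stores go out over the bus to the GPIO peripheral, so the pulse width seen on a scope reflects bus timing as well as instruction count.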

Questions:

Is there a better method to test the MCU’s performance than toggling a GPIO pin?
Is there an explanation for why the GPIO toggling takes so much longer to execute?
Are there any additional configurations (other than default config that was used for this example) needed to achieve optimal MCU speed?

Any insights or suggestions would be greatly appreciated!

TDK · 2025-03-06

> Since there are only 3 assembly instructions for each SET / RESET in the above code, we expected the execution time to be about 6 instruction cycles.

The concept of 1 cycle per instruction (or even a fixed X cycles per instruction) is not something guaranteed on the M33 or other advanced cores. This is the price you pay for advanced performance.

> Is there a better method to test the MCU’s performance than toggling GPIO pin?

Yes, measure performance over something that matters, for example whatever your program actually does: writing to an SD card, performing ADC postprocessing, drawing to a screen, etc. Toggling a pin at the fastest possible speed using the CPU is not a useful thing to do. If you need a PWM, there are timers that can produce it.
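
As a rough sketch of the timer route: once a timer's prescaler and auto-reload values are set, the hardware produces the waveform without CPU involvement. The code below picks those two values for a target PWM frequency using the usual relation f_pwm = f_tim / ((PSC + 1) * (ARR + 1)). The struct, the function name, and the 16-bit counter width are assumptions made for illustration, not something taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t psc; /* prescaler value: timer clock is divided by psc + 1 */
    uint32_t arr; /* auto-reload value: one period lasts arr + 1 ticks  */
} pwm_timing_t;

/* Choose the smallest prescaler that lets one PWM period fit into a
 * 16-bit counter (assumed width). Hypothetical helper. */
pwm_timing_t pwm_timing_for(uint32_t timer_clk_hz, uint32_t pwm_hz)
{
    assert(pwm_hz > 0u && timer_clk_hz >= pwm_hz);

    uint32_t ticks = timer_clk_hz / pwm_hz;     /* timer clocks per period (truncated) */
    uint32_t div = (ticks - 1u) / 65536u + 1u;  /* ceil(ticks / 65536)     */

    pwm_timing_t t;
    t.psc = div - 1u;
    t.arr = ticks / div - 1u;
    return t;
}
```

With a 16 MHz timer clock and a 1 kHz target, this yields PSC = 0 and ARR = 15999. The integer divisions mean the resulting frequency can differ slightly from the request when the clock does not divide evenly.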

> Is there an explanation for why the GPIO toggling takes so much longer to execute?

Writing to GPIO registers usually involves a bus access which can slow things down. And instructions are not in general 1 cycle each. You can look at the cycle counts per instruction for the M4 core to get an idea of what takes longer.

> Are there any additional configurations (other than default config that was used for this example) needed to achieve optimal MCU speed?

Enable compiler optimizations, use cache when available, use fast RAM such as DTCMRAM, put code which is run frequently into ITCMRAM.

xiangpu2001 · 2025-03-06

Thanks TDK,

> Since there are only 3 assembly instructions for each SET / RESET in the above code, we expected the execution time to be about 6 instruction cycles.

> The concept of 1 cycle per instruction (or even a fixed X cycles per instruction) is not something guaranteed on the M33 or other advanced cores. This is the price you pay for advanced performance.

The number of CPU clock cycles required to execute a specific ARM instruction is not fixed and depends on the CPU architecture. However, once the CPU architecture is fixed, the execution cycle of a given instruction also becomes fixed. For example, the instruction LDR R0, [R1, #0x4] running on a Cortex-M33 processor will always take a consistent number of clock cycles—it won’t vary between 1 cycle at one moment and 5 cycles at another. Is that correct?

> Is there an explanation for why the GPIO toggling takes so much longer to execute?

> Writing to GPIO registers usually involves a bus access which can slow things down. And instructions are not in general 1 cycle each. You can look at the cycle counts per instruction for the M4 core to get an idea of what takes longer.

I have enabled the DWT feature on the STM32U545, allowing me to monitor the CPU clock cycles required for each instruction. However, I am confused about the execution time of the basic ARM instruction MOV R0, #0x1. I observed that it takes 6 CPU clock cycles. Is this expected behavior, or could there be an issue with my code configuration?
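
One thing that may be worth ruling out, offered as a sketch rather than a diagnosis: reading the cycle counter costs cycles itself, so two back-to-back reads with nothing in between already give a non-zero difference. Subtracting that baseline gives a better estimate for the code under test. The example below hides the counter behind a function pointer so it can run on a host; on the target, the reader would return `DWT->CYCCNT` (the CMSIS name for the cycle counter). `measure_overhead`, `measure_cycles`, and the fake counter are hypothetical names, and the tick values in the fake are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*cycle_reader_t)(void);
typedef void (*code_under_test_t)(void);

/* Cycles consumed by the measurement itself: two reads, nothing between. */
uint32_t measure_overhead(cycle_reader_t read_cycles)
{
    assert(read_cycles != 0);
    uint32_t start = read_cycles();
    uint32_t end = read_cycles();
    return end - start; /* unsigned subtraction tolerates counter wrap-around */
}

/* Cycles attributed to fn after subtracting the read overhead.
 * The indirect call to fn adds its own cost on real hardware. */
uint32_t measure_cycles(cycle_reader_t read_cycles, code_under_test_t fn)
{
    assert(read_cycles != 0 && fn != 0);
    uint32_t overhead = measure_overhead(read_cycles);
    uint32_t start = read_cycles();
    fn();
    uint32_t end = read_cycles();
    uint32_t raw = end - start;
    return (raw > overhead) ? raw - overhead : 0u;
}

/* Host-side stand-in: the counter advances 4 ticks per read (arbitrary). */
uint32_t fake_now = 0u;
uint32_t fake_read(void) { fake_now += 4u; return fake_now; }

/* Host-side stand-in for the code under test: costs 3 ticks (arbitrary). */
void fake_work(void) { fake_now += 3u; }
```

With the fake counter, the raw difference around `fake_work` is 7 ticks, of which 4 are measurement overhead, so `measure_cycles(fake_read, fake_work)` reports 3. On hardware, averaging over many repetitions of the instruction is another way to dilute the fixed overhead.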

> Are there any additional configurations (other than default config that was used for this example) needed to achieve optimal MCU speed?

> Enable compiler optimizations, use cache when available, use fast RAM such as DTCMRAM, put code which is run frequently into ITCMRAM.

I have enabled the ICache, while the DCache remains disabled since it primarily affects external memory. However, I couldn't find any information about fast RAM, such as DTCM RAM or ITCM RAM, in the STM32U545 datasheet. Could you provide more details on this?

Thank you very much.