STM32F746 Timing Strangeness

jerry2 · ‎2022-02-17

I've got a Nucleo-F746 board and have noticed a code execution time anomaly that I can't explain.

The code running is very simple:

unsigned int j = 750000;
while(j--)
     ;

Essentially a loop that does nothing except waste time. When I run this on the board and time the loop with the SysTick timer, I get different results depending on whether I build and run the code with Atollic TrueSTUDIO, STM32CubeIDE (v1.8.0), or Segger Embedded Studio. The loop takes twice as long to run under STM32CubeIDE as it does on the other two.

This is not a code optimization issue, because the assembly code generated for the while loop by all three tools is exactly the same (and it runs with interrupts disabled):

loop:
ldr    r3,[sp, #4]
subs   r2,r3,#1
str    r2,[sp, #4]
cmp    r3,#0
bne.n  loop

I'm not using the IDE-provided startup code. I'm doing it all myself, including setting up the STM32's clocking, initializing the SysTick timer and caches, prefetch etc. The code is identical across all three IDEs, and I've verified that after it runs the system registers for clocking, etc., are exactly the same.

What makes this stranger is that when this code is run on STM32CubeIDE, the stack, where the counter for the loop is stored, is in DTCM memory, while it's in SRAM1 when running on Embedded Studio. In both cases, the code itself is running from the same part of FLASH memory.

This has me really puzzled. Anyone have any idea what may be going on?

KnarfB · ‎2022-02-18

The machine code doesn't know who compiled it, so there must be a difference in your MCU setup or measurement methods. You can check some frequencies using the MCO output pin and measure the SysTick frequency.

The Cortex-M7 has a cycle-accurate counter (DWT CYCCNT) in case you want to instrument your code further.

The stack ... is assigned in the linker files (.ld).

hth

KnarfB

Danish1 · ‎2022-02-18

One thing to examine is the alignment of the program in memory - not just half-words but words, double and maybe quadruple words.

Although Thumb instructions can be as short as 16 bits, FLASH memory accesses are often 128 bits or even 256 bits wide.

So depending precisely how many instructions are placed below your loop, so precisely where in FLASH the loop sits, there might be a different number of FLASH accesses and any associated wait-states.

I note you do mention cache and prefetch. Prefetch only benefits linear code, not tight loops. And I think branching/looping inevitably causes a pipeline flush on a core this simple (relatively simple by Pentium standards, that is). So I think whether the first fetch after the branch can get one or more instructions will make a difference.

Hope this helps,

Danish

jerry2 · ‎2022-02-18

I did some more checking. Before I did that, I set up the DWT cycle counter to get a more accurate timing of the loop.
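For reference, a minimal sketch of this kind of DWT cycle-counter setup. The register addresses are the architecturally defined Cortex-M debug addresses; the function names are made up for illustration, and the `__ARM_ARCH_7EM__` guard is an assumption about a GCC-style toolchain so that the arithmetic helpers can also be compiled off-target.

```c
#include <assert.h>
#include <stdint.h>

/* Architecturally defined Cortex-M debug/DWT register addresses. */
#define DEMCR_ADDR      0xE000EDFCu  /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR   0xE0001000u
#define DWT_CYCCNT_ADDR 0xE0001004u
#define DWT_LAR_ADDR    0xE0001FB0u  /* lock access register */

#if defined(__ARM_ARCH_7EM__)        /* target-only part (assumed GCC macro) */
#define REG32(a) (*(volatile uint32_t *)(a))

static inline void cyccnt_start(void)
{
    REG32(DEMCR_ADDR) |= 1u << 24;       /* TRCENA: enable the DWT unit */
    REG32(DWT_LAR_ADDR) = 0xC5ACCE55u;   /* unlock; needed on some M7 parts */
    REG32(DWT_CYCCNT_ADDR) = 0u;         /* reset the counter */
    REG32(DWT_CTRL_ADDR) |= 1u;          /* CYCCNTENA: start counting */
}

static inline uint32_t cyccnt_read(void)
{
    return REG32(DWT_CYCCNT_ADDR);
}
#endif

/* Elapsed cycles; unsigned subtraction survives one 32-bit wraparound. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to whole microseconds at the given core clock. */
static inline uint32_t cycles_to_us(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz > 0u);
    return (uint32_t)(((uint64_t)cycles * 1000000u) / cpu_hz);
}
```

On the target, the loop would be bracketed by two cyccnt_read() calls and the difference passed to cycles_elapsed(). At 216 MHz the 32-bit counter wraps roughly every 19.9 seconds, so a single delay loop fits comfortably.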

The cycle counter shows that the exact same code running under STM32CubeIDE takes twice as long to execute as it does under Atollic or Embedded Studio. This is very consistent. Turning off prefetch and the caches doesn't make much difference; the ratio is still 2x slower on STM32CubeIDE.

I checked alignment of the loop code in flash, and on STM32CubeIDE the start of the loop is located at 0x08000700 while on Embedded Studio it's at 0x080005A0, so it has better alignment on STM32CubeIDE.
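The alignment check can be made concrete with a little arithmetic. This is only a sketch: it assumes all five loop instructions use 16-bit Thumb encodings (10 bytes total) and tries both flash read widths mentioned earlier in the thread (128 and 256 bits). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of flash read lines touched by code occupying [addr, addr + len). */
static uint32_t flash_lines_spanned(uint32_t addr, uint32_t len,
                                    uint32_t line_bytes)
{
    /* line_bytes must be a non-zero power of two (16 or 32 here). */
    assert(len > 0u && line_bytes > 0u &&
           (line_bytes & (line_bytes - 1u)) == 0u);
    uint32_t first = addr / line_bytes;
    uint32_t last = (addr + len - 1u) / line_bytes;
    return last - first + 1u;
}

/* Byte offset of an address within its flash read line. */
static uint32_t offset_in_line(uint32_t addr, uint32_t line_bytes)
{
    assert(line_bytes > 0u && (line_bytes & (line_bytes - 1u)) == 0u);
    return addr & (line_bytes - 1u);
}
```

Under those assumptions both 0x08000700 and 0x080005A0 sit at offset 0 of a 32-byte line, and a 10-byte loop fits in a single line at either address, for either width. So alignment alone does not obviously separate the two builds.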

Since this is the exact same code in all three cases, the setup of the MCU clocking is identical; I've verified that by looking at the appropriate RCC registers at runtime, and they're all the same. I also ran another test using a UART to output some text strings, and it worked on all three platforms. I'd expect it not to work if the clocking were different on one, as the baud rate calculation would be wrong, but it works. I also output the PLL clock on the MCO pin and connected a frequency counter, and the measured PLL frequency is the same on all three platforms.
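To put numbers on the UART argument, here is a rough sketch of the baud-rate arithmetic for oversampling by 16, where the divider written to BRR is the peripheral clock divided by the baud rate. The 54 MHz peripheral clock and 115200 baud are assumed values for illustration only, not taken from the actual setup, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Divider for oversampling by 16: BRR = fck / baud, rounded to nearest. */
static uint32_t uart_brr(uint32_t fck_hz, uint32_t baud)
{
    assert(baud > 0u);
    return (fck_hz + baud / 2u) / baud;
}

/* Baud rate actually produced if the peripheral clock is fck_hz. */
static uint32_t uart_actual_baud(uint32_t fck_hz, uint32_t brr)
{
    assert(brr > 0u);
    return fck_hz / brr;
}
```

With those assumed numbers the divider comes out as 469, giving about 115138 baud at 54 MHz. If the clock feeding the UART were really halved to 27 MHz, the same divider would give about 57569 baud, which a terminal set to 115200 would not decode. This only says something about the clock tree up to the UART's bus, not about what the core itself is doing.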

I can understand various things making some difference in the execution time of this simple loop, but I can't think of anything that would result in a 2x difference except a difference in the CPU clock frequency, and I think I've ruled that out. I've stepped through all code from the reset vector through the jump to main(), and all of the code in the vendor-supplied startup code that messes with the clocks has been commented out and does not run. The only thing that affects the clocks is my own code, and it's identical on all three platforms and results in the same settings in the RCC registers before the timed code runs.

The only real difference is that the stack, where the loop counter is located, is in DTCM on STM32CubeIDE and in SRAM1 on Atollic and Embedded Studio. I doubt that could account for a 2x difference in loop execution time, and in fact the stack being in DTCM should be an advantage.

I'm open to any and all suggestions about what to look at next.

jerry2 · ‎2022-02-18

Here's a scope photo taken from the output of the MCO2 pin (PC9 on the STM32F746). The MCO2 output is set to divide by 4, so the measured frequency of 54 MHz corresponds to a PLL frequency of 216 MHz, which is what is programmed. I see the same MCO2 output frequency when the board is programmed by Atollic or STM32CubeIDE.

waclawek.jan · ‎2022-02-18

> I'm open to any and all suggestions about what to look at next.

- move stack to the same memory

- detach debugging adapter and reset (measure timing by wiggling a pin)

JW

jerry2 · ‎2022-02-18

I've tried detaching the debugger and measuring time with an oscilloscope. The results are the same: the loop runs twice as fast when the application is built with Atollic or Embedded Studio versus STM32CubeIDE.
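A minimal sketch of that kind of pin-wiggle measurement, for anyone wanting to repeat it. It assumes an STM32-style GPIO BSRR register (low 16 bits set a pin, high 16 bits reset it) and takes the register address as a parameter; the function names are hypothetical, and the pin must already be configured as an output.

```c
#include <assert.h>
#include <stdint.h>

/* Drive a pin high via BSRR: bits 0..15 set the matching output bit. */
static inline void pin_high(volatile uint32_t *bsrr, unsigned pin)
{
    assert(pin < 16u);
    *bsrr = 1u << pin;
}

/* Drive a pin low via BSRR: bits 16..31 reset the matching output bit. */
static inline void pin_low(volatile uint32_t *bsrr, unsigned pin)
{
    assert(pin < 16u);
    *bsrr = 1u << (pin + 16u);
}

/* Raise the pin, run the delay loop, lower the pin. The width of the
 * pulse on the scope is the loop time plus a few cycles of overhead. */
static void timed_delay(volatile uint32_t *bsrr, unsigned pin,
                        unsigned int count)
{
    volatile unsigned int j = count;  /* volatile keeps the loop in memory */
    pin_high(bsrr, pin);
    while (j--)
        ;
    pin_low(bsrr, pin);
}
```

On the F746, GPIOA's BSRR would be at 0x40020018 (GPIOA base 0x40020000 plus offset 0x18), so a call might look like timed_delay((volatile uint32_t *)0x40020018u, 5, 750000). The pin number here is an arbitrary choice.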

waclawek.jan · ‎2022-02-18

- move stack to the same memory

JW

jerry2 · ‎2022-02-18

Tried that too. No change in results.

Danish1 · ‎2022-02-19

It's interesting you're giving the addresses as 0x0800 0nnn

That puts the FLASH on the AXIM bus, which goes through the 4 kB instruction cache but doesn't have ART.

Have you compared execution times when the FLASH is at 0x0020 0nnn? That's where I put my code on STM32F7 so it can go over ITCM and make use of prefetch and the ART accelerator.
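The two flash views are the same memory at different base addresses, so converting between them is a fixed offset. A small sketch, assuming the STM32F74x memory map (AXIM alias at 0x08000000, ITCM alias at 0x00200000) and a 1 MB part; the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FLASH_AXIM_BASE 0x08000000u  /* flash as seen over the AXIM bus */
#define FLASH_ITCM_BASE 0x00200000u  /* same flash as seen over ITCM */
#define FLASH_SIZE      0x00100000u  /* 1 MB; an assumption about the part */

/* Translate an AXIM-view flash address to its ITCM-view alias. */
static uint32_t flash_axim_to_itcm(uint32_t addr)
{
    assert(addr >= FLASH_AXIM_BASE && addr - FLASH_AXIM_BASE < FLASH_SIZE);
    return addr - FLASH_AXIM_BASE + FLASH_ITCM_BASE;
}
```

So the loop at 0x08000700 would appear at 0x00200700 in the ITCM view. Actually running from there means changing the FLASH origin in the linker script, not just computing the address.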

----

I know you're not expecting any other accesses, but if you have a debugger looking at the stack, its accesses to DTCM might slow things down. But I wouldn't expect it to be precisely a 2:1 penalty.

Regards,

Danish