Why does the execution time of STM32F767ZIT6 (used in NUCLEO-F767ZI) not grow as expected when the amount of code to execute increases?

I measure the execution time of a function by placing a digital-output toggle and the function call inside a loop, then measuring the period of the resulting square wave at the digital output with an oscilloscope. The execution time is half a period.

To test the procedure, I replaced the function with dummy code (see the main function in the attachment execution_time_main.c), where the cnt_max parameter lets me vary the amount of code to execute.

The measured periods are shown in the attachment CRAZY_TIMES.jpg.

I do not understand why, with both the ARM V5 and V6 compilers in Keil uVision, the periods follow the trend shown in CRAZY_TIMES.jpg:

  • Comp V5, test cases 8 to 9: cnt_max goes from 8000 to 9000 (increases) and the period goes from 485 us to 454 us (decreases)!
  • Comp V5, test cases 10 to 11: cnt_max goes from 10 000 to 100 000 (x 10) and the period goes from 505 us to 11.1 ms (x 22)!
  • Comp V6, test cases 10 to 11: cnt_max goes from 10 000 to 100 000 (x 10) and the period goes from 607 us to 29.6 ms (x 49)!



I don't have much "ARM experience" (more with AVRs & FPGAs), anyway, that behavior is really strange.

Maybe somehow the debugging is kinda intrusive, always checking the volatile variable?


The M7 pipeline is considerably more complex than that of a PIC or 68000, which leads to things like this. You have 7 flash wait states, and if the processor needs to load from two different pages, it will take some time. You can't rely on the simplicity of a constant X cycles per instruction when you're running at several hundred MHz. The effect is exacerbated when your code is otherwise so short. Enabling the ART Accelerator may eliminate this effect entirely; it would be interesting to see.

I had similar problems without using the debugger, measuring times by toggling a digital output and reading the period on an oscilloscope.

For example, consider the following code:

#include "stm32f767xx.h"

/* Placeholder: the iteration count used in the original post is not shown. */
#define TEST_ITERATIONS 1000000U

void test_function (uint32_t iterations);
void initialize (void);
int main (void)
{
    volatile uint32_t system_core_clock_for_watch = SystemCoreClock; // @1
    initialize();
    for (;;)
    {
        GPIOB->ODR ^= GPIO_ODR_OD10;  // Toggle PB10
        test_function(TEST_ITERATIONS); // Code under test
    }
    return 0;
}

void test_function (uint32_t iterations)
{
    volatile uint32_t cnt = 0;
    /* Dummy code */
    while (cnt < iterations)
    {
        cnt++;
    }
}

void initialize (void)
{
    volatile uint8_t cnt = 0; /* NOTE. cnt is only used to generate a small
        delay. volatile ensures that compiler optimizations don't delete the
        delay. */
    /*
     * Configure PB10 (CN10.32 in NUCLEO-F767ZI) as DO
     */
    /* Enable IO port B clock */
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOBEN;
    /* Small delay.
       NOTE. From RM0410 rev 4 par. 5.2.12.
       "Just after enabling the clock for a peripheral, software must wait for
        a 2 peripheral clock cycles delay before accessing the peripheral
        registers."
       Port B is on AHB bus and AHB bus frequency is equal to HCLK frequency
       that is equal to 216 MHz. */
    cnt = 0;
    while (cnt < 100)
    {
        cnt++;
    }
    /* Configure MODER10 = 01b <==> General purpose output mode */
    GPIOB->MODER |=  (0 * GPIO_MODER_MODER10_1) | (1 * GPIO_MODER_MODER10_0);
    /* Configure OSPEEDR10 = 00b <==> low speed */
    GPIOB->OSPEEDR |= (0 * GPIO_OSPEEDR_OSPEEDR10_1) |
                      (0 * GPIO_OSPEEDR_OSPEEDR10_0);
}

The period of the signal at the PB10 digital output, measured with an oscilloscope, is 687.7 ms.


If you delete the line marked @1

volatile uint32_t system_core_clock_for_watch = SystemCoreClock; // @1

the period becomes 1.892 s (x 2.75)


In my original problem, the clock was at 216 MHz and consequently there were indeed 7 wait states. Now the clock is at 16 MHz and there are no wait states. See Table 7 in the reference manual RM0410 rev 4.


You don't appear to enable the MCU-side caching either.

The CM7 is superscalar, so how the underlying fetch mechanics and instruction pairing work could account for some instability in timing. These days I tend to focus on algorithmic improvements rather than instruction-level stuff, although obviously the longer the pipeline, the more impact stalls have, along with register dependency chains.

I don't think it's like the 68K, where I can annotate timings directly into the disassembly listings and get huge wins by instruction selection.

/**
  * @brief  CPU L1-Cache enable.
  * @param  None
  * @retval None
  */
static void CPU_CACHE_Enable(void)
{
  /* Enable I-Cache */
  SCB_EnableICache();
  /* Enable D-Cache */
  SCB_EnableDCache();
}

Hi @Tesla DeLorean, @KiptonM, @TDK and @LCE

Tesla DeLorean is right :)

I enabled the cache and almost all the previous anomalies disappeared. The only "anomaly" left is the one reported on February 16, 2022 at 11:20 AM (I do not understand why ... test cases 8 to 9: cnt_max goes from 8000 to 9000 (increases) and the period goes from 485 us to 454 us (decreases)!). As you already suggested, this is likely the compiler rearranging things based on the compiled-in value.

Are there any reasons for deciding not to enable the cache?

Hi @Tesla DeLorean, @KiptonM, @TDK and @LCE

Thank you all for the time you have dedicated to me.