
Generating a delay using "wasted" instructions

VVarg.1
Associate II

Hi All,

Using an STM32F429 and reading online documentation, I have frequently come across this code for generating delays in ms when SYSCLK is 16 MHz:

void delayMs(uint32_t n)
{
    int i;
    for (; n > 0; n--)
        for (i = 0; i < 3195; i++) ;
}

I can't understand why the value 3195 is used.

In my understanding, to achieve 1 ms of delay we need:

FOSC = 16 MHz

TOSC = 1 / 16 MHz = 62.5 ns

For 1 ms, I can then calculate how many cycles need to be "burned" with the following equation:

My_1_ms_counter = 1 ms / TOSC = 0.001 s / 62.5 ns = 16000

In summary, I need 16000 dummy instructions to be executed in order to get 1 ms of code execution (and thus delay).

So I come back to my question: why is 3195 used?

Thanks

Accepted Solutions

I can guess where the 3195 number comes from, and I can assure you it is wrong under at least some conditions and that you are just lucky it works as written. Here is the code we use to delay for *at least* a requested number of processor cycles (for an absolute delay in seconds, it is left to the caller to know what the current CPU frequency is and pass in the correct number of cycles):

static inline void
__spin_delay_m4(uint32_t cycles)
{
    // Spin for at least cycles MCU clock cycles.  SUBS + BNE is 3 cycles on
    // M4, so we round up to the next multiple of 3 cycles.
    uint32_t iterations = (cycles + 2)/3;
    asm volatile(
        ".balign 4;"
        "1:"
        "    subs %0, 1;"
        "    bne 1b;"
 
        :"+r"(iterations)
        :
        :"memory", "cc");
}

For a simple delay loop, there is a bit going on behind the scenes to make this possible. In this example, we assume that the code can be fetched with 0 wait states. If the loop is in flash and the flash implements a prefetch buffer (look for "ART Accelerator" in your reference manual) then we need to ensure the code is aligned to avoid straddling two prefetch lines which is why the ".balign 4" directive is there. If you have a full-on instruction cache or are executing from SRAM or ITCM then this may not be an issue for you. If you don't have any type of acceleration and you are running from flash, then this delay will probably take much longer than necessary unless your CPU is running slow enough that you can use 0 flash wait states.

This also assumes that the requested delay is a minimum and that it's OK to wait longer than requested. If interrupts are involved, the time spent in any interrupt taken is not accounted for, and the delay will be longer than strictly necessary. So it isn't suitable for realtime purposes and should only be used to satisfy constraints such as "wait at least 2 ms for the clock to stabilize".

In this case, to delay for 1 ms when the CPU is running at 16 MHz you would pass a delay value of 16000, which would get divided by 3 for a count of 5334 loop iterations. My guess is that in the for-loop case of your example, the compiled loop consists of some sequence that takes 5 cycles instead of 3 and the correct value for that would be roughly 3200 iterations - pretty close to your 3195. Of course, a better or different optimization level could then ruin your week some time in the future when that compiled loop now only takes 3 cycles instead of 5 to execute... and all your delays end up way too short and none of your device drivers work any more.
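
As a quick check of the arithmetic above, here is a minimal sketch in C that converts a requested delay into a loop count. The function name delay_iterations and its parameters are made up for illustration, and the cycles-per-iteration figure is an assumption you would have to confirm from your own listing file or a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: loop iterations needed to burn at least
 * `delay_us` microseconds on a core clocked at `cpu_hz`, assuming each
 * iteration of the delay loop costs `cycles_per_iter` CPU cycles. */
static inline uint32_t delay_iterations(uint32_t cpu_hz, uint32_t delay_us,
                                        uint32_t cycles_per_iter)
{
    assert(cycles_per_iter > 0);

    /* Total CPU cycles in the requested delay, rounded up. */
    const uint64_t cycles =
        ((uint64_t)cpu_hz * delay_us + 999999u) / 1000000u;

    /* Round up again so the delay is never shorter than requested. */
    return (uint32_t)((cycles + cycles_per_iter - 1u) / cycles_per_iter);
}
```

With cpu_hz = 16000000 and delay_us = 1000 this gives 5334 for a 3-cycle loop and 3200 for a 5-cycle loop, matching the figures above.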

Finally, as suggested by the name "__spin_delay_m4()", this is only valid for an M4 CPU. The M7 can use the same delay loop body, but with speculation it executes both the subtract and branch in a single cycle so the iteration count must be increased to match the desired cycle count (omit the (cycles + 2)/3 math). I haven't tested on an M0 or an M33 so cannot say what adjustment may be needed, if any.

As others have said, if you have access to a timer unit of some sort (SYSTICK, TIM or DWT) then those are all options for having a more accurate delay, and with those you could potentially put the CPU into low-power mode with a WFI instruction while waiting for an interrupt to wake you up at the end of the delay period. And they all have the benefit of being CPU-agnostic.

>>I come back to my questions of why does the 3195 is used?

Worked and benchmarked on the writer's system? Please don't use it blindly yourself..

I'd avoid this type of loop with arbitrary constants: the compiler could optimize it away, and different tools would yield different timing. Interrupts could also occur, consuming who knows how much time/cycles.

Better to use a free-running 16-bit or 32-bit TIM, and delta the CNT register, with an established count/tick rate.

Or use the DWT_CYCCNT register which counts processor cycles.
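
A minimal sketch of the "delta the counter" idea, assuming a free-running 32-bit up-counter. The names elapsed_ticks, delay_ticks and counter_read_fn are hypothetical. On the F4 the read function would typically return the CNT register of a 32-bit TIM or DWT->CYCCNT, and enabling that counter is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hook: returns the current value of a free-running
 * 32-bit up-counter (for example a 32-bit TIM or the DWT cycle counter). */
typedef uint32_t (*counter_read_fn)(void);

/* Ticks elapsed between two readings. Unsigned subtraction wraps
 * modulo 2^32, so the result stays correct across one counter rollover. */
static inline uint32_t elapsed_ticks(uint32_t start, uint32_t now)
{
    return now - start;
}

/* Busy-wait until at least `ticks` counter ticks have passed. */
static inline void delay_ticks(counter_read_fn read_counter, uint32_t ticks)
{
    assert(read_counter != 0);

    const uint32_t start = read_counter();
    while (elapsed_ticks(start, read_counter()) < ticks) {
        /* spin */
    }
}
```

For a 16-bit timer the same trick works if the difference is truncated to uint16_t. The delay is still a minimum: time spent in interrupts is included in the count, but the loop can overshoot by however long the last interrupt ran.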

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

Hi @Community member, totally agree, and I'm using TIMs on my project.

But I still can't figure out the math behind the 3195 value. After all, it works, but I can't understand why. Can you check the math one more time?

TDK
Guru

Please don't use arbitrary for-based loops which are subject to change based on compiler settings, IRQ activity, and other things.

Use something tied to the actual clock. DWT->CYCCNT is the best option on the F4, in my opinion.

There's a lot of crap code out there. For-based delay loops are used because they are easy to code, not because they are a good solution.

If you feel a post has answered your question, please click "Accept as Solution".

Hi @TDK, totally agree, and I'm using TIMs on my project as a solid time reference.

But I still can't figure out the math behind the 3195 value. After all, it works, but I can't understand why. Can you check the math one more time?

It's not important, and it's not worth anyone's time.

It's lazy coding, likely trimmed with a toggling GPIO and a scope to accommodate looping times, function returns, etc. If the method was EVEN that robust..

You'd need to dissect the listing file to understand what cycles it consumes accurately. There's frequently not a one-to-one relationship between C and Assembler lines, cycles, etc. Not worth trying to make one.

The code here was convenient for the person who coded it, and sufficient for all those that blindly copied it. Don't fall into that trap, or ponder why, what and how much..

The only way for code like this to be predictable across builds and tools would be to code it in assembler and count cycles. But STM32 flash memory is slow on F1 parts, and not entirely predictable on F4s with the wide caching fetch mechanism called ART.

I agree with everything Tesla is saying here. You could count the clock cycles per instruction. Not all instructions in the disassembly take a single cycle; many take more. These happen to add up to yield that magic value.

S.Ma
Principal

Actually, a software delay loop should first use a calibration method to work out the loop count. Adding a NOP in the loop will keep the compiler from removing the code. The loop is also affected by compiler options and compiler brand. Software delays are sometimes needed for quick bring-up or when there are no spare hardware resources. Use them as a minimum delay, as the duration may vary with interrupts etc. They are, however, recursive, so there is an unlimited supply.
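
One way the calibration idea could look, as a rough sketch only. The names spin_loops and loops_per_ms are hypothetical, the inline-asm syntax assumes GCC or Clang, and the cycle count passed in is assumed to come from timing a trial run with a hardware counter such as a TIM or DWT_CYCCNT.

```c
#include <assert.h>
#include <stdint.h>

/* Burn `loops` iterations. The volatile NOP has a side effect the
 * compiler must preserve, so the loop is not optimized away. */
static inline void spin_loops(uint32_t loops)
{
    while (loops--) {
        __asm__ volatile ("nop");
    }
}

/* Given that a trial run of `loops_run` iterations took `cycles_elapsed`
 * CPU cycles, return the iterations needed for 1 ms at `cpu_hz`.
 * Rounds up so the resulting delay is a minimum. */
static inline uint32_t loops_per_ms(uint32_t loops_run,
                                    uint32_t cycles_elapsed,
                                    uint32_t cpu_hz)
{
    assert(cycles_elapsed > 0);

    const uint64_t cycles_per_ms = cpu_hz / 1000u;
    return (uint32_t)(((uint64_t)loops_run * cycles_per_ms
                       + cycles_elapsed - 1u) / cycles_elapsed);
}
```

At startup you would time something like spin_loops(1000) once, feed the measured cycle count to loops_per_ms, and then call spin_loops(n * result) for an n ms minimum delay. If the trial run of 1000 loops took 5000 cycles at 16 MHz, the result is 3200 loops per ms.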

Hi @Community member, I believe you missed the original point, which was digging out a possible explanation, but I appreciate that you took the time to elaborate.