What's a good way to waste a clock cycle in STM32F4?

arnold_w · ‎2024-03-19

In the thread https://community.st.com/t5/stm32-mcus-embedded-software/why-does-my-tx-only-software-3-mbaud-uart-sometimes-send-strange/td-p/218077/page/2 I learned that "DSB" is a good way to waste a clock cycle in STM32F769. Now, what's a good way to waste a clock cycle in STM32F4? I don't want the instruction to cause traffic on the bus matrix and in my experience "NOP" gets optimized away by the GCC compiler. If I enforce no optimization, then the "NOP" appears to take 3 instructions instead of 1:

inline static void __attribute__((optimize("O0"))) nopNeverOptimizedAwayByCompiler() {
8027af4: b480 push {r7}
8027af6: af00 add r7, sp, #0
asm("NOP");
8027af8: bf00 nop
}

tjaekel · ‎2024-03-19

"DSB" is the "Data Synchronization Barrier" instruction: the code continues only after a preceding write (e.g. to a register) has really completed. If you did not write to a register beforehand, it will not have any effect.

There is also "ISB" ("Instruction Synchronization Barrier"): the code continues only after the preceding instructions have completed (this might make more sense here).

NOP is good to "waste a clock cycle".
You can also use NOP in C code, as __NOP();

It should not be optimized away. You can also tell the compiler via #pragma not to optimize (as you did):

#pragma GCC push_options
#pragma GCC optimize ("O0")

your code

#pragma GCC pop_options

I use __NOP() for microsecond delays and it works.

Whether a single __NOP(); takes one clock cycle or more also depends on whether the ICache is enabled: if the __NOP(); happens to be the first instruction of a new "Cache Line", the complete "Cache Line" is read first, which takes much more time (e.g. reading the next 32 instructions into the cache).

BTW: the code of your inline function - I do not understand completely: There is a push {r7} but never a pop for it.

BTW: with pipelining and "speculative prefetch" in ARM processors, it is hard to predict how many clock cycles are really spent. When using NOPs, I would suggest using a scope or measuring the elapsed time over a longer run, e.g. 100 NOPs, to get a clue how many cycles are "wasted". From that you can estimate the time for one NOP. But never rely on 1 clock cycle per NOP (it depends on caches). It can also "jitter" later (sometimes one clock cycle is needed, sometimes more).
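The "measure 100 NOPs and divide" idea can be sketched as follows. This is only an illustration under assumptions: the thread does not say how to take the timestamps, so the sketch assumes two readings of a free-running 32-bit cycle counter (on a Cortex-M4 with CMSIS headers, DWT->CYCCNT is one common choice, once it has been enabled). The helper name nop_cost_millicycles and its parameters are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: average cost of one NOP, in thousandths of a cycle.
 *   start, end : two readings of a free-running 32-bit cycle counter
 *   overhead   : cycles measured the same way with zero NOPs in between
 *   nop_count  : number of NOPs executed between the two readings
 *
 * On target the readings could be taken like this (assumes CMSIS and an
 * enabled DWT cycle counter):
 *   uint32_t t0 = DWT->CYCCNT;
 *   ... 100 x __NOP() ...
 *   uint32_t t1 = DWT->CYCCNT;
 */
static uint32_t nop_cost_millicycles(uint32_t start, uint32_t end,
                                     uint32_t overhead, uint32_t nop_count)
{
    /* Unsigned subtraction stays correct across one counter wrap-around. */
    uint32_t elapsed = end - start;

    if (nop_count == 0u || elapsed < overhead) {
        return 0u; /* nothing sensible to report */
    }
    return (uint32_t)(((uint64_t)(elapsed - overhead) * 1000u) / nop_count);
}
```

A result of 1000 would mean one cycle per NOP on average; anything above that hints at fetch stalls or other overhead. Repeating the measurement several times gives a feel for the jitter.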

Also:

When you do NOPs in a loop, like:

void NOP_Loop(int i)
{
 while(i--)
    __NOP();
}

bear in mind the overhead (additional clocks "wasted") of the while-loop itself (decrement, test and branch), and check that as well.

You can "unroll" the loop:

void NOP_Loop(void)
{
  __NOP();
  __NOP();
  __NOP();
  __NOP();
}

This makes it a bit more "precise".
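A middle ground between the loop and the fully unrolled version could look like the sketch below: the loop overhead is paid once per four NOPs, and a short remainder loop handles counts that are not a multiple of four. The names NOP_ONCE, NOP_X4 and nop_delay are hypothetical; NOP_ONCE is a local stand-in for the CMSIS __NOP() intrinsic so that the snippet is self-contained. The returned count exists only so the behaviour can be checked, and it adds a little overhead of its own.

```c
#include <stdint.h>

/* Local stand-in for the CMSIS __NOP() intrinsic (hypothetical name). */
#define NOP_ONCE()  __asm volatile ("nop")
#define NOP_X4()    do { NOP_ONCE(); NOP_ONCE(); NOP_ONCE(); NOP_ONCE(); } while (0)

/* Executes n NOPs and returns how many were issued. */
static uint32_t nop_delay(uint32_t n)
{
    uint32_t issued = 0u;

    while (n >= 4u) {   /* unrolled part: one loop iteration per 4 NOPs */
        NOP_X4();
        n -= 4u;
        issued += 4u;
    }
    while (n > 0u) {    /* remainder: 0 to 3 single NOPs */
        NOP_ONCE();
        n--;
        issued++;
    }
    return issued;
}
```

As with any NOP delay, the real number of cycles still has to be measured on the target.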

arnold_w · ‎2024-03-19

Thank you for your thorough and insightful post!

@tjaekel wrote:
BTW: the code of your inline function - I do not understand completely: There is a push {r7} but never a pop for it.

I wrote the function like this in my source code:

inline static void __attribute__((optimize("O0"))) nopNeverOptimizedAwayByCompiler() {
asm("NOP");
}

When I compile it in System Workbench and generate the dump file, the disassembly looks like this (I copied some extra lines before and after as well, because maybe I was wrong about where it starts and ends):

08027af4 <nopNeverOptimizedAwayByCompiler>:
#pragma GCC optimize ("O3")

#define DEFAULT_INT_PRIORITY (1)
static void SystemClock_Config_(void);

inline static void __attribute__((optimize("O0"))) nopNeverOptimizedAwayByCompiler() {
8027af4: b480 push {r7}
8027af6: af00 add r7, sp, #0
asm("NOP");
8027af8: bf00 nop
}
8027afa: bf00 nop
8027afc: 46bd mov sp, r7
8027afe: f85d 7b04 ldr.w r7, [sp], #4
8027b02: 4770 bx lr

08027b04 <main>:

int main(int argc, char* argv[]) {
8027b04: b500 push {lr}
8027b06: b083 sub sp, #12

So maybe I misunderstood the disassembly and my nop-function actually occupies 7 (!) instructions?

waclawek.jan · ‎2024-03-19

In 'F7 (i.e. Cortex-M7), DSB is recommended to waste time, as it is guaranteed to execute, whereas NOP may be purged from the pipeline before it reaches the execution stage. While ARM threatens that this may happen in Cortex-M3/M4, too, AFAIK that was never implemented, so NOP is a good time-waster there.

Note that, for multiple reasons, no time-wasting delay guarantees you an *exact* delay; rather, even if written as well as possible, it will guarantee "only" a *minimum* delay. It may last longer than the expected number of cycles, depending on circumstances; that may also mean jitter (i.e. the delay changes between individual passes through that piece of code). There's no good way to produce *exact* delays in 32-bitters; you ought to use hardware (peripherals) if you want exact timing of signals.

> So maybe I misunderstood the disassembly and my nop-function actually occupies 7 (!) instructions?

Show the place where you actually *call* that function.

Instead of your function, you can also simply use the __NOP() intrinsic (mandated by CMSIS) without further ado.

As happens here in the "callable" version (as opposed to the inlined one) of your function, the compiler may decide to add an extra "filler" nop to make the subsequent instruction aligned. Setting up compilers to do exactly what you want is tricky to impossible.

You may also want to write the *whole* section where you want precise (or as precise as possible) control over timing in asm, either inline or as a dedicated asm source file. Sprinkling C code with asm snippets may not result in what you want.

JW

arnold_w · ‎2024-03-19

Thank you for your very informative post!

@waclawek.jan wrote:
Show the place where you actually *call* that function.

This is what it looks like where I call it:

NVIC->STIR = I2C2_ER_IRQn;
8027b82: 4b2a ldr r3, [pc, #168] ; (8027c2c <main+0x128>)
8027b84: 2222 movs r2, #34 ; 0x22
8027b86: f8c3 2e00 str.w r2, [r3, #3584] ; 0xe00
nopNeverOptimizedAwayByCompiler(); // Wait one instruction for the software generated interrupt to be triggered
8027b8a: f7ff ffb3 bl 8027af4 <nopNeverOptimizedAwayByCompiler>
__ASM volatile ("cpsid i" : : : "memory");
8027b8e: b672 cpsid i

// I2C2_ER_IRQHandler() is now executing. Only CAN-bus has higher priority than I2C2_ER_IRQHandler

__disable_irq();

waclawek.jan · ‎2024-03-19

OK so the compiler for some reason does not inline that function... interesting, but not shocking. As I've said, it's hard to coax the compiler to do things strictly in the way one would like to.

You may try the always_inline attribute, or just write the inline asm where you need it, or make it a macro, or use the __NOP() intrinsic.
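A minimal sketch of the always_inline and macro options, assuming GCC: the asm is marked volatile explicitly (GCC documents that asm statements without output operands are implicitly volatile anyway), so no optimize("O0") attribute is needed, and dropping that attribute may also be what allows the function to be inlined. The names nop_always_inline, NOP_MACRO and trigger_then_nop are hypothetical, and the reg parameter merely stands in for NVIC->STIR so the snippet does not depend on device headers.

```c
#include <stdint.h>

/* Explicitly volatile asm, so the compiler must keep it.
 * No optimize("O0") attribute; always_inline asks GCC to inline at every call. */
static inline __attribute__((always_inline)) void nop_always_inline(void)
{
    __asm volatile ("nop");
}

/* Macro alternative: expands in place, so there is never a call. */
#define NOP_MACRO()  __asm volatile ("nop")

/* Mirrors the call site in the thread: write a trigger register, then
 * spend one instruction. `reg` stands in for NVIC->STIR (assumption). */
static uint32_t trigger_then_nop(volatile uint32_t *reg, uint32_t irq_number)
{
    *reg = irq_number;
    nop_always_inline();
    return *reg;
}
```

Checking the disassembly at the call site is still worthwhile: the expected result is a single nop right after the store, with no bl instruction.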

JW