Unexpected timing with nops

tompa · ‎2024-11-01

Hi,

I am confused by my pin toggle timings. Why am I getting unequal pulse widths, as in the picture? I turned off all compiler optimizations and disabled all interrupts; the code is simple pin toggling. My controller is an STM32L431 running at 80 MHz.

Thank you very much for any advice!

__disable_irq();
while (1) {
    GPIOB->BSRR = GPIO_BSRR_BS15;   /* set PB15 */
    __NOP();
    __NOP();
    GPIOB->BSRR = GPIO_BSRR_BR15;   /* reset PB15 */
    __NOP();
    __NOP();
}

KnarfB · ‎2024-11-01

Are you using one of those cheap logic analyzers with a sample rate of 24 MHz or lower?
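
To put a number on that question: a logic analyzer can only place an edge on one of its own sample instants, so any pulse width it reports is a whole number of sample periods. A minimal sketch of that arithmetic in C follows. The function names and the 62.5 ns example pulse are illustrative assumptions, not measurements from this thread.

```c
/* Sample period in nanoseconds for a given sample rate in Hz. */
static double sample_period_ns(double sample_rate_hz)
{
    return 1e9 / sample_rate_hz;
}

/* Smallest and largest number of sample periods a pulse can be reported as.
 * Which one you get depends on where the edges fall relative to the
 * analyzer's sample clock. */
static void pulse_sample_range(double pulse_ns, double sample_rate_hz,
                               int *min_samples, int *max_samples)
{
    double n = pulse_ns / sample_period_ns(sample_rate_hz);

    *min_samples = (int)n;  /* truncates toward zero, i.e. floor for n >= 0 */
    *max_samples = (n > (double)*min_samples) ? *min_samples + 1
                                              : *min_samples;
}
```

With a hypothetical 62.5 ns pulse and a 24 MS/s analyzer, the pulse spans 1.5 sample periods, so it is displayed as either 1 or 2 samples wide (about 41.7 ns or 83.3 ns) depending on phase.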

Tesla DeLorean · ‎2024-11-01

Perhaps stop doing things this way..

First up, use HW resources, like the TIM, to do stuff in a way that is immune to the processor operating in saturation mode. Perhaps have that output as a direct contrast to the two methods, so you're sure you're observing what you think you're observing. As KnarfB points out, this could be a bandwidth/Shannon/Nyquist issue: some odd beat frequency because you're not observing it with enough bandwidth.

If you're doing saturation mode, at least look at the generated code, or write it all in assembler.

Unwind the loop, use registers/pointers for the interactions.

Watch for how the pipelining, the write-buffers, and the FLASH line reading/caching interact.


tompa · ‎2024-11-01

Hi KnarfB,

Thank you for your reply. Yes, I am using a Saleae Logic at 24 MS/s, and my signal is 500 kHz, so I don't think that is the problem. I also checked the signal with a 70 MHz, 2 GSa/s Keysight oscilloscope and the result is the same...

tompa · ‎2024-11-01

Hi Tesla DeLorean,

Thank you for your reply,

I tried to use DWT but I am still getting unequal pulse widths.

CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

__disable_irq();
uint32_t start = DWT->CYCCNT;
while (1)
{
    /* USER CODE END WHILE */

    /* USER CODE BEGIN 3 */
    GPIOB->BSRR = GPIO_BSRR_BS12;
    start += 200;
    while (DWT->CYCCNT < start);

    GPIOB->BSRR = GPIO_BSRR_BR12;
    start += 200;
    while (DWT->CYCCNT < start);
}

Can you please explain this a little bit more:
"Watch for how the pipelining, the write-buffers, and the FLASH line reading/caching interact."
Thanks a lot

KnarfB · ‎2024-11-01

I used a Nucleo-L432KC and a 100 MHz Analog Discovery 3 USB scope. Waveforms look quite regular, as expected:

Code was compiled in Release mode. Had to change unavailable PB15 to PB4 and set it to very high speed. Double checked disassembly with >arm-none-eabi-objdump -d build\Release\cycle_timing.elf

The relevant loop is:

 800048e:       61a5            str     r5, [r4, #24]
 8000490:       bf00            nop
 8000492:       bf00            nop
 8000494:       61a3            str     r3, [r4, #24]
 8000496:       bf00            nop
 8000498:       bf00            nop
 800049a:       e7f8            b.n     800048e <main+0x16e>

The bit masks are pre-computed before the loop.

hth

KnarfB

KnarfB · ‎2024-11-01

Here is another plot showing PA8 (at very high speed) as master clock output (MCO) with SYSCLK/8 == 10 MHz in blue.

Despite the poor signal quality, you can see that each loop takes 8 cycles. This is consistent with the ARM® Cortex®-M4 Processor Technical Reference Manual (revision r0p1), assuming 2 cycles for the branch and one for each other instruction in the loop.
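
The cycle count can be checked with a few lines of arithmetic. This is only a sketch in C: the helper names are made up for illustration, and the per-instruction costs are the ones assumed in the post above (one cycle per STR or NOP, two for the taken branch).

```c
#include <stdint.h>

/* Cycles for one pass of the loop: a number of single-cycle instructions
 * plus one taken branch, using the costs assumed in the post above. */
static uint32_t loop_cycles(uint32_t single_cycle_instrs, uint32_t branch_cycles)
{
    return single_cycle_instrs + branch_cycles;
}

/* Duration of a number of core cycles in nanoseconds. */
static double cycles_to_ns(uint32_t cycles, double core_hz)
{
    return (double)cycles * 1e9 / core_hz;
}
```

For the disassembled loop (two STR, four NOP, one branch), loop_cycles(6, 2) gives 8, and cycles_to_ns(8, 80e6) gives 100 ns, which is one period of the 10 MHz MCO signal.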

tompa · ‎2024-11-01

Uh, I don't know what is happening in my case... I am using STM32Cube...

With 200 cycles I am getting 2.458 us and sometimes 2.583 us measured on the Saleae Logic, which is correct - I can see jitter on the scope...

What does your C code look like?

Tesla DeLorean · ‎2024-11-01

OK, so that advancing compare isn't how to do this. You want to take the delta of the measurements, and then compare that difference against the interval. That way wrapping is handled by the unsigned math directly.

while ((DWT->CYCCNT - start) < 200);

start += 200;
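
The difference between the two comparisons shows up when the 32-bit counter wraps, which at 80 MHz happens roughly every 53.7 s. Here is a small host-side sketch in C, with hypothetical helper names and plain integers standing in for DWT->CYCCNT:

```c
#include <stdbool.h>
#include <stdint.h>

/* Absolute compare, as in the earlier loop: keep waiting while now < deadline.
 * Misbehaves once the deadline has wrapped past 0xFFFFFFFF but the counter
 * has not wrapped yet. */
static bool still_waiting_absolute(uint32_t now, uint32_t deadline)
{
    return now < deadline;
}

/* Delta compare: unsigned subtraction is modulo 2^32, so the elapsed
 * count stays correct across a wrap. */
static bool still_waiting_delta(uint32_t now, uint32_t start, uint32_t wait_cycles)
{
    return (uint32_t)(now - start) < wait_cycles;
}
```

Take start = 0xFFFFFF9C and a 200-cycle wait, so the absolute deadline wraps to 100. Four cycles later, at now = 0xFFFFFFA0, the absolute compare already reports the wait as finished, while the delta compare sees 4 elapsed cycles and keeps waiting until 200 have passed.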

The pipeline is multiple cycles long, the throughput at an instruction level might be 1, but the completion time is not.

The writes occur asynchronously/later. Most parts here have 2 write buffers, which get processed at bus speeds before they stall the execution pipeline awaiting completion. This is why Hard Faults can be "imprecise": you have to walk back the PC to find the offending STORE instruction when a WRITE fails.

Several of the STM32 CM4 parts have a caching mechanism bolted on outside the core, in front of the FLASH. The FLASH and this cache can be quite wide, perhaps 64 or 128 bits. Reading a whole line takes the same 35 ns or so as reading a byte or a word, but once cached, the other parts of the line can be read/prefetched at no cost. The eviction mechanics and the branch alignment will impact how quickly each loop executes; assume it's not consistent. If you want more consistency, run code from RAM.

The bus the GPIO is on does not operate in a single cycle; a write is going to take at least 4 cycles to complete.
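
One way to try the run-from-RAM suggestion with a GCC toolchain is to put the timing-critical function in a section that the startup code copies into SRAM. The sketch below is an outline under stated assumptions, not a drop-in fix: the section name ".RamFunc" assumes a linker script that collects that section into the RAM-initialized data region (STM32Cube-generated GCC scripts typically do, but check yours), and RAM_FUNC and toggle_pin are hypothetical names. The register is passed in as a pointer so the function does not depend on any device header.

```c
#include <stdint.h>

/* Assumption: the linker script places ".RamFunc" in RAM and the startup
 * code copies it there. On a non-ARM host the macro expands to nothing. */
#if defined(__GNUC__) && defined(__arm__)
#define RAM_FUNC __attribute__((section(".RamFunc"), noinline))
#else
#define RAM_FUNC
#endif

/* Write the set mask and then the reset mask to a BSRR-style register,
 * `count` times. On target, pass &GPIOB->BSRR and the BS/BR bit masks. */
RAM_FUNC void toggle_pin(volatile uint32_t *bsrr, uint32_t set_mask,
                         uint32_t reset_mask, uint32_t count)
{
    while (count--) {
        *bsrr = set_mask;
        *bsrr = reset_mask;
    }
}
```

On the target, a call such as toggle_pin(&GPIOB->BSRR, GPIO_BSRR_BS15, GPIO_BSRR_BR15, 1000) would then fetch its instructions from SRAM instead of FLASH. This only removes the FLASH fetch and cache effects; the write-buffer and bus behaviour described above still applies.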


tompa · ‎2024-11-01

Uh, ok.

How do I put that code into RAM? Is it complicated?

Is there maybe any other solution to try?

Thank you very much!