Confusing behavior of asm volatile("nop")

JHöfl.1 · ‎2021-03-01

Hello dear ST-Community,

I'm currently working on an 8-bit interface for a TFT display. I'm using an STM32F405RGT6 for the project.

While debugging a timing issue I ran into, I found an interesting behavior when using __asm__ __volatile__("nop") for ns-scale delays. Here is the relevant portion of the code:

#define DELAY_5NS      {__asm__ __volatile__("nop");}
#define DELAY_10NS    {DELAY_5NS; DELAY_5NS;}
#define DELAY_15NS    {DELAY_10NS; DELAY_5NS;}
#define DELAY_20NS    {DELAY_10NS; DELAY_10NS;}
#define DELAY_25NS    {DELAY_20NS; DELAY_5NS;}
#define DELAY_50NS    {DELAY_20NS; DELAY_20NS; DELAY_10NS;}
#define DELAY_75NS    {DELAY_50NS; DELAY_25NS;}
#define DELAY_100NS  {DELAY_50NS; DELAY_50NS;}
#define DELAY_150NS  {DELAY_100NS; DELAY_50NS;}
#define DELAY_200NS  {DELAY_100NS; DELAY_100NS;}
#define DELAY_250NS  {DELAY_150NS; DELAY_100NS;}
 
int main(void)
{
    HAL_Init();
 
    SystemClock_Config();
    
    MX_GPIO_Init();
    MX_USB_DEVICE_Init();
    MX_TIM1_Init();   
 
    GPIOC->MODER = 0x55555555 & 0xFFFF;     // Setting Pins 0-7 of Port C as output
    GPIOC->OSPEEDR = 0x55555555 & 0xFFFF;  // Setting Pins 0-7 of Port C as "medium speed"
 
    while (1)
    {
        GPIOC->BSRR = 0b0000000011111111; 
        DELAY_5NS;
        GPIOC->BSRR = 0b0000000011111111 << 16;
        DELAY_5NS;
    }
}
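For readers following the register writes above, here is a minimal host-side sketch (not target code) of what the MODER value and the two BSRR writes do. It assumes the STM32F4 GPIO layout: a 2-bit MODER field per pin with 01 meaning general-purpose output, and a BSRR whose low half sets and high half resets output bits, with set taking priority. The helper names moder_set_pin, moder_outputs and bsrr_apply are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Write a 2-bit mode (0..3) into the MODER field of one pin (0..15). */
uint32_t moder_set_pin(uint32_t moder, unsigned pin, unsigned mode)
{
    assert(pin < 16 && mode < 4);
    moder &= ~(3u << (2 * pin));           /* clear the pin's 2-bit field */
    moder |= (uint32_t)mode << (2 * pin);  /* write the new mode */
    return moder;
}

/* Build a MODER value with every pin in pin_mask set to output (01). */
uint32_t moder_outputs(uint16_t pin_mask)
{
    uint32_t moder = 0;
    for (unsigned pin = 0; pin < 16; ++pin) {
        if (pin_mask & (1u << pin)) {
            moder = moder_set_pin(moder, pin, 1u);
        }
    }
    return moder;
}

/* Model of one BSRR write applied to the output data register:
 * bits 0..15 set pins, bits 16..31 reset pins, set wins if both are given. */
uint16_t bsrr_apply(uint16_t odr, uint32_t bsrr)
{
    uint16_t set = (uint16_t)(bsrr & 0xFFFFu);
    uint16_t reset = (uint16_t)(bsrr >> 16);
    return (uint16_t)((odr & (uint16_t)~reset) | set);
}
```

With this model, moder_outputs(0x00FF) gives the same value as 0x55555555 & 0xFFFF, and applying 0xFF followed by 0xFF << 16 toggles the low eight output bits on and off again, which is the square wave observed on PC0-7.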

When I observe PC0 (or any of PC0-7) without a delay like so:

while(1)
{
    GPIOC->BSRR = 0b0000000011111111;
    GPIOC->BSRR = 0b0000000011111111 << 16;
}

I get a period of roughly 71.5 ns (about 12 cycles at 168 MHz?)
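As a quick check of that arithmetic, here is a minimal sketch that converts between core cycles and nanoseconds. It assumes the core really runs at 168 MHz; the function names cycles_to_ns and ns_to_cycles are hypothetical helpers meant for a host PC, not for the target.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of a number of core cycles in nanoseconds. */
double cycles_to_ns(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz > 0);
    return (double)cycles * 1e9 / (double)clock_hz;
}

/* Nearest whole number of core cycles for a measured duration. */
uint32_t ns_to_cycles(double ns, uint32_t clock_hz)
{
    assert(ns >= 0.0);
    return (uint32_t)(ns * (double)clock_hz / 1e9 + 0.5);
}
```

At 168 MHz one cycle comes out at about 5.95 ns and 12 cycles at about 71.4 ns, so the measured 71.5 ns rounds to 12 cycles. By the same arithmetic, DELAY_250NS expands to 50 NOPs, which would be about 297.6 ns if every NOP took exactly one cycle.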

When running the code with DELAY_5NS (actually about 5.95 ns at 168 MHz) after each write, I expect the period to increase by about 11-12 ns. What I get is exactly half of that. I can also clearly see why, because there is absolutely no difference in timing between the following two snippets:

while(1)
{
    GPIOC->BSRR = 0b0000000011111111;
    GPIOC->BSRR = 0b0000000011111111 << 16;
    DELAY_5NS;
}
 
// same as
 
while(1)
{
    GPIOC->BSRR = 0b0000000011111111;
    DELAY_5NS;
    GPIOC->BSRR = 0b0000000011111111 << 16;
    DELAY_5NS;
}

I can prove it with oscilloscope screenshots, but I have to find a USB stick first.

The delay between the GPIO-Operations has to be at least DELAY_15NS to affect the actual signal on the scope.

So the question is obviously: WHY?

And how can I take this into account when dealing with complicated timing problems?

Thanks in advance for feedback!

Greetings

Johannes

waclawek.jan · ‎2021-03-01

http://www.efton.sk/STM32/r.png

SEV is single-cycle (i.e. NOP-like) instruction which generates a 1-cycle long pulse, here on PA1; r2 and r3 are set so that it's a set and clear of PA2 through BSRR; PA8 is system clock on MCO.

Note that while the two writes to GPIOA_BSRR are separated by one cycle, the output pulse on PA2 is single-cycle, i.e. the writes arrive at BSRR one after the other in two consecutive cycles. The reason is that after the first write from the processor to the AHB bus containing GPIOA, the arbiter of that bus delays the write by one cycle for the arbitration. As the bus is already acquired by the processor at the moment when the second write arrives, there is no more arbitration delay and the write goes through immediately.

All of this holds only if the write buffer on the processor's S port is switched on. If it's switched off using SCB_ACTLR.DISDEFWBUF = 1 (http://www.efton.sk/STM32/r3_DISDEFWBUF.png upper waveform with write buffer on, lower with write buffer off), the picture changes dramatically, as after each write the processor waits until the write is completed.
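The arbitration effect described above can be illustrated with a toy model. This is only a sketch under loud assumptions, not a description of the real bus matrix: the write buffer is on, acquiring an idle bus costs exactly one extra cycle (ARBITRATION_DELAY), the bus counts as still held if the previous write landed no earlier than the cycle before the next one is issued, and the peripheral accepts one write per cycle. The name model_arrivals is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cost, in cycles, of acquiring the bus when it is idle. */
#define ARBITRATION_DELAY 1u

/*
 * issue[i]  : cycle in which the CPU issues write i (strictly increasing).
 * arrive[i] : cycle in which the toy model lets write i reach the peripheral.
 */
void model_arrivals(const unsigned *issue, unsigned *arrive, size_t n)
{
    assert(issue != NULL && arrive != NULL);
    for (size_t i = 0; i < n; ++i) {
        /* The bus is treated as still held if the previous write landed
         * in the cycle just before this write is issued, or later. */
        int bus_held = (i > 0) && (arrive[i - 1] + 1u >= issue[i]);
        if (bus_held) {
            /* No new arbitration: the write follows in the next cycle. */
            arrive[i] = arrive[i - 1] + 1u;
        } else {
            /* Idle bus: pay the arbitration delay first. */
            arrive[i] = issue[i] + ARBITRATION_DELAY;
        }
    }
}
```

In this model, writes issued in cycles {0, 1} and writes issued in cycles {0, 2} both arrive in cycles {1, 2}, so one extra cycle between two back-to-back writes does not change the pulse width seen on the pin. Only when the gap is large enough for the bus to be released, e.g. {0, 5}, does the spacing at the pin follow the spacing in the code.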

JW

waclawek.jan · ‎2021-03-01

> So the question is obviously: WHY?

First of all, remove the dependency on the compiler, i.e. use assembler (or at least look at the disasm).

Second, remove the FLASH-related issues, i.e. run the code from RAM.

But even then, the problem is still much more complex than you'd expect if you come from the nice synchronous world of 8-bitters. The 32-bitters are complex beasts, memories and peripherals slapped to a processor and held together with a fabric of buses and a tangled bunch of forward and feedback signals. While still deterministic, timing is so complex that it's not worth trying to describe it in detail, and manufacturers simply don't.

> And how can I take this into account when dealing with complicated timing problems?

In 32-bitters, don't attempt tight timing using software. Use hardware; in your case, FSMC (not that it doesn't have its own share of gotchas).

JW

JHöfl.1 · ‎2021-03-01

Thanks for the detailed explanation!

I was afraid that it had something to do with the way the BSRR works, but it's much more complicated than that, I suppose.

I was testing other means of software delay, like writing to the BSRR multiple times. If I understand you correctly, I would get consistent timings that way, since the bus is already acquired.

I have to look at the disasm next. I was under the impression that __volatile__ would be enough to prevent the compiler from optimizing it away, but who knows.

I never used FSMC before, since I always thought Bit-Banging would be simpler. I may have to look into it.

waclawek.jan · ‎2021-03-01

> I was under the impression that __volatile__ would be enough to prevent the compiler from optimizing it away

It probably is, but as you've said, one never knows. Using asm/looking at disasm means taking one unknown out of the equation.

It's the chip internals which bite you, though.

Btw., ARM doesn't recommend using NOP as a time-wasting instruction, as it may be removed early in the pipeline before it reaches the processor's execution unit, but in my experience the F4 does not do this.

> Bit-Banging would be simpler

Maybe it is; it's surely easier to set up and more flexible in pin usage.

From a programming standpoint, with FSMC you can write a single byte/halfword into it and forget about it; the FSMC performs the necessary handshake automatically, maintaining whatever timing you set. You can even DMA into it, which may be convenient when it comes to displays. OTOH, as I've said, it has its own set of idiosyncrasies, e.g. there's an unadjustable single-cycle data hold in write.

JW

JHöfl.1 · ‎2021-03-02

I was about to test the FSMC since I found a nice explanation online. But it seems my chip (STM32F405RGT6, the 64-pin version) does not support it, at least according to STM32CubeIDE. The datasheet, however, claims the whole STM32F405xx family supports it. Is it because of the low pin count?

waclawek.jan · ‎2021-03-03

Yes, on the 64-pin package there's virtually no FSMC pin brought out (most of the FSMC pins are those which are added onto the 100-pin package compared to the 64-pin one, and yet there still are not enough addresses so for full FSMC you have to go for the 144-pin package).

The next technique to contemplate is outputting data using timer-triggered DMA, while generating the control signals (WR/CS/whatever) using the timer output compare channels themselves, maybe using chained timers where appropriate. This also has its own gotchas (e.g. you can use only TIM1 and TIM8), and if it has to be safe (DMA can be delayed by conflicts within the DMA unit and by conflicts on the buses when reading from memory and writing to GPIO) it might have to be relatively slow, but again a chunk of data can be output effectively without processor intervention, which in a typical graphics application is a bonus.

JW
