inline assembly macro inconsistent

RBack.1 · ‎2023-08-05

I am using the STM32F746VET6. I am aware that assembly instructions do not have a guaranteed execution time; however, I am surprised how consistently inconsistent my inline assembly macro is behaving. I'm really hoping there is just a mistake in the macro that I can fix. My IVT is in ITCM (memory address 0x0) and the ISR is also in ITCM (memory address 0x200). The 1st, 2nd, 4th, 5th, and 6th pin states show very close to the same delay every time. The 3rd and 7th pin states, however, are always ~1.7x as long as the others.

This is the macro:

#define nop_delay_def(delay_var) __asm volatile (".Lloop%=:               \n\t" \
                                                 "     subs  %[delay], #1 \n\t" \
                                                 "     bne.n .Lloop%=         " \
                                                 :                              \
                                                 : [delay] "r" (delay_var))

And this is the ISR:

void __attribute__((optimize("O0"), section(".itcm_text"))) TIM1_TRG_COM_TIM11_IRQHandler(void) {

    PULSER1_GPIO_Port->ODR |= PULSER1_IN1_POS_Pin;
    nop_delay_def(60);
    PULSER1_GPIO_Port->ODR |= PULSER1_IN2_POS_Pin;
    nop_delay_def(60);
    PULSER1_GPIO_Port->ODR &= ~PULSER1_IN1_POS_Pin;
    nop_delay_def(60);
    PULSER1_GPIO_Port->ODR &= ~PULSER1_IN2_POS_Pin;
    nop_delay_def(60);
    PULSER1_GPIO_Port->ODR |= PULSER1_IN1_POS_Pin;
    nop_delay_def(60);
    PULSER1_GPIO_Port->ODR |= PULSER1_IN2_POS_Pin;
    nop_delay_def(60);
    PULSER1_GPIO_Port->ODR &= ~PULSER1_IN1_POS_Pin;
    nop_delay_def(60);
    PULSER1_GPIO_Port->ODR &= ~PULSER1_IN2_POS_Pin;
    HAL_IWDG_Refresh(&hiwdg); // kick the watch dog since FreeRTOS is suspended

    TIM11->SR = 0;
}// TIM1_TRG_COM_TIM11_IRQHandler

I have attached a scope shot at the bottom. There is no jitter or run-to-run variation in the pulses; the macro is inconsistent in exactly the same way every time.
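One detail in the macro above is worth checking, although it probably does not explain the 1.7x stretch by itself: the `subs` instruction modifies `%[delay]`, yet the operand is declared as input-only (`"r"`), and the condition flags are not listed as clobbered. GCC's extended-asm rules say input-only operands must not be modified. The sketch below is one possible rewrite as an inline function with a read-write operand (`"+r"`) and a `"cc"` clobber. The name `nop_delay` and the host fallback branch are assumptions added so the example compiles off-target; they are not from the original code.

```c
#include <stdint.h>

/* Busy-wait for `count` loop iterations. `count` must be >= 1,
 * because a count of 0 would wrap around to 2^32 iterations. */
static inline void nop_delay(uint32_t count)
{
#if defined(__thumb__)
    __asm volatile (
        ".Lloop%=:                  \n\t"
        "    subs  %[delay], #1     \n\t"
        "    bne.n .Lloop%=         \n\t"
        : [delay] "+r" (count)      /* read-write: the loop decrements it */
        :
        : "cc");                    /* subs updates the condition flags */
#else
    /* Host fallback (assumption): lets the sketch build off-target.
     * Its timing is meaningless. */
    volatile uint32_t n = count;
    do {
        n--;
    } while (n != 0);
#endif
}
```

In the ISR, `nop_delay_def(60);` would become `nop_delay(60);`. Because the function is `static inline`, the loop can still land at a different address and alignment at each call site, so this addresses correctness of the constraints rather than placement.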

Tesla DeLorean · ‎2023-08-05

Could you not put code fragments in startup.s, where you can control the alignment, placement and interworking?

Don't use read-modify-write (RMW) on the GPIO ODR; use the BSRR to change the state of bit(s) in a single write cycle.
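As a rough illustration of the BSRR suggestion: on STM32 GPIO ports, writing a 1 to bits 0..15 of BSRR sets the corresponding pin and writing a 1 to bits 16..31 resets it, so no read of the port is needed. The helper names below (`bsrr_set`, `bsrr_reset`, `gpio_write_bsrr`) are hypothetical and only exist to keep the sketch self-contained.

```c
#include <stdint.h>

/* BSRR layout: bits 0..15 set a pin, bits 16..31 reset a pin. */
static inline uint32_t bsrr_set(uint16_t pins)
{
    return (uint32_t)pins;
}

static inline uint32_t bsrr_reset(uint16_t pins)
{
    return (uint32_t)pins << 16;
}

/* Single store to the register; `bsrr` stands in for the address of
 * the port's BSRR register on the target. */
static inline void gpio_write_bsrr(volatile uint32_t *bsrr, uint32_t word)
{
    *bsrr = word;
}
```

With the names from the ISR above, `PULSER1_GPIO_Port->ODR |= PULSER1_IN1_POS_Pin;` would become `PULSER1_GPIO_Port->BSRR = bsrr_set(PULSER1_IN1_POS_Pin);`, and the `&= ~` lines would use `bsrr_reset(...)` instead.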

TDK · ‎2023-08-05

The Cortex-M7 core just doesn't make a guarantee you can use here. I don't see any issues with the macro. You could use DWT->CYCCNT to get within probably about 3 ticks of your target.
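A minimal sketch of the cycle-counter polling idea, assuming the counter has already been enabled (with CMSIS this usually means setting `CoreDebug_DEMCR_TRCENA_Msk` in `CoreDebug->DEMCR` and `DWT_CTRL_CYCCNTENA_Msk` in `DWT->CTRL`). The function names are hypothetical, and the counter is passed as a pointer so the example stays self-contained; on the target it would be `&DWT->CYCCNT`.

```c
#include <stdint.h>

/* True once at least `cycles` ticks have passed since `start`.
 * The unsigned subtraction stays correct across counter wrap-around. */
static inline int deadline_reached(uint32_t now, uint32_t start, uint32_t cycles)
{
    return (uint32_t)(now - start) >= cycles;
}

/* Spin until the free-running counter has advanced by `cycles`.
 * `counter` is an assumption standing in for &DWT->CYCCNT. */
static inline void wait_cycles(const volatile uint32_t *counter,
                               uint32_t start, uint32_t cycles)
{
    while (!deadline_reached(*counter, start, cycles)) {
        /* busy-wait */
    }
}
```

The polling loop itself takes a few cycles per pass, which is where the "within about 3 ticks" granularity mentioned above comes from.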

RBack.1 · ‎2023-08-05

@Tesla DeLorean wrote:
Could you not put code fragments in startup.s where you can control the alignment, placement and interworking ?
Don't use RMW on the GPIO ODR, use the BSRR to change the state of bit(s) in a single write cycle

I'll look into putting code fragments in startup.s if that's a possible solution to this problem! I can look into BSRR as well. Thanks!

RBack.1 · ‎2023-08-05

Thanks. I understood that there is no guarantee; I just expected that to mean a few ticks of variation at worst, rather than consistently 1.7x the execution time of the loop. Unfortunately, my application needs to be accurate down to one tick.

The obvious correct solution is to use an output compare instead of code to control the timing but that was complicated by the need to drive 15 pins in 7 states. We do think we may have come up with a solution to this using multiple timers and DMA but our board isn't currently laid out to use them.

The really frustrating thing is a previous version of code did test to have accurate timing down to one tick and we weren't even using an inline assembly macro at that time: we were calling an assembly function! It seems very small changes to the code result in very big changes in the timing, however, and we just got lucky with that previous version of code (for a whole year!).

KnarfB · ‎2023-08-05

If similar code used to be good enough before, check what else has changed in the system setup, like I and D caches?

Try unrolling the loop and make it branchless. More code, but more deterministic.
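One way to sketch an unrolled, branchless delay with the GNU assembler is the `.rept` directive, which repeats an instruction a fixed number of times at assembly time. The macro name `NOP_BLOCK` is hypothetical. Note that Arm documents NOP as not necessarily time-consuming (the core may drop it from the pipeline), so the cycle count per NOP on a Cortex-M7 should be measured rather than assumed.

```c
/* Two-step stringification so a macro argument expands before it is
 * pasted into the assembler text. */
#define STR_(x) #x
#define STR(x)  STR_(x)

/* Emit `n` consecutive NOP instructions with no loop and no branch.
 * `n` must be an integer literal or a macro expanding to one. */
#define NOP_BLOCK(n) __asm volatile (".rept " STR(n) "\n\t" \
                                     "nop\n\t"              \
                                     ".endr")
```

For example, `NOP_BLOCK(20);` between two pin writes places twenty NOPs inline at that point.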

Try timer-driven DMA to GPIO register if those pins are concentrated on one or a few ports. No guarantee for perfect timing either, but it may be worth a try.
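For the timer-driven DMA idea, one common arrangement is to precompute a table of BSRR words, one per output state, and let the DMA copy the next word to the port's BSRR on each timer event. The sketch below only builds such a table; the timer and DMA configuration are not shown, and the function names and the `mask` parameter (the pins this sequence is allowed to touch) are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert desired pin levels into one BSRR word: controlled pins that
 * are high in `state` go into the set half (bits 0..15), controlled
 * pins that are low go into the reset half (bits 16..31). Pins outside
 * `mask` are left untouched. */
static inline uint32_t bsrr_word_for_state(uint16_t state, uint16_t mask)
{
    uint32_t set_bits   = (uint32_t)(state & mask);
    uint32_t reset_bits = (uint32_t)((uint16_t)~state & mask);
    return set_bits | (reset_bits << 16);
}

/* Fill `out[0..n-1]` with the BSRR word for each entry of `states`. */
static inline void build_bsrr_table(const uint16_t *states, size_t n,
                                    uint16_t mask, uint32_t *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = bsrr_word_for_state(states[i], mask);
    }
}
```

Each table entry fully defines the controlled pins, so a single 32-bit transfer per state is enough for all pins that share one port.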

And, I definitely like the idea of @TDK, not using a fixed-length loop, but polling DWT->CYCCNT until the delay time is up.

hth

KnarfB

waclawek.jan · ‎2023-08-06

Isn't there an overlooked obvious solution, i.e. a higher-priority interrupt prolonging the 3rd and 7th pulse?

If not, this is probably the price you pay for the performance. The CM7 is an overcomplicated beast, mildly superscalar, with features like speculative jump prefetch (with no public description of its algorithm). I'm not sure whether this accounts for the 1.7x time penalty, though; that's why the question above.

The asm version (i.e. a called function) has a somewhat bigger chance of being consistent, as it always sits at the same address and uses the same registers.

JW

RBack.1 · ‎2023-08-06

We've been tearing our hair out trying to figure out what the difference is between the old and new code. We've disabled all of the advanced features we can disable in MX but our old code functioned as expected without doing that!

We have tried unrolling the loop but frustratingly we found that 20 asm nops in the first cycle and 30 asm nops in the second is what resulted in the pulses being the same width!

We do think we've figured out a solution using hardware but we need to test it. The issue is that we have 15 pins we need to drive in seven states so a timer/dma solution isn't trivial and our PCB isn't currently laid out with the correct pins.

The DWT->CYCCNT solution can be used for wide pulse widths, but unfortunately we need the pulse widths to be within 1% down to 70 ns (we operate at 216 MHz).
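To put those numbers in cycles: at 216 MHz one core cycle is about 4.63 ns, so 70 ns is about 15.12 cycles. The nearest whole count, 15 cycles, gives about 69.44 ns (roughly 0.8% short), while being off by a single cycle is about 6.6% of 70 ns. The small helpers below reproduce that arithmetic with integer math; the function names are hypothetical.

```c
#include <stdint.h>

/* Nearest whole number of core cycles for a duration in nanoseconds. */
static inline uint32_t ns_to_cycles(uint32_t ns, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)ns * core_hz + 500000000u) / 1000000000u);
}

/* Duration of `cycles` core cycles in picoseconds, rounded down. */
static inline uint32_t cycles_to_ps(uint32_t cycles, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000000ull) / core_hz);
}
```

So a 1% tolerance at 70 ns leaves no room for even a one-cycle deviation, which is why a polling loop with a granularity of a few cycles cannot meet it.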

RBack.1 · ‎2023-08-06

I didn't post the code waiting for our ISR but it basically calls __disable_irq() then enters a while(1).

We've been battling with this overcomplicated beast for a while now. It is good for our higher-level algorithms but really bad at simple things like driving I/O reliably. A hardware solution is difficult, as we have 15 pins that need to be driven in seven states, but a young engineer thinks he might have developed a solution in his personal time. It would require us to re-lay out the board, though.

The 1.7x is a much larger variation than we've ever seen before, which makes me hope there's something else going on that we can fix. The most frustrating part is that my code from last year uses an assembly function call and pays no attention to the overcomplicated device features... yet it V&Vs to show perfect timing! There must be some difference, but we just can't figure it out.

Piranha · ‎2023-08-07

KnarfB already mentioned the L1 cache memories...

Also, DWT and SysTick are located in the CPU, so accesses to them shouldn't incur any delays.