2022-02-15 02:28 AM
I am working with the STM32 Nucleo-64 development board, which has an STM32L476R microcontroller. My SystemCoreClock is 16 MHz and TIM17 is clocked at 4 MHz. To my surprise, the code below only works reliably (i.e. the timer doesn't miss the next compare match and wrap around) if I increment by at least 23:
#pragma GCC push_options
#pragma GCC optimize ("O3")
void TIM1_TRG_COM_TIM17_IRQHandler(void) {
    GPIOC->BSRR = 0x00000400;  // Set test pin PC10 high
    GPIOC->BSRR = 0x04000000;  // Set test pin PC10 low
    TIM17->SR = 0;             // Clear interrupt flags
    TIM17->CCR1 += 23;         // OK
    // TIM17->CCR1 += 22;      // Not ok
}
#pragma GCC pop_options
Now, 23 timer ticks correspond to 23 x 4 = 92 CPU clock cycles, and it seems unlikely that those four lines of code would take 92 cycles to execute. When I store the TIM17->CNT value in a global variable first thing in the interrupt routine above, I can see that TIM17->CNT is already 8 (!) more than TIM17->CCR1, meaning it took roughly 8 x 4 = 32 CPU clock cycles just to enter the interrupt routine! I tried to put the interrupt vector in RAM, but that made it worse! What am I doing wrong, and why is there so much latency in my interrupt?
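The measurement looks roughly like this (a minimal sketch; the global's name is made up):

volatile uint16_t cnt_at_entry;  // illustrative global for the latency measurement

void TIM1_TRG_COM_TIM17_IRQHandler(void) {
    cnt_at_entry = TIM17->CNT;   // first statement: sample the counter on entry
    // ... rest of the handler as above ...
}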
2022-02-15 08:07 AM
"What do you mean by "wraps around"? Repeated ISR calls because of the late interrupt-source clear?"
If I add too small a number (22 or less) in the last line of the interrupt routine, then TIM17->CNT has already surpassed the new TIM17->CCR1 value and the timer will count all the way up to 65535, wrap around, and only trigger the interrupt when it reaches TIM17->CCR1 again, which is 65536 timer ticks (= 16.384 ms) too late.
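That failure mode can also be detected inside the handler itself; a minimal sketch (the half-range test and the recovery comment are illustrative assumptions, not my actual code):

void TIM1_TRG_COM_TIM17_IRQHandler(void) {
    TIM17->SR = 0;                                 // clear the interrupt flags
    uint16_t next = (uint16_t)(TIM17->CCR1 + 22);
    TIM17->CCR1 = next;
    // If CNT has already passed the new compare value, the match was missed
    // and the next interrupt would come a full wrap (65536 ticks) late.
    if ((uint16_t)(next - TIM17->CNT) > 0x8000u) {
        // missed: count it, or retrigger/handle immediately
    }
}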
I tried changing the following in STM32CubeIDE 1.6.1 (Properties -> C/C++ Build -> Settings -> MCU Settings), but it made no difference:
Floating-point Unit: "FPv4-SP-D16" -> "None"
Floating-point ABI: "Hardware implementation (-mfloat-abi=hard)" -> "Software implementation (-mfloat-abi=soft)"
Runtime library: "Reduced C (--specs=nano.specs)" -> "Standard C"
The ASPEN and LSPEN bits of the FPU->FPCCR register were both 1 (=enabled), but setting them to 0 didn't make a difference either.
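For reference, clearing them via the CMSIS register definitions looks something like this (a minimal sketch; FPU and the bit masks come from core_cm4.h via the device header):

#include "stm32l4xx.h"  // CMSIS device header

static void fpu_disable_lazy_stacking(void) {
    // Clear ASPEN (automatic state preservation) and LSPEN (lazy state
    // preservation) so exception entry never reserves or stacks FPU state.
    FPU->FPCCR &= ~(FPU_FPCCR_ASPEN_Msk | FPU_FPCCR_LSPEN_Msk);
}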
The CPU is doing pretty much nothing when the interrupt occurs, it's just executing __WFI().
2022-02-15 08:19 AM
If I move my interrupt vector to RAM2 (address 0x10000000) then I can add 22 on the last line and it still works fine.
2022-02-15 09:09 AM
> If I add a too small number (22 or less) in the last line in the interrupts routine, then TIM17->CNT
> have already surpassed the new TIM17->CCR1 value
I don't think you're interpreting this correctly. This would happen only if more than the said 4 x 22 cycles elapsed between the interrupt trigger and the store in
TIM17->CCR1 += 23; // OK
802ef1a: 6b5a ldr r2, [r3, #52] ; 0x34
802ef1c: bc10 pop {r4}
TIM17->CCR1 += 23; // OK
802ef1e: 3217 adds r2, #23
802ef20: 635a str r2, [r3, #52] ; 0x34
and that I find improbable, unless you're running some other interrupt, too.
[EDIT] in a bout of stupidity I somehow thought it's CCR1 = CNT + delta [EDIT]
> The CPU is doing pretty much nothing when the interrupt occurs, it's just executing __WFI().
Oh, sleep? But you did not mention that previously!
And which form of it, exactly? That might be a significant game changer; wakeup from various sleep modes is not instantaneous.
Try without it, just a while(1);
JW
2022-02-15 02:54 PM
One more thing,
> SystemCoreClock is 16 MHz and TIM17 is clocked at 4 MHz.
Are you sure? Are you aware of the fact that if the APB divider is > 1, the TIM clock is twice the APB frequency?
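A sketch of that rule for TIM17, which sits on APB2 (assuming the standard CMSIS/HAL definitions; HAL_RCC_GetPCLK2Freq() and the register names come from the L4 headers):

#include "stm32l4xx_hal.h"

static uint32_t tim17_clock_hz(void) {
    // RM0351 rule: if the APB2 prescaler is 1, TIM17CLK = PCLK2,
    // otherwise TIM17CLK = 2 * PCLK2.
    uint32_t ppre2 = (RCC->CFGR & RCC_CFGR_PPRE2) >> RCC_CFGR_PPRE2_Pos;
    uint32_t pclk2 = HAL_RCC_GetPCLK2Freq();
    return (ppre2 < 4u) ? pclk2 : 2u * pclk2;  // encodings 0..3 mean "not divided"
}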
JW
2022-02-15 05:39 PM
Performance also depends on the compiler.
I would toggle the pin instead of doing a pulse glitch in the ISR, for better visibility.
Are there other interrupts with the same or higher priority, such as SysTick, whose side effects you could be seeing?
The core should have a cycle-counter debug HW register to grab a timestamp, and you can also check for timer overflow in the ISR to count the miss rate statistically and work out how to recover from it.
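On a Cortex-M4 that counter is DWT->CYCCNT; a minimal sketch of enabling and reading it (register and mask names are the standard CMSIS ones from core_cm4.h):

#include "stm32l4xx.h"  // CMSIS device header

static void cycle_counter_init(void) {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the DWT unit
    DWT->CYCCNT = 0;                                 // reset the counter
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             // start counting CPU cycles
}

// e.g. first thing in the ISR:
// uint32_t entry_cycles = DWT->CYCCNT;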
I think all this sums up to WCET (worst-case execution time) analysis: https://www.coursera.org/lecture/real-time-embedded-theory-analysis/methods-to-determine-worst-case-execution-time-wcet-Y1Jbd
2022-02-16 01:08 AM
If I get rid of __WFI() then it works fine if I add 21 (but not any lower) in the last line. However, if I modify my code and store TIM17->CNT into a global variable (not a stack variable) first thing in the interrupt routine, it is 8 more than TIM17->CCR1, meaning it takes 8 x 4 = 32 CPU clock cycles until the first line of code is executed.
2022-02-16 01:11 AM
Yes, e.g. when I add 21 in the last line, I can see a pulse on my oscilloscope approximately every 5.246 microseconds, which is very close to the theoretical value (21 x 250 nanoseconds = 5.250 microseconds).
2022-02-16 01:17 AM
There are no other interrupts. The code is just test code to measure performance, my end goal is to implement this: https://en.wikipedia.org/wiki/UNI/O
2022-02-16 03:20 AM
ST have examples of how to use DMA + a timer to emulate UART or SPI where the CPU only needs to intervene once every byte - the bit-pattern for transmit being set up in succeeding words in memory that are then DMA'd to GPIO->BSRR. Maybe it's worth reading up on those.
I seem to remember they were pretty clever in using capture on the timer to make a note of when an incoming edge happened, so it didn't matter what the interrupt latency was, as long as it was sufficiently small not to miss any edges.
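The transmit half of that technique looks roughly like this (a hedged sketch, not ST's actual example: the DMA channel choice, the CSELR request mapping for TIM17_UP, and the pattern contents are assumptions to verify against the request table in RM0351):

#include "stm32l4xx.h"

// Precomputed BSRR words, one per bit edge (contents are illustrative).
static const uint32_t pattern[] = {
    0x00000400,  // PC10 high
    0x04000000,  // PC10 low
};

static void start_pattern_tx(void) {
    DMA1_Channel1->CCR   = 0;                        // disable while configuring
    DMA1_Channel1->CPAR  = (uint32_t)&GPIOC->BSRR;   // destination: set/reset register
    DMA1_Channel1->CMAR  = (uint32_t)pattern;        // source: pattern table
    DMA1_Channel1->CNDTR = sizeof pattern / sizeof pattern[0];
    DMA1_Channel1->CCR   = DMA_CCR_MINC | DMA_CCR_DIR        // memory -> peripheral
                         | DMA_CCR_MSIZE_1 | DMA_CCR_PSIZE_1 // 32-bit transfers
                         | DMA_CCR_EN;
    TIM17->DIER |= TIM_DIER_UDE;                     // update event requests DMA
    TIM17->CR1  |= TIM_CR1_CEN;                      // the timer paces the transfers
}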
Hope this helps,
Danish
2022-02-16 04:14 AM
My quick guess is that the hard part is knowing when to make the ACK or NACK decision as a slave. Otherwise, use HW assist as much as possible, for example use the USARTs in clock mode and feed their clock to an SPI at double the bit size. The 10 or 01 transition becomes 2 bits. Then short MISO and MOSI and control the output enable... this is just a quick, superficial thought...