Why is there so much latency in my interrupt?

arnold_w · ‎2022-02-15

I am working with the STM32 Nucleo-64 development board, which has an STM32L476R microcontroller. My SystemCoreClock is 16 MHz and TIM17 is clocked at 4 MHz. To my surprise, the code below only works well (the timer doesn't miss the next interrupt and wrap around) if I increment CCR1 by at least 23:

#pragma GCC push_options
#pragma GCC optimize ("O3")
 
void TIM1_TRG_COM_TIM17_IRQHandler(void) {
    GPIOC->BSRR = 0x00000400;  // Set test pin PC10 high
    GPIOC->BSRR = 0x04000000;  // Set test pin PC10 low
    TIM17->SR = 0;             // Clear interrupt flags
    TIM17->CCR1 += 23;         // OK
//    TIM17->CCR1 += 22;         // Not ok
}
 
#pragma GCC pop_options
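
One possible reading of "misses the next interrupt and wraps around" (an assumption, the post does not spell it out): by the time the handler writes the new CCR1, CNT has already reached or passed it, so the next match only happens after the 16-bit counter has gone all the way round. A minimal host-side sketch of that check follows; `compare_missed` and its parameter names are hypothetical, not part of any ST library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when a compare value that was written late
 * has already been reached or passed by the free-running 16-bit counter.
 *   prev_ccr: compare value that triggered the current interrupt
 *   new_ccr:  value just written to CCR1 (prev_ccr + increment)
 *   cnt_now:  counter value sampled right after the write
 * Unsigned 16-bit subtraction keeps the comparison valid across rollover,
 * as long as less than one full counter period has elapsed. */
static bool compare_missed(uint16_t prev_ccr, uint16_t new_ccr, uint16_t cnt_now)
{
    uint16_t increment = (uint16_t)(new_ccr - prev_ccr);
    uint16_t elapsed   = (uint16_t)(cnt_now - prev_ccr);

    assert(increment != 0); /* an increment of 0 makes no sense here */
    return elapsed >= increment;
}
```

On the target, `prev_ccr` would be CCR1 before the `+=`, and `cnt_now` would be TIM17->CNT read after the write. Treating "equal" as missed is a conservative choice, not something the thread confirms.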

Now, 23 timer ticks correspond to 23 x 4 = 92 CPU clock cycles, and it seems unlikely that the 4 lines of code would take 92 cycles. When I store the TIM17->CNT value in a global variable first thing in the interrupt routine above, I can see that TIM17->CNT is 8 (!) more than TIM17->CCR1, meaning it took roughly 8 x 4 = 32 CPU clock cycles just to enter the interrupt routine! I tried to put the interrupt vector in RAM, but that made it worse! What am I doing wrong? Why is there so much latency in my interrupt?
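
The tick-to-cycle arithmetic above can be written out as a small host-side sketch. The function names are made up for illustration; the clock values are the ones from the post (16 MHz CPU, 4 MHz timer).

```c
#include <assert.h>
#include <stdint.h>

/* Convert timer ticks to CPU clock cycles, assuming the CPU clock is an
 * integer multiple of the timer clock (16 MHz / 4 MHz = 4 in the post). */
static uint32_t ticks_to_cpu_cycles(uint32_t ticks, uint32_t cpu_hz, uint32_t timer_hz)
{
    assert(timer_hz != 0 && cpu_hz % timer_hz == 0);
    return ticks * (cpu_hz / timer_hz);
}

/* Convert CPU cycles to nanoseconds at the given CPU clock. */
static uint32_t cpu_cycles_to_ns(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz != 0);
    return (uint32_t)(((uint64_t)cycles * 1000000000u) / cpu_hz);
}
```

With these, 23 ticks come out as 92 cycles (5750 ns at 16 MHz) and the 8-tick entry delay as 32 cycles (2000 ns).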

waclawek.jan · ‎2022-02-16

> I don't make a function call inside the interrupt routine, timerCntValue will be assigned the threshold TIM17->CCR1 plus 5 timer ticks.

So 20 system clocks. That's consistent with 12 system clocks of ISR latency + 2 system clocks to load r3 and r1 from FLASH + (0..7) + 8 system (AHB) clocks to get through the AHB/APB bridge with APB running at AHB/8 in order to load r0 from TIM_CNT. Give or take a couple of clocks, for I don't know exactly how things are synchronized.
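
The budget above can be added up mechanically. This is only a sketch of one reading of the post: the split into a 0..(div-1) synchronisation wait plus one full APB clock for the transfer, and the generalisation to other APB dividers, are assumptions. The names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct cycle_range {
    uint32_t min; /* best case, in system (AHB) clocks */
    uint32_t max; /* worst case, in system (AHB) clocks */
};

/* Estimated clocks from interrupt request to the TIM CNT read completing.
 * apb_div is the AHB-to-APB divider (8 in the post). */
static struct cycle_range entry_to_cnt_read_budget(uint32_t apb_div)
{
    const uint32_t isr_entry     = 12; /* ISR latency quoted in the post */
    const uint32_t literal_loads = 2;  /* loading r3 and r1 from FLASH */
    struct cycle_range r;

    assert(apb_div >= 1);
    /* Assumed: wait 0..(apb_div - 1) clocks for the next APB edge,
     * then one APB clock (apb_div system clocks) for the transfer. */
    r.min = isr_entry + literal_loads + 0 + apb_div;
    r.max = isr_entry + literal_loads + (apb_div - 1) + apb_div;
    return r;
}
```

For a divider of 8 this gives 22 to 29 system clocks, against the observed 20; the post already allows for a couple of clocks of uncertainty in how the synchronisation works.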

JW


Danish1 · ‎2022-02-15

You're using the 'L476 with Floating-point unit (FPU). One possible reason for choosing L476 is you use FPU in your main code. Please confirm if you're using FPU.

The reason I say this is that Cortex M4 has many more registers to put onto the stack (the entire FPU state) if you're using the FPU. Unless, that is, you allow "lazy state preservation for floating-point context".

This sort of thing is covered in the Cortex M4 Programming Manual PM0214. My copy is Revision 7, and Figure 12 on p43 shows some 25 registers needing to be put onto the stack, each taking a processor cycle.

To enable / disable lazy state-preservation you need to look at the ASPEN and LSPEN bits of Floating-point context control register FPCCR - section 4.6.2 of PM0214.
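
As a rough illustration of what those two bits do, here is a host-side sketch that works on a plain copy of the register value. ASPEN is bit 31 and LSPEN is bit 30 of FPCCR; CMSIS exposes them as FPU_FPCCR_ASPEN_Msk and FPU_FPCCR_LSPEN_Msk. The local macro, enum and function names below are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Local, illustrative names for the FPCCR control bits. */
#define FPCCR_ASPEN (1u << 31) /* automatic FP state preservation */
#define FPCCR_LSPEN (1u << 30) /* lazy FP state preservation */

enum fp_stacking {
    FP_STACKING_OFF,  /* never stack FP registers on exception entry */
    FP_STACKING_LAZY, /* reserve stack space, save only if the ISR uses the FPU */
    FP_STACKING_EAGER /* always save the FP registers on entry */
};

/* Return fpccr with ASPEN/LSPEN set for the requested mode; other bits are kept. */
static uint32_t fpccr_set_stacking(uint32_t fpccr, enum fp_stacking mode)
{
    fpccr &= ~(FPCCR_ASPEN | FPCCR_LSPEN);
    switch (mode) {
    case FP_STACKING_OFF:
        break;
    case FP_STACKING_LAZY:
        fpccr |= FPCCR_ASPEN | FPCCR_LSPEN;
        break;
    case FP_STACKING_EAGER:
        fpccr |= FPCCR_ASPEN;
        break;
    default:
        assert(0 && "unknown stacking mode");
    }
    return fpccr;
}
```

On the target this would be used along the lines of `FPU->FPCCR = fpccr_set_stacking(FPU->FPCCR, FP_STACKING_LAZY);`. Turning stacking off is only safe if no interrupt handler touches the FPU.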

That's one possible reason for slow exception entry.

Danish

KnarfB · ‎2022-02-15

Also check that you have 0-wait-state memory configured (flash). Access to GPIO and other peripherals also takes more than 1 cycle.

For lower impact, try

__asm__ volatile ("sev": : :"memory");

instead of toggling a GPIO. The SEV signal can be observed on pins configured as EVENTOUT type.

You may also observe SysClock on a MCO pin and a timer channel generated output to compare.

hth

KnarfB

arnold_w · ‎2022-02-15

No, I'm not doing any mathematics in my code and I tried changing the following in STM32CubeIDE 1.6.1 (Properties->C/C++ Build->Settings->MCU Settings), but it made no difference:

Floating-point Unit: "FPv4-SP-D16" -> "None"

Floating-point ABI: "Hardware implementation (-mfloat-abi=hard)" -> "Software implementation (-mfloat-abi=soft)"

Runtime library: "Reduced C (--specs=nano.specs)" -> "Standard C"

arnold_w · ‎2022-02-15

The pin toggling is just for testing, the goal is to implement interrupt-driven UNI/O protocol. I let STM32CubeMX create the project for me and I've confirmed that the LATENCY-bits of the FLASH->ACR register are all 0.
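
For reference, the LATENCY check can be sketched as below. It assumes the LATENCY field sits in bits [2:0] of FLASH->ACR, as on STM32L4 parts (check the reference manual for other devices); the function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed position of the LATENCY field in FLASH->ACR (bits [2:0]). */
#define ACR_LATENCY_MASK 0x7u

/* Extract the configured number of flash wait states from an ACR value. */
static uint32_t flash_wait_states(uint32_t acr)
{
    return acr & ACR_LATENCY_MASK;
}

/* Start-up sanity check: trap in debug builds if flash is not zero-wait. */
static void require_zero_wait_flash(uint32_t acr)
{
    assert(flash_wait_states(acr) == 0);
}
```

On the target one would pass in the live register, e.g. `require_zero_wait_flash(FLASH->ACR);`.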

Tesla DeLorean · ‎2022-02-15

Look at the generated code, or write in assembler.

The context save is apt to take 12 cycles, then whatever you push and perhaps 4 cycles to the APB targets. The volatile flagging of the peripheral registers may also confound optimization.


arnold_w · ‎2022-02-15

void TIM1_TRG_COM_TIM17_IRQHandler(void) {
    GPIOC->BSRR = 0x00000400;  // Set test pin PC10 high
 802ef04:	4a07      	ldr	r2, [pc, #28]	; (802ef24 <TIM1_TRG_COM_TIM17_IRQHandler+0x20>)
    GPIOC->BSRR = 0x04000000;  // Set test pin PC10 low
    TIM17->SR = 0;             // Clear interrupt flags
 802ef06:	4b08      	ldr	r3, [pc, #32]	; (802ef28 <TIM1_TRG_COM_TIM17_IRQHandler+0x24>)
void TIM1_TRG_COM_TIM17_IRQHandler(void) {
 802ef08:	b410      	push	{r4}
    GPIOC->BSRR = 0x04000000;  // Set test pin PC10 low
 802ef0a:	f04f 6080 	mov.w	r0, #67108864	; 0x4000000
    GPIOC->BSRR = 0x00000400;  // Set test pin PC10 high
 802ef0e:	f44f 6480 	mov.w	r4, #1024	; 0x400
    TIM17->SR = 0;             // Clear interrupt flags
 802ef12:	2100      	movs	r1, #0
    GPIOC->BSRR = 0x00000400;  // Set test pin PC10 high
 802ef14:	6194      	str	r4, [r2, #24]
    GPIOC->BSRR = 0x04000000;  // Set test pin PC10 low
 802ef16:	6190      	str	r0, [r2, #24]
    TIM17->SR = 0;             // Clear interrupt flags
 802ef18:	6119      	str	r1, [r3, #16]
    TIM17->CCR1 += 23;         // OK
 802ef1a:	6b5a      	ldr	r2, [r3, #52]	; 0x34
//    TIM17->CCR1 += 22;         // Not ok
}
 802ef1c:	bc10      	pop	{r4}
    TIM17->CCR1 += 23;         // OK
 802ef1e:	3217      	adds	r2, #23
 802ef20:	635a      	str	r2, [r3, #52]	; 0x34
}
 802ef22:	4770      	bx	lr
 802ef24:	48000800 	.word	0x48000800	; literal pool: GPIOC base address
 802ef28:	40014800 	.word	0x40014800	; literal pool: TIM17 base address

KnarfB · ‎2022-02-15

Here is an STM32L432 running at 10 MHz with a little instrumentation:

3rd row: MCO out @ 10 MHz

2nd row: TIM1 CH4 used as PWM with pulse=20. IRQ triggered w.r.t. falling edge

1st row: EVENTOUT signal

The EVENTOUT signal is triggered high in larger chunks (left, right) in the main loop, only low when the loop branch occurs:

while (1)
  {
	  __asm__ volatile ("sev": : :"memory");
	  __asm__ volatile ("sev": : :"memory");
	  __asm__ volatile ("sev": : :"memory");
	  __asm__ volatile ("sev": : :"memory");
	  __asm__ volatile ("sev": : :"memory");
	  __asm__ volatile ("sev": : :"memory");
	  __asm__ volatile ("sev": : :"memory");
	  __asm__ volatile ("sev": : :"memory");
	  __asm__ volatile ("sev": : :"memory");
	  __asm__ volatile ("sev": : :"memory");
 
    /* USER CODE END WHILE */
 
    /* USER CODE BEGIN 3 */
  }

and also triggered in the IRQ handler (2 short pulses in 1st row):

void TIM1_CC_IRQHandler(void)
{
  /* USER CODE BEGIN TIM1_CC_IRQn 0 */
 
	  __asm__ volatile ("sev": : :"memory");
	  __HAL_TIM_CLEAR_IT(&htim1, TIM_IT_CC4);
	  __asm__ volatile ("sev": : :"memory");
	  return; // do not use HAL Handler

This adds up to 38 clock cycles, 2 of which could be saved (sev from IRQ).

Why don't you increase MCU clock?

hth

KnarfB

waclawek.jan · ‎2022-02-15

As Clive said, 12 cycles is the hardware stacking upon ISR entry alone - and that is unless FPU is on and lazy-stacking is off (but that combination shouldn't happen without some effort). And then there's destacking upon ISR exit, and that takes maybe around 10 cycles, too (again without the FP registers); so that's your baseline ISR duration without a single instruction to execute. Tail-chaining and late-arrival ISRs are smart tweaks but don't help with the baseline.
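
To put that baseline in timer terms, here is a rough host-side sketch. It assumes the shortest usable CCR1 increment is simply the total ISR cost in CPU cycles, rounded up to whole timer ticks; that is a simplification, and the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of timer ticks that covers isr_cycles CPU cycles,
 * given how many CPU cycles make up one timer tick (4 in this thread). */
static uint32_t min_ccr_increment(uint32_t isr_cycles, uint32_t cpu_cycles_per_tick)
{
    assert(cpu_cycles_per_tick != 0);
    /* Integer ceiling division. */
    return (isr_cycles + cpu_cycles_per_tick - 1) / cpu_cycles_per_tick;
}
```

With roughly 12 cycles of entry and 10 cycles of exit, the baseline of 22 cycles already costs 6 timer ticks before a single instruction of the handler runs.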

Interrupt latency also includes the length of the longest uninterruptible instruction in the interrupted code, as Danish remarked above. That shouldn't last too long, as some of the longer instructions are interruptible/restartable; I am not looking up the details.

All this is assuming no system latency, i.e. no-waitstate memories. System latencies only make things worse. Fetching through the S-port of the processor (i.e. from RAM mapped above 0x2000'0000) makes things worse (S-port reads, including fetches, imply a 1-cycle penalty because of the typical heavy loading of the S-port). Conflicts between code fetches and data accesses (again the case of the S-port) make things worse.

Also note that accessing registers in the timer (i.e. through the /4 APB bus) imposes delays related both to the slow bus and to the need to sync between the fast and slow bus. These may add up to surprising lengths (see my post with scope screenshot and its explanation, below "More answers").

So, in short, this is not your friendly 8-bitter with easily predictable timing anymore. Generally, the tricks employed to increase raw clock frequency work against latency and jitter, so the resulting "realtime" performance won't really change that much from the low-10MHz 8-bitters.

> the code below only works well (the timer doesn't miss the next interrupt and wraps around)

What do you mean by "wraps around"? Repeated ISR calls because of the late interrupt-source clear?

JW

arnold_w · ‎2022-02-15

The ASPEN and LSPEN bits of the FPU->FPCCR register were both 1 (=enabled), but setting them to 0 didn't make a difference either.