Solved

Why is there so much latency in my interrupt?

February 15, 2022
12 replies
8851 views

I am working with the STM32 Nucleo-64 development board, which has an STM32L476R microcontroller. My SystemCoreClock is 16 MHz and TIM17 is clocked at 4 MHz. To my surprise, the code below only works well (the timer doesn't miss the next interrupt and wrap around) if I increment CCR1 by at least 23:

#pragma GCC push_options
#pragma GCC optimize ("O3")
 
void TIM1_TRG_COM_TIM17_IRQHandler(void) {
 GPIOC->BSRR = 0x00000400; // Set test pin PC10 high
 GPIOC->BSRR = 0x04000000; // Set test pin PC10 low
 TIM17->SR = 0; // Clear interrupt flags
 TIM17->CCR1 += 23; // OK
// TIM17->CCR1 += 22; // Not ok
}
 
#pragma GCC pop_options

Now, 23 timer ticks correspond to 23 x 4 = 92 CPU clock cycles, and it seems unlikely that the 4 lines of code would take 92 cycles to execute. When I store the TIM17->CNT value in a global variable first thing in the interrupt routine above, I can see that TIM17->CNT is 8 (!) more than TIM17->CCR1, meaning it took roughly 8 x 4 = 32 CPU clock cycles just to enter the interrupt routine! I tried to put the interrupt vector in RAM, but that made it worse! What am I doing wrong, why is there so much latency in my interrupt?
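The arithmetic behind this measurement can be written as two small helpers. This is only a sketch in plain C and not code from the thread: the function names are hypothetical, and the factor of 4 CPU cycles per timer tick assumes the 16 MHz core clock and 4 MHz timer clock stated above.

```c
#include <stdint.h>

/* Timer ticks elapsed between the compare match and the moment the ISR
 * sampled the counter. The subtraction is done modulo 2^16, so the result
 * stays correct if the 16-bit counter wrapped between the two events. */
static inline uint16_t isr_entry_latency_ticks(uint16_t cnt_at_entry, uint16_t ccr)
{
    return (uint16_t)(cnt_at_entry - ccr);
}

/* Convert timer ticks to CPU cycles.
 * cpu_cycles_per_tick is 4 for a 16 MHz core and a 4 MHz timer (assumed). */
static inline uint32_t ticks_to_cpu_cycles(uint16_t ticks, uint32_t cpu_cycles_per_tick)
{
    return (uint32_t)ticks * cpu_cycles_per_tick;
}
```

On the target, cnt_at_entry would be the TIM17->CNT value stored at the top of the handler and ccr the TIM17->CCR1 value that triggered it.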

This topic has been closed for replies.

Best answer by waclawek.jan

> I don't make a function call inside the interrupt routine, timerCntValue will be assigned the threshold

> TIM17->CCR1 plus 5 timer ticks.

So 20 system clocks. That's consistent with 12 system clocks of ISR latency + 2 system clocks to load r3 and r1 from FLASH + (0..7) + 8 system (AHB) clocks to get through the AHB/APB bridge with APB running at AHB/8 in order to load r0 from TIM_CNT. Give or take a couple of clocks, for I don't know exactly how things are synchronized.

JW

Danish1

Lead III

You're using the 'L476 with Floating-point unit (FPU). One possible reason for choosing L476 is you use FPU in your main code. Please confirm if you're using FPU.

The reason I say this is that Cortex M4 has many more registers to put onto the stack (the entire FPU state) if you're using the FPU. Unless, that is, you allow "lazy state preservation for floating-point context".

This sort of thing is covered in the Cortex M4 Programming Manual PM0214. My copy is Revision 7, and Figure 12 on p43 shows some 25 registers needing to be put onto the stack, each taking a processor cycle.

To enable / disable lazy state-preservation you need to look at the ASPEN and LSPEN bits of Floating-point context control register FPCCR - section 4.6.2 of PM0214.
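As a rough illustration of the two bits mentioned here, the sketch below computes a new FPCCR value with ASPEN (bit 31) and LSPEN (bit 30) set or cleared. The helper name and macro names are hypothetical; the value is computed separately from the register write so the logic can be checked off-target.

```c
#include <stdbool.h>
#include <stdint.h>

#define FPCCR_ASPEN_BIT (UINT32_C(1) << 31) /* automatic FP state preservation */
#define FPCCR_LSPEN_BIT (UINT32_C(1) << 30) /* lazy FP state preservation */

/* Return the given FPCCR value with ASPEN/LSPEN forced to the requested
 * state; all other bits are left unchanged. */
static inline uint32_t fpccr_with_stacking(uint32_t fpccr, bool aspen, bool lspen)
{
    uint32_t v = fpccr & ~(FPCCR_ASPEN_BIT | FPCCR_LSPEN_BIT);
    if (aspen) {
        v |= FPCCR_ASPEN_BIT;
    }
    if (lspen) {
        v |= FPCCR_LSPEN_BIT;
    }
    return v;
}

/* On the target, with the CMSIS core header included, usage would look like:
 *     FPU->FPCCR = fpccr_with_stacking(FPU->FPCCR, true, true);
 */
```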

That's one possible reason for slow exception entry.

Danish

arnold_wAuthor

Senior II

No, I'm not doing any mathematics in my code and I tried changing the following in STM32CubeIDE 1.6.1 (Properties->C/C++ Build->Settings->MCU Settings), but it made no difference:

Floating-point Unit: "FPv4-SP-D16" -> "None"

Floating-point ABI: "Hardware implementation (-mfloat-abi=hard)" -> "Software implementation (-mfloat-abi=soft)"

Runtime library: "Reduced C (--specs=nano.specs)" -> "Standard C"

KnarfB

Super User

Also check that you have 0-wait-state memory (flash) configured. Access to GPIO and other peripherals also takes more than 1 cycle.

For lower impact, try

__asm__ volatile ("sev": : :"memory");

instead of toggling a GPIO. The SEV signal can be observed on pins configured as EVENTOUT type.

You may also observe SysClock on a MCO pin and a timer channel generated output to compare.

hth

KnarfB

arnold_wAuthor

Senior II

The pin toggling is just for testing, the goal is to implement interrupt-driven UNI/O protocol. I let STM32CubeMX create the project for me and I've confirmed that the LATENCY-bits of the FLASH->ACR register are all 0.
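The wait-state check described here amounts to masking the LATENCY field out of FLASH->ACR. A minimal sketch, assuming the field occupies bits [2:0] as on the STM32L4 family; the helper and macro names are hypothetical:

```c
#include <stdint.h>

/* LATENCY field of FLASH_ACR, assumed to be bits [2:0] (STM32L4). */
#define ACR_LATENCY_FIELD_MASK UINT32_C(0x7)

/* Number of flash wait states encoded in a FLASH_ACR register value. */
static inline uint32_t flash_wait_states(uint32_t flash_acr)
{
    return flash_acr & ACR_LATENCY_FIELD_MASK;
}

/* On the target: flash_wait_states(FLASH->ACR) == 0 means zero wait states. */
```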

KnarfB

Super User

Here is a STM32L432 running at 10 MHz with a little instrumentation:

3rd row: MCO out @ 10 MHz

2nd row: TIM1 CH4 used as PWM with pulse=20. IRQ triggered w.r.t. falling edge

1st row: EVENTOUT signal

The EVENTOUT signal is triggered high in larger chunks (left, right) in the main loop, only low when the loop branch occurs:

while (1)
 {
	 __asm__ volatile ("sev": : :"memory");
	 __asm__ volatile ("sev": : :"memory");
	 __asm__ volatile ("sev": : :"memory");
	 __asm__ volatile ("sev": : :"memory");
	 __asm__ volatile ("sev": : :"memory");
	 __asm__ volatile ("sev": : :"memory");
	 __asm__ volatile ("sev": : :"memory");
	 __asm__ volatile ("sev": : :"memory");
	 __asm__ volatile ("sev": : :"memory");
	 __asm__ volatile ("sev": : :"memory");
 
 /* USER CODE END WHILE */
 
 /* USER CODE BEGIN 3 */
 }

and also triggered in the IRQ handler (2 short pulses in 1st row):

void TIM1_CC_IRQHandler(void)
{
 /* USER CODE BEGIN TIM1_CC_IRQn 0 */
 
	 __asm__ volatile ("sev": : :"memory");
	 __HAL_TIM_CLEAR_IT(&htim1, TIM_IT_CC4);
	 __asm__ volatile ("sev": : :"memory");
	 return; // do not use HAL Handler
}

This sums up to 38 clock cycles, 2 of them could be spared (sev from IRQ)

Why don't you increase MCU clock?

hth

KnarfB

Tesla DeLorean

Guru

Look at the generated code, or write in assembler.

The context save is apt to take 12 cycles, then whatever you push and perhaps 4 cycles to the APB targets. The volatile flagging of the peripheral registers may also confound optimization.

arnold_wAuthor

Senior II

void TIM1_TRG_COM_TIM17_IRQHandler(void) {
 GPIOC->BSRR = 0x00000400; // Set test pin PC10 high
 802ef04:	4a07 	ldr	r2, [pc, #28]	; (802ef24 <TIM1_TRG_COM_TIM17_IRQHandler+0x20>)
 GPIOC->BSRR = 0x04000000; // Set test pin PC10 low
 TIM17->SR = 0; // Clear interrupt flags
 802ef06:	4b08 	ldr	r3, [pc, #32]	; (802ef28 <TIM1_TRG_COM_TIM17_IRQHandler+0x24>)
void TIM1_TRG_COM_TIM17_IRQHandler(void) {
 802ef08:	b410 	push	{r4}
 GPIOC->BSRR = 0x04000000; // Set test pin PC10 low
 802ef0a:	f04f 6080 	mov.w	r0, #67108864	; 0x4000000
 GPIOC->BSRR = 0x00000400; // Set test pin PC10 high
 802ef0e:	f44f 6480 	mov.w	r4, #1024	; 0x400
 TIM17->SR = 0; // Clear interrupt flags
 802ef12:	2100 	movs	r1, #0
 GPIOC->BSRR = 0x00000400; // Set test pin PC10 high
 802ef14:	6194 	str	r4, [r2, #24]
 GPIOC->BSRR = 0x04000000; // Set test pin PC10 low
 802ef16:	6190 	str	r0, [r2, #24]
 TIM17->SR = 0; // Clear interrupt flags
 802ef18:	6119 	str	r1, [r3, #16]
 TIM17->CCR1 += 23; // OK
 802ef1a:	6b5a 	ldr	r2, [r3, #52]	; 0x34
// TIM17->CCR1 += 22; // Not ok
}
 802ef1c:	bc10 	pop	{r4}
 TIM17->CCR1 += 23; // OK
 802ef1e:	3217 	adds	r2, #23
 802ef20:	635a 	str	r2, [r3, #52]	; 0x34
}
 802ef22:	4770 	bx	lr
 802ef24:	48000800 	.word	0x48000800	; literal pool: GPIOC base address
 802ef28:	40014800 	.word	0x40014800	; literal pool: TIM17 base address

waclawek.jan

Super User

As Clive said, 12 cycles is the hardware stacking upon ISR entry alone - and that is unless FPU is on and lazy-stacking is off (but that combination shouldn't happen without some effort). And then there's destacking upon ISR exit, and that takes maybe around 10 cycles, too (again without the FP registers); so that's your baseline ISR duration without a single instruction to execute. Tail-chaining and late-arrival ISRs are smart tweaks but don't help with the baseline.

Interrupt latency includes also the length of longest uninterruptible instruction in the interrupted code, as Danish remarked above. That shouldn't last too long, some longer instructions are interruptible/restartable; I am not looking up the details.

All this is assuming no system latency, i.e. no-waitstate memories. System latencies make things only worse. Fetching through S-port of processor (i.e. from RAM mapped above 0x2000'0000) makes things worse (S-port reads including fetches imply a 1-cycle penalty because of typical heavy loading of the S-port). Conflicts between code fetches and data access (again case of S-port) makes things worse.

Also note that accessing registers in the timer (i.e. through the /4 APB bus) imposes delays related both to the slow bus and to the need to synchronize between the fast and slow bus. These may add up to surprising lengths (see my post with scope screenshot and its explanation, below "More answers").

So, in short, this is not your friendly 8-bitter with easily predictable timing anymore. Generally, the tricks employed to increase raw clock frequency work against latencies and jitter, so the resulting "realtime" performance won't really change that much from the low-10MHz 8-bitters.

> the code below only works well (the timer doesn't miss the next interrupt and wraps around)

What do you mean by "wraps around"? Repeated ISR calls because of the late interrupt-source clear?

JW

arnold_wAuthor

Senior II

"What do you mean by "wraps around"? Repeated ISR calls because of the late interrupt-source clear?"

If I add too small a number (22 or less) in the last line of the interrupt routine, then TIM17->CNT has already passed the new TIM17->CCR1 value, so the timer counts all the way up to 65535, wraps around, and only triggers the interrupt when it reaches TIM17->CCR1 again, which is 65536 timer ticks (= 16.384 ms) too late.
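The wrap-around behaviour described here can be modelled off-target with modulo-2^16 arithmetic. This is a simplified sketch, not a model of the real timer: the names are hypothetical, service_ticks (how far the counter has advanced past the old compare value when the new CCR1 takes effect) is an illustrative parameter, and a compare value equal to the current count is assumed to be already missed.

```c
#include <stdint.h>

/* Ticks from now until a free-running 16-bit counter next equals ccr.
 * A difference of 0 is treated as a full 65536-tick period (assumption:
 * the match for the current count has already passed). */
static inline uint32_t ticks_until_match(uint16_t cnt_now, uint16_t ccr)
{
    uint16_t delta = (uint16_t)(ccr - cnt_now);
    return (delta == 0u) ? UINT32_C(65536) : (uint32_t)delta;
}

/* Ticks between two consecutive compare matches when the ISR does
 * CCR1 += increment and the counter is service_ticks past the old
 * compare value when the new value takes effect. */
static inline uint32_t next_interval_ticks(uint16_t increment, uint16_t service_ticks)
{
    /* Measure relative to the old compare value, i.e. treat it as 0. */
    return (uint32_t)service_ticks + ticks_until_match(service_ticks, increment);
}
```

With an increment larger than the service time the interval equals the increment; otherwise it grows by a full 65536-tick period, which at 250 ns per tick is the 16.384 ms delay mentioned above.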

I tried changing the following STM32CubeIDE 1.6.1 (Properties->C/C++ Build->Settings->MCU Settings), but it made no difference:

Floating-point Unit: "FPv4-SP-D16" -> "None"

Floating-point ABI: "Hardware implementation (-mfloat-abi=hard)" -> "Software implementation (-mfloat-abi=soft)"

Runtime library: "Reduced C (--specs=nano.specs)" -> "Standard C"

The ASPEN and LSPEN bits of the FPU->FPCCR register were both 1 (=enabled), but setting them to 0 didn't make a difference either.

The CPU is doing pretty much nothing when the interrupt occurs, it's just executing __WFI().

arnold_wAuthor

Senior II

If I move my interrupt vector to RAM2 (address 0x10000000) then I can add 22 in the last row and it still works fine.

waclawek.jan

Super User

> If I add a too small number (22 or less) in the last line in the interrupts routine, then TIM17->CNT

> have already surpassed the new TIM17->CCR1 value

~~I don't think you're interpreting this correctly. This would happen only if there would be more than said 4*22 cycles between~~

~~TIM17->CCR1 += 23; // OK~~

~~802ef1a: 6b5a ldr r2, [r3, #52] ; 0x34~~

~~802ef1c: bc10 pop {r4}~~

~~TIM17->CCR1 += 23; // OK~~

~~802ef1e: 3217 adds r2, #23~~

~~802ef20: 635a str r2, [r3, #52] ;~~

~~and that I find unprobable, unless you're running some other interrupt, too.~~

[EDIT] in a bout of stupidity I somehow thought it's CCR1 = CNT + delta [EDIT]

> The CPU is doing pretty much nothing when the interrupt occurs, it's just executing __WFI().

Oh, sleep? But you did not mention that previously!

And which form of it, exactly? That might be a significant game changer; wakeup from various sleep modes is not instantaneous.

Try without it, just a while(1);

JW

arnold_wAuthor

Senior II

If I get rid of __WFI() then it works fine if I add 21 (but not any lower) in the last line. However, if I modify my code and store TIM17->CNT into a global variable (not a stack variable) first thing in the interrupt routine, then it is 8 more than TIM17->CCR1, meaning it takes 8 x 4 = 32 CPU clock cycles until the first line of code is executed.

waclawek.jan

Super User

One more thing,

> SystemCoreClock is 16 MHz and TIM17 is clocked at 4 MHz.

Are you sure? Are you aware of the fact that if APB divider is > 1 then TIM clock is twice the APB frequency?
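The rule referred to here can be sketched as a small helper. The function name is hypothetical, and it assumes the usual STM32 behaviour that the timer kernel clock equals PCLK when the APB prescaler is 1 and twice PCLK otherwise.

```c
#include <stdint.h>

/* Timer kernel clock derived from HCLK and the APB prescaler (1, 2, 4, 8, 16).
 * With prescaler 1 the timer runs at PCLK; otherwise at 2 x PCLK. */
static inline uint32_t apb_timer_clock_hz(uint32_t hclk_hz, uint32_t apb_prescaler)
{
    uint32_t pclk_hz = hclk_hz / apb_prescaler;
    return (apb_prescaler == 1u) ? pclk_hz : 2u * pclk_hz;
}
```

Under this rule, a 4 MHz timer clock with a 16 MHz HCLK would correspond to an APB prescaler of 8 (PCLK = 2 MHz), not 4.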

JW

arnold_wAuthor

Senior II

Yes, e.g. when I add 21 in the last line, I can see that I get a pulse on my oscilloscope approximately every 5.246 microseconds and this is very close to the theoretical number (21 x 250 nanoseconds = 5.250 microseconds).

S.Ma

Principal

Performance also depends on the compiler.

I would toggle the pin instead of producing a short pulse in the ISR, for better visibility.

Are there other interrupts with the same or higher priority, such as SysTick, which could cause the observed side effect?

The core should have a cycle-counter debug hardware register to grab timestamps, and you can also check for timer overflow in the ISR to count the missed-interrupt rate statistically and work out how to recover from it.
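The cycle counter meant here is presumably the DWT cycle counter (DWT->CYCCNT) of the Cortex-M4. A minimal sketch of how two timestamps could be compared follows; the helper name is hypothetical, and the on-target enable sequence is shown only as a comment using the CMSIS core register names.

```c
#include <stdint.h>

/* CPU cycles elapsed between two 32-bit cycle-counter readings.
 * Unsigned subtraction is modulo 2^32, so a single counter wrap between
 * the two readings is handled correctly. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* On the target (CMSIS core header assumed), the counter is enabled once:
 *     CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
 *     DWT->CYCCNT = 0;
 *     DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
 * and then sampled, e.g. at the top of the ISR:
 *     uint32_t t = DWT->CYCCNT;
 */
```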

I think all this sums up to the WCET https://www.coursera.org/lecture/real-time-embedded-theory-analysis/methods-to-determine-worst-case-execution-time-wcet-Y1Jbd

arnold_wAuthor

Senior II

There are no other interrupts. The code is just test code to measure performance, my end goal is to implement this: https://en.wikipedia.org/wiki/UNI/O

Danish1

Lead III

ST have examples of how to use DMA + a timer to emulate UART or SPI where the CPU only needs to intervene once every byte - the bit-pattern for transmit being set up in succeeding words in memory that are then DMA'd to GPIO->BSRR. Maybe it's worth reading up on those.
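To illustrate the idea of a bit pattern laid out as successive words for DMA into GPIO->BSRR, here is a rough sketch. It is not taken from the ST examples: the function names are hypothetical, the bit order (LSB first, as for UART data bits) is an assumption, and start/stop bits or UNI/O Manchester encoding are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* BSRR word that drives one pin high (set bits 0..15) or low
 * (reset bits 16..31) and leaves all other pins untouched. */
static inline uint32_t bsrr_word(bool level, unsigned pin)
{
    return level ? (UINT32_C(1) << pin) : (UINT32_C(1) << (pin + 16u));
}

/* Fill out[0..7] with one BSRR word per data bit of `byte`, LSB first.
 * A timer-triggered DMA channel could then copy one word per bit time
 * to the port's BSRR register. */
static inline void build_bsrr_frame(uint8_t byte, unsigned pin, uint32_t out[8])
{
    for (unsigned i = 0; i < 8u; ++i) {
        out[i] = bsrr_word(((byte >> i) & 1u) != 0u, pin);
    }
}
```

For pin 10 the two possible words are 0x00000400 and 0x04000000, the same constants used for PC10 in the original handler.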

I seem to remember they were pretty clever in using capture on the timer to make a note of when an incoming edge happened, so it didn't matter what the interrupt latency was, as long as it was sufficiently small not to miss any edges.

Hope this helps,

Danish

S.Ma

Principal

My quick guess is that the hard part is knowing when to make the decision to ack or nack as slave. Otherwise, use hardware assistance as much as possible, for example use USARTs in clock mode and feed the clock to an SPI running at double the bit size. The 10 or 01 transition becomes 2 bits. Then short MISO and MOSI and control the output enable... this is just a quick superficial thought...
