2022-02-15 02:28 AM
I am working with the STM32 Nucleo-64 development board, which has an STM32L476R microcontroller. My SystemCoreClock is 16 MHz and TIM17 is clocked at 4 MHz. To my surprise, the code below only works well (the timer doesn't miss the next interrupt and wraps around) if I increment by at least 23:
#pragma GCC push_options
#pragma GCC optimize ("O3")
void TIM1_TRG_COM_TIM17_IRQHandler(void) {
    GPIOC->BSRR = 0x00000400; // Set test pin PC10 high
    GPIOC->BSRR = 0x04000000; // Set test pin PC10 low
    TIM17->SR = 0; // Clear interrupt flags
    TIM17->CCR1 += 23; // OK
    // TIM17->CCR1 += 22; // Not ok
}
#pragma GCC pop_options
Now, 23 timer ticks correspond to 23 x 4 = 92 CPU clock cycles, and it seems unlikely that the 4 lines of code take 92 cycles to execute. When I store the TIM17->CNT value in a global variable as the first thing in the interrupt routine above, I can see that TIM17->CNT is 8 (!) more than TIM17->CCR1, meaning it took roughly 8 x 4 = 32 CPU clock cycles just to enter the interrupt routine! I tried putting the interrupt vector in RAM, but that made it worse! What am I doing wrong? Why is there so much latency in my interrupt?
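The tick-to-cycle arithmetic above can be written as a small host-side helper. This is only an illustrative sketch, not code from the post: the function name is made up, and it assumes the CPU clock is an integer multiple of the timer clock (16 MHz and 4 MHz here).

```c
#include <assert.h>
#include <stdint.h>

/* CPU cycles that elapse during `ticks` timer ticks.
 * Hypothetical helper; assumes cpu_hz is an integer multiple of timer_hz. */
static inline uint32_t timer_ticks_to_cpu_cycles(uint32_t ticks,
                                                 uint32_t cpu_hz,
                                                 uint32_t timer_hz)
{
    assert(timer_hz != 0 && cpu_hz % timer_hz == 0);
    return ticks * (cpu_hz / timer_hz);
}
```

With the clocks from the post, 23 ticks give 92 cycles and the 8-tick difference between CNT and CCR1 gives 32 cycles.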
2022-02-16 08:45 AM
> I don't make a function call inside the interrupt routine, timerCntValue will be assigned the threshold
> TIM17->CCR1 plus 5 timer ticks.
So 20 system clocks. That's consistent with 12 system clocks of ISR latency + 2 system clocks to load r3 and r1 from FLASH + (0..7) + 8 system (AHB) clocks to get through the AHB/APB bridge with APB running at AHB/8 in order to load r0 from TIM_CNT. Give or take a couple of clocks, for I don't know exactly how things are synchronized.
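The budget above can be tallied in a few lines. This is a rough sketch that only restates the numbers quoted in this reply; the constant and function names are invented for illustration, and the "(0..7)" term is simply treated as a range from 0 to 7 cycles.

```c
#include <assert.h>
#include <stdint.h>

/* Cycle estimates quoted in the reply above, in system (AHB) clocks. */
enum {
    ISR_ENTRY_CYCLES    = 12, /* exception entry latency */
    LITERAL_LOAD_CYCLES = 2,  /* loading r3 and r1 from flash */
    SYNC_MAX_CYCLES     = 7,  /* the "(0..7)" term; its minimum is 0 */
    APB_BRIDGE_CYCLES   = 8   /* AHB/APB bridge with APB at AHB/8 */
};

typedef struct {
    uint32_t min_cycles;
    uint32_t max_cycles;
} cycle_range;

/* Estimated delay from the compare match to the read of TIM_CNT. */
static inline cycle_range cnt_read_latency(void)
{
    uint32_t fixed = ISR_ENTRY_CYCLES + LITERAL_LOAD_CYCLES + APB_BRIDGE_CYCLES;
    cycle_range r = { fixed, fixed + SYNC_MAX_CYCLES };
    return r;
}

/* Whole timer ticks contained in a number of CPU cycles. */
static inline uint32_t whole_ticks(uint32_t cycles, uint32_t cycles_per_tick)
{
    assert(cycles_per_tick != 0);
    return cycles / cycles_per_tick;
}
```

The estimate spans 22 to 29 system clocks, which is 5 to 7 whole timer ticks at 4 system clocks per tick, so the observed 5 ticks sits at the low end of the range.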
JW
2022-02-15 03:01 AM
You're using the 'L476, which has a floating-point unit (FPU). One possible reason for choosing the L476 is that you use the FPU in your main code. Please confirm whether you're using the FPU.
The reason I say this is that Cortex M4 has many more registers to put onto the stack (the entire FPU state) if you're using the FPU. Unless, that is, you allow "lazy state preservation for floating-point context".
This sort of thing is covered in the Cortex M4 Programming Manual PM0214. My copy is Revision 7, and Figure 12 on p43 shows some 25 registers needing to be put onto the stack, each taking a processor cycle.
To enable / disable lazy state-preservation you need to look at the ASPEN and LSPEN bits of Floating-point context control register FPCCR - section 4.6.2 of PM0214.
That's one possible reason for slow exception entry.
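As a rough illustration of how the two bits interact, the sketch below decodes an FPCCR value on the host. The bit positions (ASPEN = bit 31, LSPEN = bit 30) are the architectural ones; the type and function names are made up for this example, and the model ignores everything except how many registers are written at exception entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FPCCR_ASPEN (1u << 31) /* automatic FP state preservation enable */
#define FPCCR_LSPEN (1u << 30) /* lazy FP state preservation enable */

enum {
    BASIC_FRAME_REGS = 8,  /* r0-r3, r12, lr, pc, xPSR */
    FP_FRAME_REGS    = 17  /* s0-s15 and FPSCR */
};
static_assert(BASIC_FRAME_REGS + FP_FRAME_REGS == 25, "25 registers in total");

typedef enum { FP_STACKING_NONE, FP_STACKING_LAZY, FP_STACKING_EAGER } fp_stacking_mode;

/* Classify the stacking behaviour selected by an FPCCR value. */
static inline fp_stacking_mode fp_stacking_from_fpccr(uint32_t fpccr)
{
    if (!(fpccr & FPCCR_ASPEN)) {
        return FP_STACKING_NONE;
    }
    return (fpccr & FPCCR_LSPEN) ? FP_STACKING_LAZY : FP_STACKING_EAGER;
}

/* Registers actually written to the stack on exception entry.
 * fp_context_active models CONTROL.FPCA: set once FP code has run. */
static inline uint32_t regs_written_on_entry(fp_stacking_mode mode, bool fp_context_active)
{
    if (mode == FP_STACKING_EAGER && fp_context_active) {
        return BASIC_FRAME_REGS + FP_FRAME_REGS;
    }
    /* Lazy mode only reserves space for the FP registers at entry. */
    return BASIC_FRAME_REGS;
}
```

On the target, the value would be read from the FPCCR register itself. In this model only the combination of an active FP context with ASPEN set and LSPEN clear pays for the full 25 registers at entry.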
Danish
2022-02-15 03:16 AM
Also check that the flash is configured for 0 wait states. Access to GPIO and other peripherals also takes more than 1 cycle.
For lower impact, try
__asm__ volatile ("sev": : :"memory");
instead of toggling a GPIO. The SEV signal can be observed on pins configured as EVENTOUT type.
You may also observe SysClock on a MCO pin and a timer channel generated output to compare.
hth
KnarfB
2022-02-15 03:35 AM
No, I'm not doing any mathematics in my code and I tried changing the following in STM32CubeIDE 1.6.1 (Properties->C/C++ Build->Settings->MCU Settings), but it made no difference:
Floating-point Unit: "FPv4-SP-D16" -> "None"
Floating-point ABI: "Hardware implementation (-mfloat-abi=hard)" -> "Software implementation (-mfloat-abi=soft)"
Runtime library: "Reduced C (--specs=nano.specs)" -> "Standard C"
2022-02-15 03:44 AM
The pin toggling is just for testing; the goal is to implement an interrupt-driven UNI/O protocol. I let STM32CubeMX create the project for me and I've confirmed that the LATENCY bits of the FLASH->ACR register are all 0.
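The LATENCY check described above amounts to masking the low three bits of the register value. The sketch below does this on a plain integer so it can run on a host; the macro and function names are made up here, and the "1 + wait states" cost is a simplification that ignores the instruction cache and prefetch.

```c
#include <assert.h>
#include <stdint.h>

#define ACR_LATENCY_MASK 0x7u /* LATENCY[2:0] field of FLASH->ACR on STM32L4 */

/* Number of flash wait states encoded in a FLASH->ACR value. */
static inline uint32_t flash_wait_states(uint32_t flash_acr)
{
    return flash_acr & ACR_LATENCY_MASK;
}

/* Simplified cost of one uncached flash access, in CPU cycles. */
static inline uint32_t flash_access_cycles(uint32_t wait_states)
{
    assert(wait_states <= ACR_LATENCY_MASK);
    return 1u + wait_states;
}
```

A value such as 0x00000600 (caches enabled, LATENCY = 0) decodes to 0 wait states, which matches what the post reports.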
2022-02-15 04:15 AM
Look at the generated code, or write in assembler.
The context save is apt to take 12 cycles, then whatever you push and perhaps 4 cycles to the APB targets. The volatile flagging of the peripheral registers may also confound optimization.
2022-02-15 04:22 AM
void TIM1_TRG_COM_TIM17_IRQHandler(void) {
GPIOC->BSRR = 0x00000400; // Set test pin PC10 high
802ef04: 4a07 ldr r2, [pc, #28] ; (802ef24 <TIM1_TRG_COM_TIM17_IRQHandler+0x20>)
GPIOC->BSRR = 0x04000000; // Set test pin PC10 low
TIM17->SR = 0; // Clear interrupt flags
802ef06: 4b08 ldr r3, [pc, #32] ; (802ef28 <TIM1_TRG_COM_TIM17_IRQHandler+0x24>)
void TIM1_TRG_COM_TIM17_IRQHandler(void) {
802ef08: b410 push {r4}
GPIOC->BSRR = 0x04000000; // Set test pin PC10 low
802ef0a: f04f 6080 mov.w r0, #67108864 ; 0x4000000
GPIOC->BSRR = 0x00000400; // Set test pin PC10 high
802ef0e: f44f 6480 mov.w r4, #1024 ; 0x400
TIM17->SR = 0; // Clear interrupt flags
802ef12: 2100 movs r1, #0
GPIOC->BSRR = 0x00000400; // Set test pin PC10 high
802ef14: 6194 str r4, [r2, #24]
GPIOC->BSRR = 0x04000000; // Set test pin PC10 low
802ef16: 6190 str r0, [r2, #24]
TIM17->SR = 0; // Clear interrupt flags
802ef18: 6119 str r1, [r3, #16]
TIM17->CCR1 += 23; // OK
802ef1a: 6b5a ldr r2, [r3, #52] ; 0x34
// TIM17->CCR1 += 22; // Not ok
}
802ef1c: bc10 pop {r4}
TIM17->CCR1 += 23; // OK
802ef1e: 3217 adds r2, #23
802ef20: 635a str r2, [r3, #52] ; 0x34
}
802ef22: 4770 bx lr
802ef24: 48000800 .word 0x48000800 ; literal pool: GPIOC base address
802ef28: 40014800 .word 0x40014800 ; literal pool: TIM17 base address
2022-02-15 05:37 AM
Here is a STM32L432 running at 10 MHz with a little instrumentation:
3rd row: MCO out @ 10 MHz
2nd row: TIM1 CH4 used as PWM with pulse=20. IRQ triggered w.r.t. falling edge
1st row: EVENTOUT signal
The EVENTOUT signal is triggered high in larger chunks (left, right) in the main loop, only low when the loop branch occurs:
while (1)
{
    __asm__ volatile ("sev": : :"memory");
    __asm__ volatile ("sev": : :"memory");
    __asm__ volatile ("sev": : :"memory");
    __asm__ volatile ("sev": : :"memory");
    __asm__ volatile ("sev": : :"memory");
    __asm__ volatile ("sev": : :"memory");
    __asm__ volatile ("sev": : :"memory");
    __asm__ volatile ("sev": : :"memory");
    __asm__ volatile ("sev": : :"memory");
    __asm__ volatile ("sev": : :"memory");
    /* USER CODE END WHILE */
    /* USER CODE BEGIN 3 */
}
and also triggered in the IRQ handler (2 short pulses in 1st row):
void TIM1_CC_IRQHandler(void)
{
    /* USER CODE BEGIN TIM1_CC_IRQn 0 */
    __asm__ volatile ("sev": : :"memory");
    __HAL_TIM_CLEAR_IT(&htim1, TIM_IT_CC4);
    __asm__ volatile ("sev": : :"memory");
    return; // do not use HAL Handler
This sums up to 38 clock cycles; 2 of them (the sev instructions in the IRQ handler) could be saved.
Why don't you increase the MCU clock?
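To put the 38-cycle figure and the clock question into time units, here is a small host-side conversion. It is a sketch with an invented function name, and it assumes the cycle count stays the same at a higher clock, which is optimistic because faster clocks usually require flash wait states.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of `cycles` clock cycles in nanoseconds, truncated toward zero. */
static inline uint64_t cycles_to_ns(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz != 0);
    return ((uint64_t)cycles * UINT64_C(1000000000)) / clock_hz;
}
```

At 10 MHz, 38 cycles take 3800 ns; the same 38 cycles at 80 MHz (the maximum for these parts) would take 475 ns under the assumption stated above.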
hth
KnarfB
2022-02-15 07:07 AM
As Clive said, 12 cycles is the hardware stacking upon ISR entry alone - and that is unless FPU is on and lazy-stacking is off (but that combination shouldn't happen without some effort). And then there's destacking upon ISR exit, and that takes maybe around 10 cycles, too (again without the FP registers); so that's your baseline ISR duration without a single instruction to execute. Tail-chaining and late-arrival ISRs are smart tweaks but don't help with the baseline.
Interrupt latency also includes the length of the longest uninterruptible instruction in the interrupted code, as Danish remarked above. That shouldn't last too long, as some longer instructions are interruptible/restartable; I am not looking up the details.
All this is assuming no system latency, i.e. no-waitstate memories. System latencies only make things worse. Fetching through the S-port of the processor (i.e. from RAM mapped above 0x2000'0000) makes things worse (S-port reads, including fetches, imply a 1-cycle penalty because of typical heavy loading of the S-port). Conflicts between code fetches and data accesses (again the case of the S-port) make things worse.
Also note that accessing registers in the timer (i.e. through the /4 APB bus) imposes delays related both to the slow bus and to the need to sync between the fast and slow bus. These may add up to surprising lengths (see my post with scope screenshot and its explanation, below "More answers").
So, in short, this is not your friendly 8-bitter with easily predictable timing anymore. Generally, the tricks employed to increase raw clock frequency work against latencies and jitter, so the resulting "realtime" performance won't really change that much from the low-10MHz 8-bitters.
> the code below only works well (the timer doesn't miss the next interrupt and wraps around)
What do you mean by "wraps around"? Repeated ISR calls because of the late interrupt-source clear?
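One possible reading of "wraps around" is that CCR1 gets written with a value the counter has already passed, so the next match only happens after the 16-bit counter has gone all the way round. The sketch below models that with modular arithmetic on the host. It assumes a free-running counter with ARR = 0xFFFF, which the thread does not state, and the function name is invented.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks from the current count until the counter next equals `compare`,
 * for a free-running 16-bit up-counter (ARR = 0xFFFF assumed).
 * A result of 0 means compare == cnt; what the hardware does in that
 * exact case is not modelled here. */
static inline uint32_t ticks_until_match(uint32_t cnt, uint32_t compare)
{
    assert(cnt <= 0xFFFFu && compare <= 0xFFFFu);
    return (compare - cnt) & 0xFFFFu;
}
```

For example, if the previous compare value was 1000 and the CCR1 write only takes effect when the counter has reached 1022, an increment of 23 leaves 1 tick until the next match, while an increment of 20 leaves 65534 ticks, roughly 16 ms at 4 MHz.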
JW
2022-02-15 07:55 AM
The ASPEN and LSPEN bits of the FPU->FPCCR register were both 1 (=enabled), but setting them to 0 didn't make a difference either.