2025-07-16 12:41 AM
Hello friends : )
I've recently tested the NUCLEO-G491RE board for an upcoming redesign.
The current design relies on quite tight interrupt latency, so that is where I began.
As I understand it, with no FPU usage in the ISR (ASPEN = LSPEN = 0) one could expect up to 12 SYSCLK of latency.
In this case 75 ns at 160 MHz, which sounds pretty decent (today it's about 73 ns).
I started off by creating two very similar interrupt service routines that I intended to toggle between.
Each would pulse an output pin and trigger the other one:
Attributes: 'interrupt' + 'optimize("-O2")' + 'section(".RamFunc")' + 'aligned(8)' + 'naked'
static void onTimeStampEvent(void)
{
GPIOA->BSRR = 1<<12;
NVIC->ISPR[1] = 1<<(39-1*32); // USART3
GPIOA->BRR = 1<<12;
#ifdef NAKED
__ASM volatile ("BX LR":::);
#endif
return;
}
static void onReceiveEvent(void)
{
GPIOB->BSRR = 1<<14;
NVIC->ISPR[0] = 1<<(11-0*32); // DMA1_CH1
GPIOB->BRR = 1<<14;
#ifdef NAKED
__ASM volatile ("BX LR":::);
#endif
return;
}
In main the usual suspects:
Finally, the main loop:
u = 0;
while(1)
{
NVIC->ISPR[0] = 1<<(11-0*32); // DMA1_CH1
u++;
}
This works splendidly, sort of...
<PicoScope shot 1>
The total time for a complete round trip is about 381ns or 61 SYSCLK.
Variable "u" in main never changes from 0, suggesting continuous interrupts as expected.
The thing is, shouldn't tail-chaining occur?
Now I tried diversifying the priority levels, yellow being the less important one:
<PicoScope shot 2>
Priority in action, indeed; however, now a round trip takes 562 ns (90 SYSCLK).
Still no tail-chaining, and worse, lots of extra time for the same amount of work.
Where did I go wrong?
Any help appreciated = )
/Hen
2025-07-23 11:42 PM
Try moving stack to CCMRAM.
Fetching the ISR address from the vector table should occur in parallel with register stacking, so they should be in different memories; at the same time, the vector fetch should have less of an impact if it takes longer.
These are very complex SoCs, where cycle counting is cumbersome due to the many elements involved, and the theoretical numbers from the processor's specs alone are in practice usually impossible to reach, again due to the huge influence of the whole SoC. Generally, consider the processor's specs to be just sweet marketing speech.
JW
2025-07-24 12:31 AM
Yes, the vector fetch probably occurs once for every interrupt taken.
And due to the initial multi-register stacking, the above may be less important for latency.
However, wouldn't the stack starve the instruction pipe if it resides in CCMRAM as well?
Regarding cycle counting: yes, there's a lot going on in parallel here, and exact numbers may not exist.
That's not the core of my hacks anyway; I just want a feel for what's feasible, plus comparable measurements.
I think I can get away with about 60 ns of latency spread on The One important event.
Today it's 30 ns. Well, assuming no degraded performance, which could also be a consideration.
Then the unexpected scenario happened: interrupt escalation with a performance penalty.
That's the real bugger, is it not?
2025-07-24 12:51 AM
Latency and its jitter are at least an order of magnitude worse in these SoCs than in the 8-bit micro*controllers*, and so are its controllability and the state of documentation. So, I don't consider interrupts a viable option for timing-sensitive tasks anymore, and always resort to hardware.
JW
2025-07-24 1:43 AM
Yes, so it is.
I do utilize DMAs, as the snippet may reveal.
The competing core is also a *modern* MCU, albeit a bit more recent than the G4.
It is quite up to the task, but for 'platformics' we need to move here.
In the near future we aim at the H7+.
2025-07-24 7:23 AM - edited 2025-07-24 7:23 AM
> In the near future we aim at the H7+
The interrupt latency and overall timing controllability of the Cortex-M7 are of course worse than in the Cortex-M4.
That's the price we pay for raw speed.
JW
2025-07-26 10:04 PM
Indeed, the M7 core is even more suitable as a multi-tasking core.
The better option is of course the asymmetric dual-core H755.
Some years ago, I did a redesign with this one, over LPC43S67 (tri-core).
2025-07-26 11:17 PM
I just realized this may not work for us in this way?
CCMRAM seems to be the best performance option.
I tried several linker placements for the stack (at the highest address) versus the RAM code, with not-so-obvious results.
The stack goes at the high end of the CCM area and the RAM code at the low end.
1. Both in 0x10000000[0x4000] --- 52ck vs 67ck.
2. Both in 0x20018000[0x4000] --- 72ck vs 86ck.
3. stack in 0x1xxx code in 0x2xxx --- 72ck vs 92ck.
4. stack in 0x2xxx code in 0x1xxx --- 26ck vs 69ck. ???
There is no spread present because this is the only thing the core does.
What happens when all bells and whistles are in place?
Will the spread (for the most prominent interrupt) be more than 60 ns (10 SYSCLK)?
I understand the tests do not say squat about this, but they give me worries...
2025-07-30 12:44 AM
I finally got some in-system test results: just above 30 ns of spread.
That is excellent; however, the full implementation may knock that smile off later.
Soon the holiday season ends, and we may start this endeavor.
Thank you for all inputs.
2025-08-02 11:28 PM
Continuing to find answers...
I think I measured or registered something else in case 4 above, in my post from July 26th.
I cannot reproduce it and now get --- 52ck vs 69ck, which removes the "???".
4. stack in 0x2xxx code in 0x1xxx --- 52ck vs 69ck.
I swapped the priorities, i.e. letting the less important task take the lead.
5. stack in 0x2xxx code in 0x1xxx --- 52ck vs 72ck.
That would mean different totals depending on the source.
It's still a hefty penalty to have escalating interrupts compared to not.
Well, it depends on perspective, I suppose, but 14 to 20 SYSCLK is notable in my book.
BTW, somewhere in this rabbit hole I was not thinking clearly when talking about tail-chaining.
That cannot happen in the above examples; only waiting (pending) interrupts can. Sorry for that.