My results when examining interrupt latency of a STM32F439

Mr_M_from_G · ‎2018-08-29

Hello,

I just started with a Nucleo F439ZI and did some tests to examine interrupt latency and code execution times. Here are my results and a number of questions:

I set the system to the maximum speed using CubeMX, so I have HCLK and FCLK = 180MHz, APB1/2 Timer clk = 90MHz, APB1/2 periph clk = 45MHz.

I generate a 10 kHz PWM signal with TIM2 and output it on channel 1 on pin PA5. I also made this event trigger an interrupt. The ISR is shown below; it outputs a short pulse on PA6:

void TIM2_IRQHandler (void)
{
  //SetPortBit (port_Testpins, pin_Timer2_isr);
  GPIOA->BSRR=1U<<(6);
  __NOP ();
  //ClearPortBit (port_Testpins, pin_Timer2_isr);
  GPIOA->BSRR=(1U<<16)<<(6);
  ClearBitMask (TIM2->SR, TIM_SR_CC1IF);
}

180MHz is 5,5 nsec per instruction so I first expected the given 12 cycles (eg here http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka16366.html) to result in a time between the falling edge of the PWM signal and the rising edge on PA6 of about 70 nsec. But I got between 194 and 216 nsec, jittering.

Then I found that CubeMX had also set my system to 5 flash wait states. Looking into the reference manual, I found that running the controller at 30 MHz needs no wait states, i.e. 1 CPU cycle per flash read, which is 33 nsec. Running it at 180 MHz needs 5 wait states, so 6 CPU cycles per flash read, which is also 33 nsec.
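The wait-state arithmetic above can be checked with a few lines of C. This is only a sketch of the calculation, not a model of the flash interface; the function name `flash_read_ns` is made up for illustration.

```c
#include <stdint.h>

/* Time for one flash read in nanoseconds, assuming a read takes
 * (wait states + 1) CPU cycles at the given HCLK frequency. */
static double flash_read_ns(uint32_t hclk_hz, uint32_t wait_states)
{
    return (double)(wait_states + 1u) * 1e9 / (double)hclk_hz;
}
```

Calling `flash_read_ns(180000000u, 5u)` and `flash_read_ns(30000000u, 0u)` both give about 33.3 nsec, which matches the observation that the raw flash read time does not improve with the higher clock.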

I also found options instruction prefetch and instruction cache in FLASH->ACR. Enabling both reduces the time to between 144 nsec and 177 nsec.
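As a rough sketch of what these FLASH->ACR options look like at register level: the bit positions below are the ones listed in RM0090 for the F42x/F43x (LATENCY in bits 3:0, PRFTEN bit 8, ICEN bit 9, DCEN bit 10) and should be checked against the reference manual for the exact part. The helper `flash_acr_value` and the `ACR_*` macro names are hypothetical; the CMSIS device header provides equivalents named `FLASH_ACR_PRFTEN`, `FLASH_ACR_ICEN` and `FLASH_ACR_DCEN`.

```c
#include <stdint.h>

/* Assumed FLASH_ACR layout (RM0090, F42x/F43x); verify for your device. */
#define ACR_LATENCY_MASK 0xFu        /* wait states, bits 3:0 */
#define ACR_PRFTEN       (1u << 8)   /* instruction prefetch enable */
#define ACR_ICEN         (1u << 9)   /* instruction cache enable */
#define ACR_DCEN         (1u << 10)  /* data cache enable */

/* Build a new ACR value from the old one, leaving unrelated bits untouched. */
static uint32_t flash_acr_value(uint32_t old, uint32_t wait_states,
                                int prefetch, int icache, int dcache)
{
    uint32_t v = old & ~(ACR_LATENCY_MASK | ACR_PRFTEN | ACR_ICEN | ACR_DCEN);
    v |= wait_states & ACR_LATENCY_MASK;
    if (prefetch) v |= ACR_PRFTEN;
    if (icache)   v |= ACR_ICEN;
    if (dcache)   v |= ACR_DCEN;
    return v;
}
```

On the target this would be used along the lines of `FLASH->ACR = flash_acr_value(FLASH->ACR, 5, 1, 1, 1);`. With 5 wait states and all three options enabled, the affected bits come out as 0x705.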

What I find strange is that commenting out the nop instruction reduces jitter significantly while not decreasing the pulse width.

Changing the compiler optimization level from none to speed, I get a very short pulse without the nop but the same interrupt latency, while with the nop I get a latency of 134 nsec without any jitter.

I would appreciate some background information on these behaviours, especially what instruction prefetch and cache are doing and what is their use in case of an isr being loaded.

Then I would like to ask where precisely the higher clock frequency reduces execution speed when the flash read cycles do not benefit from it, I was very surprised to find this.

Is it an option to copy the ISR to RAM and execute it there? Will this reduce interrupt latency? Is code running in RAM executing faster than from flash? Where can I find information about how to do this, both for ISRs and for normal C functions? I couldn't find anything detailed.

Thanks a lot for any help and information

Martin

waclawek.jan · ‎2018-08-29

> 180MHz is 5,5 nsec per instruction so I first expected the given 12 cycles (eg here

> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka16366.html) to result in a time

> between the rising edge of the PWM signal and the rising edge on PA6 of about 70 nsec.

12 cycles is how long the processor interrupt entry procedure lasts. Look at the disassembly of the ISR to see how many instructions have to be executed until the one which results in the output to change.
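To put numbers on this, the measured times can be converted to CPU cycles and the 12-cycle entry subtracted. A minimal sketch with hypothetical function names; it assumes HCLK = 180 MHz as stated in the first post.

```c
/* Convert a measured time in nanoseconds to CPU cycles at hclk_hz. */
static double ns_to_cycles(double ns, double hclk_hz)
{
    return ns * 1e-9 * hclk_hz;
}

/* Cycles left over once the nominal interrupt entry cycles are subtracted. */
static double cycles_after_entry(double ns, double hclk_hz, double entry_cycles)
{
    return ns_to_cycles(ns, hclk_hz) - entry_cycles;
}
```

For the reported 194 to 216 nsec this gives roughly 35 to 39 cycles in total, i.e. about 23 to 27 cycles on top of the 12-cycle entry. Those extra cycles would presumably be spread over the ISR's first instructions, flash wait states, and the signal paths from the timer to the NVIC and from the GPIO write to the pin.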

> I would appreciate some background information on these behaviours, especially what instruction

> prefetch and cache are doing

Instruction prefetch is a mechanism where the FLASH controller continues to read linearly one more FLASH line even if the previous line has not been consumed by the processor completely.

The ART cache is a jumpcache, storing words at target addresses of the last 64 jumps. If a jump is found to land at one of these entries, they are served from the cache rather than read from FLASH (which would be at the full speed penalty of the waitstates). This speeds up tight loops with moderate number of branching significantly.

> Is code running in RAM executing faster than from flash?

Not necessarily on an 'F4xx (see e.g. https://community.st.com/s/question/0D50X00009Xkh2VSAR/the-cpu-performance-difference-between-running-in-the-flash-and-running-in-the-ram-for-stm32f407 and https://community.st.com/s/question/0D50X00009XkaPCSAZ/stm32f40x-168mhz-wait-states-and-execution-from-ram - the Technical Update mentioned there is long gone, but maybe you could excavate it from the depths of the net with some directed searching. [EDIT] To my surprise I found them as attachments to https://community.st.com/s/question/0D50X00009Xkba4SAB/stm32-technical-updates [/EDIT] ). This is a tricky topic with many factors involved, and many of them depend on the exact definition of "executing faster".

> where precisely the higher clock frequency reduces execution speed when the flash read cycles do not benefit from it

You probably meant "reduces execution time". The higher clock helps when executing long linear code, complex multi-cycle instructions (like division) and short loops (where the jumpcache helps).

Speed comes at the cost of complexity; and almost always it is in the number-crunching meaning, rather than in the low-latency/high-reaction-speed/low-jitter meaning of the word "speed". In most cases the only reasonable way to find out the answer to a particular question is to benchmark the various options (in a knowledgeable way, i.e. using a minimal example but with involvement of all potential obstacles). And, in many cases where an initial calculation with megahertz might indicate optimism, it turns out that it's impossible to achieve the goal purely in software, and some hardware, built-in or external, has to be involved.

I don't think there is a single exhaustive source of information on these issues - you have to do your reading through all available ARM and ST material.

JW

T J · ‎2018-08-29

On the H7, the flash is 256 bits wide, so a single read produces 8x 32-bit instructions. It is designed to accommodate single-cycle flash execution.

STM32H753xI     32-bit Arm® Cortex®-M7 400MHz MCUs, up to 2MB Flash,1MB RAM, 46 com. and analog interfaces, crypto
 
16 Kbytes of instruction cache allowing one cache line to be filled in a single access from the 256-bit embedded Flash memory; frequency up to 400 MHz, MPU, 856 DMIPS / 2.14 DMIPS/MHz  (Dhrystone 2.1), and DSP instructions

Mr_M_from_G · ‎2018-08-30

Hello all,

thanks for your answers.

I did some experiments with just toggling a port pin in an endless loop and found that it is really worthwhile to use prefetch, cache and compiler optimization for speed. Together they made the loop about 4 times faster. But as Jan said, this number is of little general value.

Martin

LMI2 · ‎2018-08-30

How long if you are using C or make your code with Cube MX?

waclawek.jan · ‎2018-08-30

> How long if you are using C or make your code with Cube MX?

I don't understand, please reformulate your question.

JW

Tesla DeLorean · ‎2018-08-30

I think it relates to looking at the code you're generating, because there are lots of loads and stores, and a lot of context push/pop stuff.

As I've mentioned in other threads the ART is quite effective at masking the slowness of the FLASH. Compare this with the F1 design which is crippled by it.

My expectation is that the design uses both phases of the clock, and the cache has optimized paths to deliver words to the prefetch queue within the current cycle.

For predictable execution use SRAM.
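A minimal sketch of how a function might be placed in SRAM with GCC. It assumes a linker script that copies a section named `.RamFunc` from flash to RAM at startup (CubeMX-generated GCC linker scripts are believed to do this, but check yours); other toolchains use different mechanisms. The `RAM_FUNC` macro and the `bsrr_value` helper are hypothetical names used only for illustration, and off-target the macro expands to nothing so the code still builds on a PC.

```c
#include <stdint.h>

/* On ARM GCC, place the function in the assumed .RamFunc section;
 * elsewhere fall back to a normal function. */
#if defined(__GNUC__) && defined(__arm__)
#define RAM_FUNC __attribute__((section(".RamFunc"), noinline))
#else
#define RAM_FUNC
#endif

/* Value to write to GPIOx->BSRR: low half sets the pin, high half resets it. */
RAM_FUNC uint32_t bsrr_value(uint32_t pin, int set)
{
    return set ? (1u << pin) : (1u << (pin + 16u));
}
```

An ISR can be tagged with the same attribute. Note that the vector table itself is still fetched from wherever it is located (flash by default), so moving only the handler does not take flash out of the interrupt entry path entirely.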


LMI2 · ‎2018-08-30

I cannot answer to individual post, only here. I asked about latencies when using C-language but the code is C.

The code in first post is short. What if the code/project is created with CubeMX. It has several interrupt routines, one after another and callbacks. They sure make things slower.

waclawek.jan · ‎2018-08-30

> I cannot answer to individual post, only here.

Ah, I see. A bit of context would have helped perhaps.

> I asked about latencies when using C-language but the code is C

There is a long way from a C-language program to machine code. The result depends quite heavily on optimization and other settings of the compiler.

> What if the code/project is created with CubeMX. It has several interrupt routines, one after another

> and callbacks. They sure make things slower.

More source code generally means more machine code, although some compilers (and more so the expensive ones) can do a surprising amount of optimization. Splitting things into several files, using variables instead of constants and using lots of indirections make the optimizer's task harder.

CubeMX generates code for Cube/HAL, which is intended to provide portability through abstraction. Code generation and portability bring a certain amount of convenience, and that convenience comes at the cost of the code being suboptimal.

JW

T J · ‎2018-08-30

> For predictable execution use SRAM

Regarding an H7 processor:

What speed can you execute at out of SRAM? Is it HCLK/2?

What speed for cache? HCLK?

Flash?

Can you execute from SDRAM?

How slow would that be?