2018-08-29 02:51 AM
Hello,
I just started with a Nucleo F439ZI and did some tests to examine interrupt latency and code execution times. Here are my results and a number of questions:
I set the system to the maximum speed using CubeMX, so I have HCLK and FCLK = 180MHz, APB1/2 Timer clk = 90MHz, APB1/2 periph clk = 45MHz.
I generate a 10 kHz PWM signal with TIM2 and output it on channel 1 on pin PA5. I also made this event trigger an interrupt; the ISR below outputs a short pulse on PA6:
void TIM2_IRQHandler (void)
{
    //SetPortBit (port_Testpins, pin_Timer2_isr);
    GPIOA->BSRR=1U<<(6);                    // set PA6 (start of test pulse)
    __NOP ();
    //ClearPortBit (port_Testpins, pin_Timer2_isr);
    GPIOA->BSRR=(1U<<16)<<(6);              // reset PA6 (end of test pulse)
    ClearBitMask (TIM2->SR, TIM_SR_CC1IF);  // clear the channel 1 compare flag
}
At 180 MHz one CPU cycle takes about 5.6 nsec, so I first expected the quoted 12 cycles (e.g. here http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka16366.html) to result in a time of about 70 nsec between the falling edge of the PWM signal and the rising edge on PA6. But I measured between 194 and 216 nsec, with jitter.
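The arithmetic behind these figures can be checked with a small sketch. The helper names cycles_to_ns and ns_to_cycles are made up for illustration; only the 180 MHz clock, the 12-cycle figure and the measured 194 to 216 nsec come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a number of core cycles to nanoseconds for a core clock in Hz. */
static inline double cycles_to_ns(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz > 0u);
    return (double)cycles * 1e9 / (double)core_hz;
}

/* Convert a measured time in nanoseconds back to (fractional) core cycles. */
static inline double ns_to_cycles(double ns, uint32_t core_hz)
{
    assert(core_hz > 0u);
    return ns * (double)core_hz / 1e9;
}
```

With these numbers, 12 cycles come out at roughly 67 nsec, while the measured 194 to 216 nsec correspond to roughly 35 to 39 core cycles.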
Then I found that CubeMX had also set my system to 5 wait states, and looking into the reference manual I found that running the controller at 30 MHz needs no wait states, i.e. 1 CPU cycle per flash read, which is 33 nsec. Running it at 180 MHz needs 5 wait states, so 6 CPU cycles per flash read, which is also 33 nsec.
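To make that comparison explicit, here is a minimal sketch that models one flash read as (wait states + 1) CPU cycles. The function name flash_read_ns is an assumption for illustration, and the model deliberately ignores prefetch, cache and the width of a flash line.

```c
#include <assert.h>
#include <stdint.h>

/* Time for one flash read in nanoseconds, modelled as (wait states + 1)
 * CPU cycles at the given HCLK. A simplification: no prefetch, no cache. */
static inline double flash_read_ns(uint32_t wait_states, uint32_t hclk_hz)
{
    assert(hclk_hz > 0u);
    return (double)(wait_states + 1u) * 1e9 / (double)hclk_hz;
}
```

Both flash_read_ns(0, 30000000) and flash_read_ns(5, 180000000) give about 33.3 nsec, which is why the raw flash access time does not improve with the higher clock.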
I also found the instruction prefetch and instruction cache options in FLASH->ACR. Enabling both reduces the time to between 144 nsec and 177 nsec.
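For reference, a hedged sketch of how those two options might be switched on together with the wait states. The bit positions are the ones I understand the reference manual to give for FLASH_ACR on this family (PRFTEN at bit 8, ICEN at bit 9, a 4-bit LATENCY field); they are defined locally so the sketch compiles without device headers, and flash_acr_value is a hypothetical helper name. On the target the CMSIS macros FLASH_ACR_PRFTEN and FLASH_ACR_ICEN would normally be used instead.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the CMSIS FLASH_ACR definitions (assumed bit layout). */
#define ACR_LATENCY_MASK 0xFu        /* LATENCY[3:0]: number of wait states */
#define ACR_PRFTEN       (1u << 8)   /* instruction prefetch enable         */
#define ACR_ICEN         (1u << 9)   /* instruction cache enable            */

/* Return a new FLASH_ACR value: the requested wait states plus prefetch and
 * instruction cache enabled, with all other bits of 'current' preserved. */
static inline uint32_t flash_acr_value(uint32_t current, uint32_t wait_states)
{
    assert(wait_states <= ACR_LATENCY_MASK);
    uint32_t value = current & ~ACR_LATENCY_MASK;  /* clear old latency   */
    value |= wait_states;                          /* set new latency     */
    value |= ACR_PRFTEN | ACR_ICEN;                /* enable both options */
    return value;
}
```

On the target this would be used as FLASH->ACR = flash_acr_value(FLASH->ACR, 5); for the 180 MHz setup described above.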
What I find strange is that commenting out the NOP instruction reduces the jitter significantly without decreasing the pulse width.
Changing the compiler optimization level from none to speed, I get a very short pulse without the NOP but the same interrupt latency, while with the NOP I get a latency of 134 nsec without any jitter.
I would appreciate some background information on these behaviours, especially on what instruction prefetch and the instruction cache do and what their use is when an ISR is being loaded.
Then I would like to ask where precisely the higher clock frequency reduces execution time when the flash read cycles do not benefit from it; I was very surprised to find this.
Is it an option to copy the ISR to RAM and execute it there? Will this reduce interrupt latency? Does code running in RAM execute faster than code running from flash? Where can I find information about how to do this, both for ISRs and for normal C functions? I couldn't find anything detailed.
Thanks a lot for any help and information
Martin
2018-08-31 01:01 AM
In Cortex-M7, things are complicated further by its complex prefetch unit with (manufacturer-configurable) branch prediction, and the dual-issue execution.
> what speed can you execute out of SRAM ? is it HCLK/2 ?
There is no HCLK in the 'H7. The CPU clock is labelled rcc_c_ck.
RM0433: The ITCM is accessed by Cortex-M7 at CPU clock speed, with zero wait states.
Other SRAMs are not intended to run code, although generally it's not excluded.
The AXI SRAM is connected through the AXI bus matrix, so it (presumably) runs at rcc_aclk, and caching may apply to it. Other SRAMs are beyond matrix-to-matrix interconnects, so there are presumably synchronization delays associated.
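As a hedged illustration of what placing a time-critical routine in RAM can look like with GCC: the sketch below assumes the linker script collects a section named ".RamFunc" into RAM (or ITCM) and that the startup code copies it there from flash. Both the section name and the helper names RAM_FUNC and pulse_pin are assumptions, so check them against your own linker script. On a non-ARM host the macro expands to nothing, so the routine can still be compiled and tried out.

```c
#include <stdint.h>

/* Assumption: ".RamFunc" is a section the linker script places in RAM and
 * the startup code initialises from flash. Empty on non-ARM hosts. */
#if defined(__GNUC__) && defined(__arm__)
#define RAM_FUNC __attribute__((section(".RamFunc"), noinline))
#else
#define RAM_FUNC
#endif

/* Pulse one GPIO pin through a BSRR-style register: writing to the low
 * half-word sets the pin, writing to the high half-word resets it. */
RAM_FUNC void pulse_pin(volatile uint32_t *bsrr, unsigned pin)
{
    *bsrr = 1u << pin;          /* set pin   */
    *bsrr = (1u << 16) << pin;  /* reset pin */
}
```

On the target this would be called as pulse_pin(&GPIOA->BSRR, 6). Whether it actually shortens latency depends on where the vector table, the handler itself and the stack live, so it is something to measure rather than assume.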
> what speed for Cache ?
It never occurred to me that it could be anything different from the CPU clock speed. I can't find it written in any material, probably because it would be very weird for it to be otherwise.
But that's not the interesting part, at all. The key question is how fast the cache is filled when it comes to executing any particular part of the code, and what exactly the state of the cache is at that moment, i.e. whether the code to be executed is already there from previous executions or has already been replaced by other code.
> Flash ?
FLASHes are on AXI bus so cacheable, and they have a limited access speed set as the number of waitstates.
> can you execute from Sdram ?
Yes you can. Timing of SDRAM is a very complex question, involving also caching in FMC, for details I recommend you to thoroughly read the FMC chapter in RM. Plus whatever applies to AXI-connected peripherals (including caching).
Generally, there's little precise information available publicly on the exact timing of individual peripherals (by which I mean also the various memories) and their interconnects, so if you are genuinely interested and have the appropriate buying power (as expressed in megabucks per year), you ought to contact ST directly.
As I've said above, generally, the "high speed" in the 32-bitters is more in the "number crunching" sense than "low latency, low jitter" sense, and the "higher" the speed, the more this applies. And, the "higher" the "speed", the higher is complexity, the lower the chance for getting full and exact public information on the timing details, and the lower the chance that any general benchmark will be relevant for estimating performance in a particular setting.
JW
2018-09-03 01:01 AM
Hello all,
thanks for your explanations. They are very welcome and give me valuable background information.
Martin
2018-09-03 03:47 AM
"to result in a time between the falling edge of the PWM signal and the rising edge on PA6 of about 70 nsec."
70 nsec translates into 12 cycles, which is a little too short for the overhead plus ISR execution. I often budget 20 - 40 ticks from the ISR invocation to the beginning of its execution, depending on a variety of factors.
"But I got between 194 and 216 nsec, jittering."
Without anything else going on in the system, however, jitter should be zero.
2020-11-18 12:34 AM
Hi
Do you know how long it takes on the STM32H7 from the assertion of the interrupt request up to the cycle where the first instruction of the interrupt handler is ready to be executed?
Running at 400 MHz.
2020-11-18 01:30 AM
I thought it would be clear from this rather lengthy thread that the answer is "no, it's very complicated and depends on many factors, an order of magnitude more than on the 'F4 (which is already very complicated)".
JW
2020-11-19 07:02 AM
I've just measured about 400ns.
Is that reasonable?
Flash WS = 4 and MCU frequency = 400 MHz.
2020-11-19 07:50 PM
I don't know. What are you measuring, what code exactly (and by that I mean *all* code running), which memories are allocated, on what hardware, and how do you measure it?
As I've said, there are too many variables even in a "simpler" 'F4, in 'H7 they blow the roof.
Generally, in 32-bitters, without paying extra attention to the many details, interrupt latencies don't improve with increasing clock, not even over 8-bitters, and jitter tends to get worse.
JW
2020-11-21 10:20 AM
I've taken two Nucleo H7 boards and connected them via IO.
The first board sets an output IO high; the same IO is connected to the other board on one of the EXTI interrupt request lines.
On the first line of code of the IRQ Handler of the other board, I set another IO to high.
I then measured, using an oscilloscope, the difference between the IO going high on the first board and the IO going high on the second board.