Does running code from RAM improve performance?

Paolo Andreuzza · ‎2020-06-23

Hello.

On other (non-ARM) MCUs I have worked with in the past, it was possible to move parts of time-critical code to RAM and run them from there for best performance, since flash memory access was slower than RAM access.

Does this technique give a real performance advantage on STM32 MCUs?

I've seen that some STM32 MCUs (but not all) have specific hardware to improve flash access speed (ART Accelerator...). On these platforms, does running code from RAM improve performance even further, or is it not a good idea?

In particular, on the STM32G070, which does not seem to have any specific hardware for flash management, does running code from RAM improve performance significantly, or is it not worth it?

Thank you in advance.

Paolo

berendi · ‎2020-06-23

If the flash is operating with wait states, then yes, it would improve performance. No idea how much, it is a typical YMMV. Look up FLASH read access latency in the reference manual.

Consider other ways to increase code performance, like enabling link time optimization, or rewriting general library functions to handle only your use case instead.

RMcCa · ‎2020-06-23

I experimented with this on an F730, comparing the ART accelerator vs. copying and running time-critical routines from ITCM RAM. It didn't make any significant difference; I suspect that the ART accelerator works pretty well, making flash access about as fast as the tightly coupled memory. From this I concluded that the main intent of the 16k of IT

RMcCa · ‎2020-06-23

.... 16k of ITCM ram is for custom bootloaders.

berendi · ‎2020-06-23

It would depend on the type of workload. E.g. having the vector table and ISR in ITCM might improve interrupt latency by a few cycles, but it would not matter at all if the ISR then calls a HAL interrupt handler with 600+ cycles overhead.

And the point of the question is that the STM32G0 series has no ART.

Paolo Andreuzza · ‎2020-06-23

This is the point.

The STM32L475 datasheet says: Core: Arm® 32-bit Cortex®-M4 CPU with FPU, Adaptive real-time accelerator (ART Accelerator™) allowing 0-wait-state execution from Flash memory.

But my G0 doesn't have an ART accelerator, so in addition to enabling all the optimizations at compile time, I should also run the critical code from the ram, am I correct?

TDK · ‎2020-06-23

> I should also run the critical code from the ram, am I correct?

Only you are going to be able to answer that. It will increase the speed of execution, but at the cost of using up RAM. Measure performance when running from flash and from RAM and decide which is better. It will also be code dependent.
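For the measurement itself, a minimal sketch of timing a routine with SysTick might look like the following. It assumes SysTick is running from the core clock with a reload value of 0xFFFFFF and that the routine finishes within one counter period. The names `systick_elapsed` and `measure_ticks` are made up for illustration; the tick source is passed in as a function pointer so the same code can be tried on a PC, and on the target `now()` would simply return `SysTick->VAL`.

```c
#include <stdint.h>

/* SysTick is a 24-bit down counter. */
#define SYSTICK_MASK 0x00FFFFFFu

/* Ticks elapsed between two reads of a 24-bit down counter.
   Assumes reload = 0xFFFFFF and at most one wrap between the reads. */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & SYSTICK_MASK;
}

/* Hypothetical helper: times one call of fn using the tick source now().
   On the target, now() would return SysTick->VAL. */
uint32_t measure_ticks(void (*fn)(void), uint32_t (*now)(void))
{
    uint32_t start = now();
    fn();
    uint32_t end = now();
    return systick_elapsed(start, end);
}
```

Run the same routine once linked in flash and once linked in RAM, and compare the two tick counts. The result includes the overhead of the two counter reads and the call, so it is best used for comparison rather than as an absolute figure.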

Paolo Andreuzza · ‎2020-06-23

Thank you, guys.

So could we say that, where the ART Accelerator is not present, executing code from RAM improves performance at the cost of the RAM it occupies? In that case the trade-off has to be assessed case by case to choose the best solution.

waclawek.jan · ‎2020-06-23

You can't transfer conclusions from a ~100MHz Cortex-M4 (let alone M7) to a ~50MHz Cortex-M0+ with a quite different architecture, an entirely different port structure, and an entirely different bus and peripheral set.

As said above, benchmark your application.

JW

Tesla DeLorean · ‎2020-06-23

I haven't reviewed the G0, not using it here, but historically with ST's CM0(+) designs the increase in wait state at 24 or 32 MHz is quite detrimental to code execution.

A CM0 at 24 MHz with zero wait states will outperform a CM0 at 25 MHz with one wait state. You'd have to clock faster to find the break-even point.
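To put rough numbers on that, here is a deliberately pessimistic back-of-the-envelope model. It assumes every flash fetch pays the full wait-state penalty and that there is no prefetch buffer or wide flash line to hide it, so a real part will usually lose less than this. The function name `effective_fetch_khz` is made up for the example.

```c
#include <stdint.h>

/* Crude worst-case model: each fetch costs (1 + wait_states) core cycles,
   so the effective fetch rate is the core clock divided by that factor.
   Assumption: no prefetch, no cache, no wide flash reads. */
uint32_t effective_fetch_khz(uint32_t core_khz, uint32_t wait_states)
{
    return core_khz / (1u + wait_states);
}
```

Under this model 24 MHz with zero wait states fetches at 24000 kHz, 25 MHz with one wait state fetches at only 12500 kHz, and the break-even with one wait state is at 48 MHz. Treat it as an upper bound on the penalty, not a prediction.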

So yes in this case migrating critical code to RAM, including the vector table, can help. If you can run at 48 or 64 MHz from RAM, I'd probably do that.
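With GCC, moving a routine to RAM is typically done with a section attribute. The sketch below assumes the linker script places a `.RamFunc` input section in RAM and copies it from flash at startup, which is what the STM32Cube-generated GCC linker scripts do as far as I know; check your own `.ld` file. `RAM_FUNC` and `sum_samples` are names made up for the example.

```c
#include <stdint.h>

/* Assumption: the linker script copies the ".RamFunc" section to RAM at
   startup. On a non-ARM host the attribute is dropped so the sketch
   still compiles and runs there. */
#if defined(__GNUC__) && defined(__arm__)
#define RAM_FUNC __attribute__((section(".RamFunc"), noinline))
#else
#define RAM_FUNC
#endif

/* Hypothetical time-critical routine: sums a buffer of samples. */
RAM_FUNC uint32_t sum_samples(const uint16_t *buf, uint32_t len)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < len; i++) {
        acc += buf[i];
    }
    return acc;
}
```

Anything the routine calls (library helpers, division routines and so on) still runs from flash unless it is moved as well, so check the map file before drawing conclusions from a benchmark.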

ST's implementations tend to be RAM poor.