Thomas Legrand

STM32F7 Speeds and caches

Discussion created by Thomas Legrand on Jan 23, 2018
Latest reply on Apr 29, 2018 by HarjitS



I'm starting a project with an STM32F746VG, so I'm playing with the corresponding Nucleo board.


I wanted to compare execution speed with different CPU and flash acceleration features enabled, running the exact same code each time (everything is compiled at -O0).


I based my experiment on the cycle count of the SysTick interrupt. This is not very scientific, but it gives an idea of the result, and with SEGGER SystemView it is very quick to instrument.
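For anyone without SystemView, a minimal sketch of the arithmetic behind a cycle measurement. This is not what was used for the numbers in this post. It assumes two readings of a free-running 32-bit cycle counter; on a Cortex-M7 that would typically be DWT->CYCCNT read at ISR entry and exit, which is an assumption here and not shown. The helper name cycles_elapsed is made up for illustration.

```c
#include <stdint.h>

/* Cycles elapsed between two readings of a free-running 32-bit counter.
 * Unsigned subtraction is modulo 2^32, so the result is still correct
 * if the counter wrapped around once between the two readings.
 * Hypothetical helper: on target, start/end would be counter reads
 * (e.g. DWT->CYCCNT, assumed to be enabled beforehand). */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}
```

The subtraction is done on unsigned values on purpose, so no special case is needed for counter wrap-around as long as the measured interval is shorter than 2^32 cycles.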


First I started with normal FLASH memory access: no ART, no cache, no prefetch. The ISR takes 1128 cycles; that's our 100%.

Enabling ART or prefetch does nothing. This is expected, as the documentation clearly states they only work on ITCM accesses.


So let's move the executable code to ITCM FLASH memory. This is done by changing the linker script (>ITCMFLASH AT >FLASH instead of >FLASH).

With no ART, prefetch or cache, the SysTick ISR takes 946 cycles. That's 84% of our reference, so 16% faster.


Let's enable ICache then: the ISR takes 947 cycles. I may have messed something up, but enabling ICache via SCB_EnableICache does nothing for performance ...


Now let's enable prefetch: the ISR takes 914 cycles. That's 81% of our reference, so 19% faster.


And last but (as you will see) not least, enable ART: the ISR takes ... 681 cycles. That's 60% of the reference, so 40% faster, a speedup of about 1.66x. Keep in mind it is the EXACT SAME CODE, just moved to ITCM FLASH with ART enabled.


I did one last test, enabling only ART and not prefetch: the result was 681 too.
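For reference, a minimal sketch of how the ART and prefetch switches could be toggled by hand. It assumes the FLASH_ACR layout I know from the STM32F74x reference manual (PRFTEN at bit 8, ARTEN at bit 9), so check it against your part's manual. The function name flash_accel_config is made up, and the register is passed as a pointer so the logic can also be exercised on a host; on target the argument would be the address of the real register (e.g. &FLASH->ACR from the device header).

```c
#include <stdint.h>

/* Assumed FLASH_ACR bit positions (STM32F74x); verify against the
 * reference manual for the actual device. */
#define ACR_PRFTEN (1u << 8)  /* prefetch enable        */
#define ACR_ARTEN  (1u << 9)  /* ART accelerator enable */

/* Hypothetical helper: set or clear ART and prefetch while leaving
 * every other bit (e.g. the wait-state field) untouched. */
static inline void flash_accel_config(volatile uint32_t *acr,
                                      int art_on, int prefetch_on)
{
    uint32_t v = *acr;                  /* read current register value */
    v &= ~(ACR_ARTEN | ACR_PRFTEN);     /* clear both enable bits      */
    if (art_on)      v |= ACR_ARTEN;
    if (prefetch_on) v |= ACR_PRFTEN;
    *acr = v;                           /* single write back           */
}
```

With something like this, the "ART only" test above would be flash_accel_config(acr, 1, 0) and the "ART plus prefetch" test flash_accel_config(acr, 1, 1).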

Again, measuring only the SysTick ISR cycle count is very poor instrumentation, but damn, about 1.66 times as fast just by tweaking the linker script and enabling ART and prefetch!!!


One more data point: the same code with ITCM FLASH, ART, prefetch, Os optimization and LTO enabled gives a ridiculous 254-cycle ISR time. That is more than 4 times faster than the reference (4.44 times, to be exact).
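The percentages and speedups quoted in this post can be rechecked with a few lines of integer arithmetic. A minimal sketch, with helper names made up for illustration; both results are rounded to the nearest integer, and the speedup is scaled by 100 to avoid floating point.

```c
#include <stdint.h>

/* Cycle count as a percentage of the reference, rounded to nearest. */
static inline uint32_t percent_of_reference(uint32_t cycles,
                                            uint32_t reference)
{
    return (cycles * 100u + reference / 2u) / reference;
}

/* Speedup versus the reference, scaled by 100 and rounded to nearest
 * (444 means 4.44 times as fast). */
static inline uint32_t speedup_x100(uint32_t cycles, uint32_t reference)
{
    return (reference * 100u + cycles / 2u) / cycles;
}
```

With the 1128-cycle reference, the 681-cycle run comes out at 60% and a scaled speedup of 166, and the 254-cycle run at a scaled speedup of 444.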


So take the time to evaluate and instrument your code; there is no point paying for an ultra-fast MCU and not using its full potential.


By the way, I did try to enable the DCache (using SCB_EnableDCache()), but that crashes the program ... I did not have time to investigate, and I don't get why.