F7 is slower than my F4, even though the F7 runs at double the frequency

HL?�t · ‎2020-07-05

Hi,

I posted a question on StackExchange about this topic:

https://electronics.stackexchange.com/questions/508828/which-microcontroller-for-a-program-with-many-floating-point-operations

My STM32F4 (running at 100 MHz) is somehow faster than my STM32F7 (running at 200 MHz), and I really don't know what I can do about it. I wrote a small test program; the rest is generated by CubeMX, identically for both. FPU support is activated for both and I use the CubeMX-generated settings, except that I deactivated compiler optimization. The same problem occurs with optimization enabled.

Does anyone have an idea what else I could try? I switched from Atollic TrueSTUDIO 9.3 (TrueSTUDIO toolchain) to STM32CubeIDE, and there the F7 runs even slower (cycle time increased from 9.9 ms to 11 ms).

RMcCa · ‎2020-07-05

How is the processor connected to the flash? If you study the block diagram of the F7, you will see that there are 2 different data buses you can use. The ART accelerator and its data bus are much faster. Also, double-check that the FPU is properly started and that the compiler is generating the right instructions.
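To make the FPU check concrete, here is a minimal sketch of the test on the coprocessor access register. It assumes the standard Cortex-M layout, where the CP10 and CP11 fields sit in bits 20 to 23 of CPACR. The helper names `fpu_access_enabled` and `cpacr_with_fpu` are made up for illustration and are not part of CMSIS or the HAL. The functions take the register value as an argument, so they can be tried on a PC as well.

```c
#include <stdint.h>
#include <stdbool.h>

/* CP10 and CP11 access fields: bits 20..23 of the Cortex-M CPACR register.
   0b11 in each field means full access. */
#define SKETCH_CPACR_FPU_FULL (0xFu << 20)

/* Hypothetical helper: true when both fields grant full access to the FPU. */
static inline bool fpu_access_enabled(uint32_t cpacr)
{
    return (cpacr & SKETCH_CPACR_FPU_FULL) == SKETCH_CPACR_FPU_FULL;
}

/* Hypothetical helper: CPACR value with full FPU access, other bits untouched. */
static inline uint32_t cpacr_with_fpu(uint32_t cpacr)
{
    return cpacr | SKETCH_CPACR_FPU_FULL;
}
```

On the target you would pass in `SCB->CPACR`, for example `if (!fpu_access_enabled(SCB->CPACR)) { ... }`. For the compiler side, the disassembly should contain FPU instructions such as `vadd.f32` rather than calls to soft-float helpers such as `__aeabi_fadd`.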

HL?�t · ‎2020-07-05

Thank you, it really was the ART Accelerator, which was deactivated. Now it runs properly.
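For anyone landing here with the same problem, a rough sketch of what "ART enabled" means at register level. The bit positions below (PRFTEN at bit 8, ARTEN at bit 9 of FLASH_ACR) are my assumption from the STM32F7 reference manual and should be checked for your exact part. The `SKETCH_` macros and the two helper functions are hypothetical names, not HAL symbols.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed FLASH_ACR bit positions on STM32F7; verify in the reference manual. */
#define SKETCH_ACR_PRFTEN (1u << 8)  /* prefetch enable */
#define SKETCH_ACR_ARTEN  (1u << 9)  /* ART accelerator enable */

/* Hypothetical helper: true when the ART enable bit is set in an ACR value. */
static inline bool art_enabled(uint32_t acr)
{
    return (acr & SKETCH_ACR_ARTEN) != 0u;
}

/* Hypothetical helper: ACR value with ART and prefetch switched on.
   The latency field in the low bits is left unchanged. */
static inline uint32_t acr_enable_art(uint32_t acr)
{
    return acr | SKETCH_ACR_ARTEN | SKETCH_ACR_PRFTEN;
}
```

On the target this would be used as `FLASH->ACR = acr_enable_art(FLASH->ACR);`. As far as I know, the F7 HAL also provides the macro `__HAL_FLASH_ART_ENABLE()` for the same purpose.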

Piranha · ‎2020-07-05

Strange as it sounds, on a Cortex-M7 with 8+8 KB or more of cache, ART actually isn't the fastest option. The AXI bus with I-cache is even faster. Look at AN4667, table 8.

RMcCa · ‎2020-07-06

Interesting, that runs counter to my experience with the F730. Does the performance of I-cache vs. ART have anything to do with the structure of the code?

I tried 3 different arrangements, measuring execution speed by setting/clearing a GPIO pin at the beginning and end of the main loop.

I found ITCM flash with ART and DTCM RAM to be about the same speed as ITCM RAM and DTCM RAM, whereas AXI flash with I-cache was noticeably slower.

Could there be some interaction between compiler optimisation and the behavior of I-cache vs. ART? Before I read the app note, all I had read about ART was that it can approach 0 wait states, so my results seemed logical.
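As a cross-check on the GPIO-toggle measurement described above, the loop time can also be taken from a free-running cycle counter (on Cortex-M7 that would typically be `DWT->CYCCNT`, assuming it has been enabled). The sketch below only shows the arithmetic. The function names `elapsed_cycles` and `cycles_to_us` are hypothetical, and the counter reads are passed in as plain values.

```c
#include <stdint.h>

/* Elapsed cycles between two reads of a free-running 32-bit counter.
   Unsigned subtraction stays correct across a single counter wraparound. */
static inline uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to microseconds for a core clock given in Hz.
   The 64-bit intermediate avoids overflow in the multiplication. */
static inline uint32_t cycles_to_us(uint32_t cycles, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000u) / core_hz);
}
```

For scale: a 9.9 ms loop at 200 MHz corresponds to 1,980,000 cycles, so `cycles_to_us(1980000u, 200000000u)` gives 9900. Comparing cycle counts rather than milliseconds makes it easier to compare arrangements that run at different clock speeds.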