Performance of CM4 core of STM32H745 vs the CM4 of STM32G474

marcoaccame · 2024-06-21

Hi all,

When running the same piece of code (math operations with random access to RAM), we measure shorter execution times on the CM4 of the STM32G474 than on the CM4 core of the STM32H745.

The STM32G474 is clocked at 168 MHz and uses the SRAM mapped at 0x20000000.

The CM4 of the STM32H745 is clocked at 200 MHz (and its CM7 at 400 MHz) and uses AHB SRAM1 mapped at address 0x30000000.

How is it possible that the slower-clocked CM4 core of the STM32G474 performs better? Is there an explanation?
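To make the expectation behind this question concrete, here is a minimal C sketch of the arithmetic: if both cores needed the same number of cycles, run time would scale inversely with the clock. The clock values are the ones quoted above; the function names and the cycle count in the note below are hypothetical and are not measurements from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Clock frequencies quoted in the post above. */
#define G4_CM4_HZ 168000000u
#define H7_CM4_HZ 200000000u

static_assert(H7_CM4_HZ > G4_CM4_HZ, "the H7 CM4 is the faster-clocked core");

/* Time in microseconds needed to execute `cycles` core cycles at `hz`. */
static double run_time_us(uint64_t cycles, uint32_t hz)
{
    return (double)cycles * 1e6 / (double)hz;
}

/* Ratio t_G4 / t_H7 expected if both cores needed the same cycle count. */
static double ideal_speedup_h7_over_g4(void)
{
    return (double)H7_CM4_HZ / (double)G4_CM4_HZ;
}
```

For a hypothetical workload of 1,000,000 cycles, `run_time_us` gives 5000 µs at 200 MHz, and the ideal ratio is 200/168, about 1.19. A measured ratio below 1 would suggest that the H7 core spends more cycles on the same code, for example in memory or bus wait states.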

Thanks,

Marco Accame, Ph.D.

iCub Tech Facility, Istituto Italiano di Tecnologia

CRIS, via S.Quirico 19D, 16163 Genoa Italy
e-mail: marco.accame@iit.it

marcoaccame · 2024-07-04

Hi SofLit,

as per your request, the public repository https://github.com/icub-tech-iit/study-cm4-performances contains:

- details of the achieved results
- the test code used
- instructions on how to reproduce the tests on development boards from STMicroelectronics using minimal projects

In short: single-precision floating-point operations take up to 10% longer on the CM4 of the STM32H745 than on the CM4 of the STM32G474, with both clocked at the same CPU speed, and we cannot understand why.

Thanks, Marco.

STOne-32 · 2024-07-05

Dear @marcoaccame ,

Is it possible to change the linker file for the STM32H7 so that SRAM1 is addressed via the S-Bus at 0x30000000 instead of 0x10000000, and then check the performance again?

In fact, when executing code from Flash the core uses the I-Bus, but when we access SRAM at 0x10000000 (the alias) it uses the D-Bus, which has extra latency compared with an S-Bus access to the same physical SRAM at 0x30000000.
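The bus selection described here follows from the Cortex-M4 memory map: the core routes accesses below 0x20000000 (the Code region) to its ICode and DCode buses and accesses above it to the System bus, apart from the Private Peripheral Bus window. A minimal C sketch of that decision follows; the type and function names are hypothetical, and the sketch says nothing about which path is faster on a given device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bus interfaces of a Cortex-M4 core. */
typedef enum { CM4_BUS_ICODE, CM4_BUS_DCODE, CM4_BUS_SYSTEM, CM4_BUS_PPB } cm4_bus_t;

/* Architectural limit of the Code region (0x00000000-0x1FFFFFFF). */
#define CM4_CODE_REGION_END 0x20000000u
/* Private Peripheral Bus window (0xE0000000-0xE00FFFFF). */
#define CM4_PPB_START       0xE0000000u
#define CM4_PPB_END         0xE0100000u

/* Addresses quoted in this thread for the same physical SRAM1. */
#define H7_SRAM1_ALIAS      0x10000000u
#define H7_SRAM1_BASE       0x30000000u

/* The two views of SRAM1 sit on opposite sides of the Code region limit. */
static_assert(H7_SRAM1_ALIAS < CM4_CODE_REGION_END, "alias is in the Code region");
static_assert(H7_SRAM1_BASE >= CM4_CODE_REGION_END, "base is outside the Code region");

/* Return the core bus interface that carries an access to addr. */
static cm4_bus_t cm4_bus_for(uint32_t addr, bool is_instruction_fetch)
{
    if (addr < CM4_CODE_REGION_END) {
        return is_instruction_fetch ? CM4_BUS_ICODE : CM4_BUS_DCODE;
    }
    if (addr >= CM4_PPB_START && addr < CM4_PPB_END) {
        return CM4_BUS_PPB;
    }
    return CM4_BUS_SYSTEM; /* instruction fetches and data accesses alike */
}
```

A data access to 0x10000000 is classified as DCode and one to 0x30000000 as System, even though both addresses reach the same SRAM1 cells.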

Thanks again for the detailed project, very impressive and informative.

Ciao

STOne-32

marcoaccame · 2024-07-08

Hi STOne-32,

I ran the tests and nothing changes: the times stay the same for both flash-placed and RAM-placed code.

To confirm that 0x30000000 is in use, I report here the .map file of the flash project.

STOne-32 · 2024-07-13

Dear @marcoaccame ,

Thank you for the details. This is my last request: if possible, please keep the load regions as they are, but in the scatter file move all execution regions as follows:

- RO data and RO code: to SRAM1 at 0x10000000
- all RW/stack/heap, as done: to SRAM2 at 0x30020000

(for the H7 Cortex-M4)

This will give us better visibility into whether the impact of the Flash ART is predominant, or the architecture of the buses. My colleague @SofLit will share the results after our full analysis.

Have a great weekend.

Ciao

STOne-32

MM..1 · 2024-07-13

Hi, interesting topic. Maybe I am missing something, as I am not a dual-core expert, but this dual-core device shares some things between the cores, such as flash, buses, RAMs ...

Running one core at exactly 400 MHz and the second at 200 MHz isn't realistic: every nth clock results in a collision on the flash or on other shared parts.

The resulting speed for the M4 then isn't a real 200 MHz, and seems to be lower than that of a single core at 168 MHz... Yes, caches such as ART and others can improve things, but the real speed isn't the full speed, is it?

SofLit · 2024-07-15

Hello @marcoaccame and thank you for sharing your environment.
Unfortunately, I didn't succeed in building your projects since the Keil project is broken: all files are missing in the project tree. I also noticed that you used C++ in your tests, which we don't use in this kind of situation. We use only C.
Moreover, I didn't understand how you managed the code and data location, as there is no specific scatter file for each config.
Also, for math functions, I'm not sure whether you relocated the math objects (sin(), cos(), etc.) to the right location.
For this reason I created my own projects.
I used two kinds of benchmarks:
- The first one uses the BubbleSort algorithm.
- The second one is inspired by your code: I converted your algorithm from C++ to C.
The conclusion was: there is no issue with H7/CM4 performance compared to G4/CM4 when the Flash accelerators are enabled. The performance is the same when the two MCUs execute from the internal SRAMs.
I'm attaching:
- A .7z file that contains the workspaces I used for my benchmarks.
- A slide deck summarizing the work done: results and conclusions.

Hope it brings some answers to your questions 😉.

To give better visibility on the answered topics, please click on "Accept as Solution" on the reply which solved your issue or answered your question.

marcoaccame · 2024-07-15

Hi SofLit,

thanks for your tests. I will have a look at them and let you know.

Just two comments.

About the Keil project having all files missing: you probably opened the project from here: https://github.com/icub-tech-iit/study-cm4-performances/tree/master/code/icubtech/projects and not from inside the STM32CubeH7 and STM32CubeG4 repos, as described here: https://github.com/icub-tech-iit/study-cm4-performances/blob/master/code/how-to-run-tests.md

About C++: I don't see any reason why C++ cannot be used. It is my belief that the C++ used in our tests compiles as efficiently as C. Even if it does not, the same code runs on both the G4 and the H7, so the comparison between them still holds.

Thanks again, Marco

SofLit · 2024-07-15

"About C++: I don't see any reason why C++ cannot be used. It is my belief that the C++ used in our tests can be compiled as good as C. Even if not, it is the same code that we used as a comparison between G4 and H7."

At ST we don't use C++ for performance benchmarks.

marcoaccame · 2024-07-15

Hi STOne-32,

I re-ran the tests with the load regions as you suggested (see the following scatter file),

and things run better. It is the right path to follow.

Now one test runs the same on both the G4 and the H7. The other two tests run slower on the H7, but the extra time has dropped to 1.5% and 5%, which can be acceptable.
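For readers who want to reproduce percentages like these from their own measurements, here is a small C sketch of the calculation. The function name is hypothetical, and the numbers in the note below are illustrative values chosen to give 1.5% and 5%, not the actual measurements from the repository.

```c
#include <assert.h>
#include <stdint.h>

/* Extra execution time of `measured` over `reference`, in percent.
 * Both arguments use the same unit (cycles, microseconds, ...). */
static double extra_time_percent(uint32_t reference, uint32_t measured)
{
    assert(reference != 0u); /* a zero reference has no meaningful ratio */
    return 100.0 * ((double)measured - (double)reference) / (double)reference;
}
```

With a hypothetical G4 reference of 1000 units, H7 results of 1015 and 1050 units correspond to 1.5% and 5% extra time. A negative result means the measured run was faster than the reference.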

I have updated our repository with your contribution: https://github.com/icub-tech-iit/study-cm4-performances/blob/master/docs/improvements.md

Thanks, Marco.

marcoaccame · 2024-07-16

Hi STOne-32 and SofLit,

Thanks for your help with my request to speed up some nasty floating-point code.

The key point was to place the RW and ZI data of the code being optimized in SRAM2 at 0x30020000, and yes, to use ART, which I had already enabled.

I have also tested the RO and XO placement in both SRAM1 and FLASH, and they behave the same.

It is important, however, to use SRAM2 at 0x30020000 in the scatter file, because its alias address 0x10020000 does not give the speed-up.
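Since the alias and the 0x30000000-based address differ only by a constant offset, it is easy to pick the wrong one by accident in a scatter file. The following C sketch shows one way to detect and translate an alias address. The four addresses come from this thread; the SRAM2 size of 128 KB is an assumption to be checked against the reference manual, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Addresses quoted in this thread. */
#define D2_SRAM1_BASE   0x30000000u
#define D2_SRAM1_ALIAS  0x10000000u
#define D2_SRAM2_BASE   0x30020000u
#define D2_SRAM2_ALIAS  0x10020000u

/* ASSUMPTION: SRAM2 is 128 KB, the same size as SRAM1. */
#define D2_SRAM2_SIZE   0x00020000u

/* Both banks are aliased with the same constant offset. */
#define D2_ALIAS_OFFSET (D2_SRAM1_BASE - D2_SRAM1_ALIAS)
static_assert(D2_SRAM2_BASE - D2_SRAM2_ALIAS == D2_ALIAS_OFFSET,
              "same offset for both banks");

/* True if addr lies in the alias window covering SRAM1 and SRAM2. */
static bool d2_is_alias(uint32_t addr)
{
    return addr >= D2_SRAM1_ALIAS && addr < D2_SRAM2_ALIAS + D2_SRAM2_SIZE;
}

/* Map an alias address to its 0x30000000-based equivalent;
 * any other address is returned unchanged. */
static uint32_t d2_canonical(uint32_t addr)
{
    return d2_is_alias(addr) ? addr + D2_ALIAS_OFFSET : addr;
}
```

A startup self-check could, for example, call `d2_is_alias` on the address of a variable that is expected to live in SRAM2 and report an error if it returns true.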

Having solved all that, I have two further questions:

1. Why does placing the RW and ZI data in SRAM2 solve the problem, while placing them in SRAM1 does not?
2. Why must SRAM2 be used at address 0x30020000 and not at its alias 0x10020000?

Do you have an application note or a section in the manuals that describes the correct use of these memory banks?

Thanks, Marco.