Project migration from STM32F405RG to STM32G473RE

Greg H · ‎2019-11-19

STM32G4 CPU speed. I am migrating a project from the STM32F405RG to the STM32G473RE. System clock frequency for both is 168MHz. I have found the code runs at about half speed on the STM32G473 and I would like to know why and how to fix it.

The STM32G473 flash is in dual bank mode and has data cache, instruction cache and prefetch enabled with 8 wait states. Will switching to single bank mode, which increases the flash data width from 64 bit to 128 bit to match the STM32F405, restore the speed, or is there more likely another cause? Will running the code from SRAM restore the speed? I would prefer to stay in dual bank mode.

Greg H · ‎2019-11-26

Speed Problem Fixed.

By moving the main interrupt service routine to CCM RAM I obtained considerable speed improvement. Here are the computation time comparisons for the ISR:

STM32F405 from FLASH: 7 microseconds

STM32G473 from FLASH in dual bank mode: 14 microseconds

STM32G473 from CCM RAM: 6 microseconds

The code is heavy with branching so I suspect the combination of increased wait states and the reduced FLASH bus width contributed to the problem.
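
For scale, the ISR times above can be converted to core clock cycles. This is only a rough sketch: it assumes the 168 MHz system clock mentioned in the first post, and the helper name `us_to_cycles` is made up for illustration.

```c
#include <stdint.h>

/* Convert a duration in microseconds to core clock cycles.
 * sysclk_mhz is the system clock in MHz (168 assumed in this thread). */
static uint32_t us_to_cycles(uint32_t micros, uint32_t sysclk_mhz)
{
    return micros * sysclk_mhz;
}
```

At 168 MHz, 7 us is 1176 cycles, 14 us is 2352 cycles and 6 us is 1008 cycles, so moving the ISR from dual bank FLASH to CCM RAM saves roughly 1344 cycles per invocation.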

Thank you, everyone, for your comments.

GH

waclawek.jan · ‎2019-11-20

Is your application computationally intensive, or moving around large amounts of data?

You should perhaps benchmark with some simple code resembling your application, taking out all the peripherals and similar stuff from the equation, before you make conclusions.

JW

Greg H · ‎2019-11-20

The application is computationally intensive. It is a field-oriented control motor drive operating in integer mode (the FPU is not used). The code is mostly short instructions with low data movement. I will try some benchmarking, but first I would like to know if there is a likely speed downgrade when switching from 128 bit to 64 bit flash data access.

waclawek.jan · ‎2019-11-20

If you have a linear stream of 16-bit (one-halfword) instructions with little data movement, then executing 64 bits takes 4 cycles and executing 128 bits takes 8 cycles. With 8 wait states per FLASH read (9 cycles per read), in the former case you'll wait 5 cycles and in the latter 1 cycle, i.e. you'll execute 4 or 8 instructions in 9 cycles. So in that case, yes, FLASH bus width will have a significant impact; maybe that's exactly the root of the observed halving of execution speed.

Two-halfword instructions, loads, and consecutive stores, especially to slower memories, all slow down execution, i.e. diminish the impact of FLASH width. I would need to think about the impact of the portion of jumps served from the jump cache, but that percentage is not entirely trivial to establish anyway.
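
The linear-stream estimate above can be written down as a small model. This is a simplified sketch, not a cycle-accurate description of either part: it assumes one 16-bit instruction retires per cycle, that the prefetch starts reading the next flash line as soon as the current one arrives, and it ignores caches, branches and data accesses. The function names are made up for illustration.

```c
#include <assert.h>

/* Cycles the core needs to execute one flash line of 16-bit instructions. */
static unsigned exec_cycles(unsigned line_bits)
{
    assert(line_bits % 16u == 0u);   /* whole halfwords only */
    return line_bits / 16u;          /* one halfword instruction per cycle */
}

/* Cycles between the arrival of two consecutive flash lines:
 * whichever is slower, executing the line or reading the next one. */
static unsigned cycles_per_line(unsigned line_bits, unsigned wait_states)
{
    unsigned exec = exec_cycles(line_bits);
    unsigned read = wait_states + 1u;    /* N wait states = N + 1 cycles */
    return exec > read ? exec : read;
}

/* Cycles per line during which the core sits idle waiting for flash. */
static unsigned stall_cycles(unsigned line_bits, unsigned wait_states)
{
    return cycles_per_line(line_bits, wait_states) - exec_cycles(line_bits);
}
```

With 8 wait states, both `cycles_per_line(64, 8)` and `cycles_per_line(128, 8)` give 9 cycles, but those 9 cycles cover 4 instructions in the 64-bit case (5 stall cycles) and 8 instructions in the 128-bit case (1 stall cycle), which is the factor of two described above.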

JW

Greg H · ‎2019-11-21

It's looking like FLASH bus width is the problem.

I will try a few things starting with moving critical code to CCM RAM and post results.

waclawek.jan · ‎2019-11-22

> I will try a few things starting with moving critical code to CCM RAM and post results.

Note that there are several things to bear in mind when doing that as well, e.g. you don't want to have code and data in the same memory (especially if there are other masters such as DMA working on those data in that memory), and you want to run the code through the I+D ports of the processor, i.e. mapped below 0x2000'0000. And don't forget that local variables and the stack are data too.

There was a benchmark in one of the Technical Updates https://community.st.com/s/question/0D50X00009Xkba4SAB/stm32-technical-updates for the 'F4, demonstrating the detrimental effect of code going through the S port.

JW

Singh.Harjit · ‎2019-11-22

I have been planning to use the STM32G474 for FOC control, too. So, I'm very interested in your results. To get an estimate of the complexity of what you are doing, can you share whether you are using sensored or sensorless FOC? Are you using ST's library, your own, or something else?

Thanks.

Greg H · ‎2019-11-24

I use my own version of FOC which is sensorless. Details are available at http://www.ghunter.net/Sensorless_PMSM.html. I don't use any of ST's libraries. I prefer to use direct peripheral register access for fastest speed.

To move the code to SRAM, I am using the excellent application note AN4296: Use STM32F3/STM32G4 CCM SRAM with IAR™ EWARM, Keil® MDK-ARM and GNU-based toolchains. It covers all the aspects JW brought up.
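
For readers on a GNU toolchain, placing a function in CCM SRAM usually comes down to a section attribute plus matching linker script and startup support. The following is only a minimal sketch under assumptions: the section name `.ccmram`, the macro `CCMRAM_FUNC` and the function `foc_q15_mul` are hypothetical, and the linker script must define that output section in CCM SRAM and the startup code must copy it from FLASH before the function is called. It is not the code from this project.

```c
#include <stdint.h>

/* Hypothetical section name: it must match an output section that the
 * linker script places in CCM SRAM and that startup code copies from FLASH.
 * On a non-ARM host build the macro expands to nothing. */
#if defined(__arm__) && defined(__GNUC__)
#define CCMRAM_FUNC __attribute__((section(".ccmram"), noinline))
#else
#define CCMRAM_FUNC
#endif

/* Hypothetical stand-in for time-critical integer code:
 * multiply two Q15 fixed-point values and return the Q15 result. */
CCMRAM_FUNC int32_t foc_q15_mul(int16_t a, int16_t b)
{
    return ((int32_t)a * (int32_t)b) / 32768;
}
```

For example, `foc_q15_mul(16384, 16384)` multiplies 0.5 by 0.5 in Q15 and returns 8192 (0.25). Data touched by such a function, including the stack, still lives wherever the linker put it, so the points JW raised about code and data sharing a memory still apply.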

waclawek.jan · ‎2019-11-25

> excellent application note AN4296

Except it's been converted into the nauseating "modern feel and look".

JW

STOne-32 · ‎2019-11-25

Dear @Greg H ,

Can you detail whether the code is really executing from dual bank FLASH while a portion of one bank is being programmed on the fly during the FOC loop? In that case, of course, the critical loop should be placed in the CCM RAM, which is provided for that use case.

A 50% performance loss in MIPS on a Cortex-M4 is like clocking at half speed, that is, 84 MHz instead of 168 MHz, so this is not an apples-to-apples comparison. If possible, please share at least a portion of the code so we can reproduce it in the same configuration on our STM32G4 Nucleo. I am tagging my colleagues and moderators @Amel NASRI and @Imen DAHMEN to help with this investigation.

Cheers,

STOne-32