I recently ported my STM32F413 application (running at 100 MHz) to an STM32H7A3 (running at 280 MHz), but my DSP routine does not run 2.8x faster. Does anyone have any idea why?

nNg.1 · ‎2022-10-28

I measured the time taken to run a length-127 multiply-accumulate routine by toggling a GPIO pin before and after the arm_dot_prod_q15() call and measuring the pulse width on a scope. It was 7 us on the STM32F413 and 6 us on the STM32H7A3.
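
To compare the two parts independently of clock speed, the scope readings can be converted into core cycles. The following is only a minimal sketch of that arithmetic using the figures quoted above; the function names are made up for illustration, and the cycle counts include the call and pin-toggle overhead.

```c
#include <stdio.h>

/* Convert a measured pulse width into core clock cycles (us * MHz = cycles). */
double pulse_to_cycles(double pulse_us, double core_mhz)
{
    return pulse_us * core_mhz;
}

/* Cycles spent per multiply-accumulate, overhead included. */
double cycles_per_mac(double pulse_us, double core_mhz, unsigned n_mac)
{
    return pulse_to_cycles(pulse_us, core_mhz) / (double)n_mac;
}

/* Print the figures for the two measurements reported in this thread. */
void print_report(void)
{
    printf("F413: %.0f cycles, %.2f cycles/MAC\n",
           pulse_to_cycles(7.0, 100.0), cycles_per_mac(7.0, 100.0, 127u));
    printf("H7A3: %.0f cycles, %.2f cycles/MAC\n",
           pulse_to_cycles(6.0, 280.0), cycles_per_mac(6.0, 280.0, 127u));
}
```

With these numbers the F413 needs 700 cycles (about 5.5 per MAC) and the H7A3 needs 1680 cycles (about 13.2 per MAC), so the H7A3 is spending more cycles on the same work, not fewer.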

AScha.3 · ‎2022-10-28

7 -> 6 us is not what we would expect. But the test time is very short, so to make sure that pin switching or something else is not affecting the result:

Can you make a loop and run your test 1000x? Then see on the scope 7 -> ? ms.

And check the cache setting. I would try this (and also try using normal memory).
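
A rough sketch of the repeated test suggested above, written so that it also compiles off-target. Everything here is an assumption for illustration: `TEST_PIN_HIGH`/`TEST_PIN_LOW` are hypothetical hooks for the scope pin, and `dot_prod_q15_ref` is a portable stand-in for `arm_dot_prod_q15()`, which is what would be called on the real MCU. For the cache part, CMSIS on Cortex-M7 provides `SCB_EnableICache()` and `SCB_EnableDCache()`, normally called once at startup before the measurement.

```c
#include <stdint.h>

#define N_TAPS   127u    /* vector length from the original post */
#define N_REPEAT 1000u   /* number of repetitions suggested above */

/* Hypothetical pin hooks: on the target these would set and clear the scope pin. */
#ifndef TEST_PIN_HIGH
#define TEST_PIN_HIGH() ((void)0)
#define TEST_PIN_LOW()  ((void)0)
#endif

/* Portable stand-in for arm_dot_prod_q15(): 16x16-bit products summed
   into a 64-bit accumulator. */
void dot_prod_q15_ref(const int16_t *a, const int16_t *b, uint32_t n, int64_t *result)
{
    int64_t acc = 0;
    for (uint32_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    *result = acc;
}

/* Run the kernel N_REPEAT times inside one pulse and return the last result. */
int64_t timed_burst(const int16_t *a, const int16_t *b)
{
    volatile int64_t sink = 0;   /* keeps the result from being discarded */

    TEST_PIN_HIGH();
    for (uint32_t r = 0; r < N_REPEAT; r++) {
        int64_t acc;
        dot_prod_q15_ref(a, b, N_TAPS, &acc);
        sink = acc;
        /* Compiler barrier (GCC/Clang syntax, an assumption about the toolchain)
           so the optimizer cannot hoist the computation out of the loop. */
        __asm__ volatile ("" ::: "memory");
    }
    TEST_PIN_LOW();

    return sink;
}
```

With 1000 repetitions inside one pulse, a 7 us call should show up as a pulse of roughly 7 ms, so the cost of toggling the pin becomes negligible.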

Uwe Bonnes · ‎2022-10-28

Check cache setting!

MM..1 · ‎2022-10-28

Where does your calculation data come from? Maybe the buses on both MCUs run at similar speeds... FPU?

Compare all the clock configurations.

AScha.3 · ‎2022-10-28

... and the optimizer setting!

nNg.1 · ‎2022-10-28

What cache? I put the critical code in ITCM and the data in DTCM. The difference is small. According to the manual, the core takes more than 1 cycle per data item, so I am not surprised that DTCM does not speed up the routine.

nNg.1 · ‎2022-10-28

I put the critical code in ITCM and the data in DTCM. The difference is small. According to the manual, the core takes more than 1 cycle per data item, so I am not surprised that DTCM does not speed up the routine.

It is an integer multiply-accumulate. I do not use the FPU.

According to STM32CubeMX, I am certain that the MCU is running at 280 MHz, and the other clocks are already set to the maximum allowed.

nNg.1 · ‎2022-10-28

I have already set it to optimize for speed. Even if I had not, that could not account for a 2.8x difference.

Piranha · ‎2022-10-28

For a performance measurement, measure at least tens or hundreds of thousands of iterations and disable interrupts during the measurement.

The Cortex-M7 in the STM32H7A3 should be about 5 times faster than the Cortex-M4 in the STM32F413, not 2.8 times.
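
The measurement advice above could look roughly like the sketch below. It is a sketch under assumptions, not a definitive implementation: `BENCH_ON_TARGET` is a hypothetical build flag, `run_masked`, `per_call_us` and `empty_kernel` are illustrative names, and the host branch only provides stubs so the code compiles off-target. On the MCU, the CMSIS intrinsics `__get_PRIMASK()`, `__disable_irq()` and `__set_PRIMASK()` do the interrupt masking.

```c
#include <stdint.h>

#ifdef BENCH_ON_TARGET              /* hypothetical flag for the real MCU build */
#include "stm32h7xx.h"              /* ST device header, pulls in the CMSIS core */
#define IRQ_STATE_SAVE()      __get_PRIMASK()
#define IRQ_DISABLE()         __disable_irq()
#define IRQ_STATE_RESTORE(s)  __set_PRIMASK(s)
#else                               /* host stubs so the sketch compiles off-target */
#define IRQ_STATE_SAVE()      0u
#define IRQ_DISABLE()         ((void)0)
#define IRQ_STATE_RESTORE(s)  ((void)(s))
#endif

typedef void (*kernel_fn)(void);

/* Call fn() `iterations` times with interrupts masked; returns the call count. */
uint32_t run_masked(kernel_fn fn, uint32_t iterations)
{
    uint32_t done = 0;
    uint32_t irq_state = IRQ_STATE_SAVE();   /* remember the previous mask state */

    IRQ_DISABLE();
    /* set the scope pin here */
    for (; done < iterations; done++) {
        fn();
    }
    /* clear the scope pin here */
    IRQ_STATE_RESTORE(irq_state);            /* restore instead of blindly enabling */

    return done;
}

/* Average time of one call, given the total pulse width read from the scope. */
double per_call_us(double pulse_ms, uint32_t iterations)
{
    return pulse_ms * 1000.0 / (double)iterations;
}

/* Empty stand-in kernel; on the target this would wrap the DSP call. */
void empty_kernel(void)
{
}
```

As a made-up example of the conversion, a 200 ms pulse over 100000 iterations would correspond to 2 us per call. Saving and restoring PRIMASK, rather than calling `__enable_irq()` at the end, leaves interrupts as they were if the benchmark is started from an already masked context.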

nNg.1 · ‎2022-10-28

Thanks a lot. It works. The time is reduced to less than 2 us, so the speedup is more than 2.8x.