nNg.1
Associate II
October 28, 2022
Solved

I recently ported my STM32F413 application (running at 100 MHz) to an STM32H7A3 (running at 280 MHz), but my DSP routine does not run 2.8x faster. Does anyone have any idea why?


I measured the time taken to run a length-127 multiply-accumulate routine by toggling a GPIO pin before and after the arm_dot_prod_q15() call, then measuring the pulse width with a scope. It was 7 µs on the STM32F413 and 6 µs on the STM32H7A3.
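To put those two pulse widths on a common footing, they can be converted to core cycles. The sketch below is only illustrative arithmetic: `pulse_to_cycles` and `cycles_per_mac` are hypothetical helper names, and the result includes the GPIO toggle overhead, so it is an upper bound on the routine itself.

```c
#include <stdint.h>

/* Core cycles that elapse during a pulse of pulse_us microseconds
 * when the core runs at clock_mhz MHz (1 MHz = 1 cycle per microsecond). */
static uint32_t pulse_to_cycles(uint32_t pulse_us, uint32_t clock_mhz)
{
    return pulse_us * clock_mhz;
}

/* Average cycles spent per multiply-accumulate, including any
 * measurement overhead hidden inside the pulse. */
static double cycles_per_mac(uint32_t cycles, uint32_t n_macs)
{
    return (double)cycles / (double)n_macs;
}
```

With the figures from the post, `pulse_to_cycles(7, 100)` gives 700 cycles on the STM32F413 and `pulse_to_cycles(6, 280)` gives 1680 cycles on the STM32H7A3. The faster part is spending more than twice as many cycles on the same 127 MACs, which points at configuration or measurement rather than at the core.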

5 replies

Uwe Bonnes
Chief
October 28, 2022

Check cache setting!

nNg.1 (Author)
Associate II
October 28, 2022

What cache? I put the critical code in ITCM and the data in DTCM. The difference is small. According to the manual, the core takes more than 1 cycle per data item, so I am not surprised that DTCM does not speed up the routine.

MM..1
Chief III
October 28, 2022

Where does your calculation data come from? Maybe the buses on both MCUs run at similar speeds... Are you using the FPU?

Compare the complete clock configuration of both parts.

nNg.1 (Author)
Associate II
October 28, 2022

I put the critical code in ITCM and the data in DTCM. The difference is small. According to the manual, the core takes more than 1 cycle per data item, so I am not surprised that DTCM does not speed up the routine.

It is an integer multiply-accumulate. I do not use the FPU.

According to STM32CubeMX, I am certain that the MCU is running at 280 MHz, and the other clocks are already set to the maximum allowed.
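For readers unfamiliar with the routine under test, here is a minimal portable sketch of a Q15 dot product. It is not the CMSIS-DSP implementation, which uses SIMD instructions and loop unrolling on Cortex-M4/M7; `dot_prod_q15_ref` is a hypothetical name, and the 64-bit accumulation mirrors the 34.30 result format that, as far as I know, arm_dot_prod_q15() documents.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical reference: integer multiply-accumulate over n Q15 samples.
 * Each 16x16 product fits in 32 bits (Q30); the sum is kept in 64 bits
 * so that long vectors cannot overflow the accumulator. */
static int64_t dot_prod_q15_ref(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

No floating-point operation appears anywhere in the loop, which is consistent with the FPU not being a factor here.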

AScha.3
Super User
October 28, 2022

... and check the optimizer setting!

nNg.1 (Author)
Associate II
October 28, 2022

I have already set the compiler to optimize for speed. Even if I had not, that could not account for a 2.8x difference.

AScha.3 (Best answer)
Super User
October 28, 2022

7 -> 6 µs is not what we would expect. The test time is very short, so to make sure that pin switching or something else is not affecting the result:

Can you put the test in a loop and run it 1000 times? Then look at the scope: 7 -> ? ms.

As for the cache setting, I would try this (and also try using normal memory):

0693W00000VONkJQAX.png 

nNg.1 (Author)
Associate II
October 28, 2022

Thanks a lot. It works. The time is reduced to less than 2 µs, so the speedup is more than 2.8x.

Piranha
Principal III
October 28, 2022

For a performance measurement, run at least tens or hundreds of thousands of iterations and disable interrupts during the measurement.

The Cortex-M7 in the STM32H7A3 should be about 5 times faster than the Cortex-M4 in the STM32F413, not 2.8 times.
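The advice above (many iterations, interrupts off, one pulse around the whole loop) could be sketched as follows. Everything here is an assumption for illustration: `bench_hooks`, `bench_run`, `per_call_ns` and the `stub_*` functions are hypothetical names. On a real target the hooks would typically wrap the CMSIS intrinsics `__disable_irq()` / `__enable_irq()` and a GPIO write, and `work` would call arm_dot_prod_q15(); the stubs only let the sketch be dry-run on a host.

```c
#include <stdint.h>

typedef void (*hook_fn)(void);

/* Target-specific actions, supplied by the caller (assumed wiring). */
typedef struct {
    hook_fn irq_disable;  /* e.g. a wrapper around __disable_irq() */
    hook_fn irq_enable;   /* e.g. a wrapper around __enable_irq()  */
    hook_fn pin_high;     /* drive the scope pin high              */
    hook_fn pin_low;      /* drive the scope pin low               */
} bench_hooks;

/* One pulse around the whole loop, so the pin-toggle overhead is paid
 * once instead of once per call, with interrupts masked throughout. */
static void bench_run(const bench_hooks *h, hook_fn work, uint32_t iterations)
{
    h->irq_disable();
    h->pin_high();
    for (uint32_t i = 0; i < iterations; i++) {
        work();
    }
    h->pin_low();
    h->irq_enable();
}

/* Average time per call in ns, from the pulse width read off the scope in us. */
static uint32_t per_call_ns(uint32_t pulse_us, uint32_t iterations)
{
    return (uint32_t)(((uint64_t)pulse_us * 1000u) / iterations);
}

/* Host-side stand-ins for a dry run; not needed on the target. */
static uint32_t g_work_calls, g_pin_edges;
static void stub_noop(void) {}
static void stub_pin(void)  { g_pin_edges++; }
static void stub_work(void) { g_work_calls++; }
```

For example, a 7000 µs pulse over 1000 iterations corresponds to `per_call_ns(7000, 1000)`, i.e. 7000 ns per call. The loop overhead is still included in that figure, so timing an empty `work` function as a baseline is a reasonable refinement.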

nNg.1 (Author)
Associate II
October 28, 2022

Yes, indeed. It is more than 2.8x, and close to 5x.