
I recently ported my STM32F413 application (running at 100 MHz) to an STM32H7A3 (running at 280 MHz), but my DSP routine does not run 2.8x faster. Does anyone have an idea why?

nNg.1
Associate II

I measured the time taken to run a length-127 multiply-accumulate routine by toggling a GPIO pin before and after the arm_dot_prod_q15() call and reading the pulse width on a scope. It was 7 µs on the STM32F413 and 6 µs on the STM32H7A3.
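To put the two pulse widths on a common footing, they can be converted into core clock cycles per vector element. The helper below is only an illustrative sketch (the name `cycles_per_element` is made up for this example), and the result includes the pin-toggle and call overhead, not just the multiply-accumulate work.

```c
#include <assert.h>

/* Core clock cycles spent per vector element for a measured pulse width.
 *   pulse_us:  pulse width read from the scope, in microseconds
 *   clock_mhz: core clock in MHz, i.e. cycles per microsecond
 *   n:         number of multiply-accumulate steps (127 in the post)
 */
double cycles_per_element(double pulse_us, double clock_mhz, unsigned n)
{
    assert(n > 0);
    return pulse_us * clock_mhz / (double)n;
}
```

With the figures from the post, `cycles_per_element(7.0, 100.0, 127)` comes to about 5.5 and `cycles_per_element(6.0, 280.0, 127)` to about 13.2. The H7 run therefore spends more than twice as many core cycles per element, which hints that the core is waiting on something, or that fixed overhead dominates, rather than that the arithmetic itself is slow.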

1 ACCEPTED SOLUTION

AScha.3
Chief III

7 -> 6 µs is not what we would expect. But that is a very short test time. To make sure that pin switching or something else is not distorting the result:

Can you put the test in a loop and run it 1000x? Then check on the scope: 7 -> ? ms.

And the cache setting: I would try this (also try using normal memory):

[Attached screenshot: 0693W00000VONkJQAX.png]
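The repeated-run test suggested above might look roughly like the following sketch. `test_pin_high()` and `test_pin_low()` are hypothetical placeholders for the real GPIO writes, and `dot_prod_q15_ref()` is a plain C stand-in for `arm_dot_prod_q15()` so that the sketch builds off-target; on the MCU the CMSIS-DSP call would take its place.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hooks: on the target these would set and clear the GPIO pin
 * probed by the scope. Here they only count edges so the sketch builds on a host. */
static unsigned pin_edges;
void test_pin_high(void) { pin_edges++; }
void test_pin_low(void)  { pin_edges++; }

/* Portable stand-in for arm_dot_prod_q15(): Q15 x Q15 products are summed
 * in a 64-bit accumulator, as the CMSIS-DSP routine does. */
int64_t dot_prod_q15_ref(const int16_t *a, const int16_t *b, uint32_t n)
{
    int64_t acc = 0;
    for (uint32_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}

/* One scope pulse around `repeats` back-to-back calls. The volatile sink keeps
 * the results from being discarded; with an inlinable kernel the optimizer may
 * still hoist the computation, so the disassembly is worth a look. */
int64_t timed_burst(const int16_t *a, const int16_t *b,
                    uint32_t n, uint32_t repeats)
{
    volatile int64_t sink = 0;
    assert(repeats > 0);

    test_pin_high();
    for (uint32_t r = 0; r < repeats; r++) {
        sink = dot_prod_q15_ref(a, b, n);
    }
    test_pin_low();

    return sink;
}
```

With 1000 repeats, a call that really takes 7 µs would stretch the pulse to about 7 ms, so the cost of the two pin writes becomes negligible next to the work being measured.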

If you feel a post has answered your question, please click "Accept as Solution".


10 REPLIES
Uwe Bonnes
Principal III

Check cache setting!
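For context: the Cortex-M7 core has separate instruction and data caches, and both are disabled after reset, whereas the Cortex-M4 in the STM32F413 has no core caches at all. A minimal sketch of switching them on with the CMSIS-Core functions, assuming the ST device header `stm32h7xx.h` is on the include path (the helper name `enable_core_caches` is made up for this example; CubeMX can generate equivalent calls when the caches are enabled in its Cortex-M7 settings):

```c
#include "stm32h7xx.h"  /* assumed device header; pulls in core_cm7.h */

/* Hypothetical helper: call once at start-up, before the code being timed. */
static inline void enable_core_caches(void)
{
    SCB_EnableICache();  /* CMSIS-Core: enable the instruction cache */
    SCB_EnableDCache();  /* CMSIS-Core: enable the data cache */
}
```

Accesses to ITCM and DTCM bypass the caches, so this mainly matters for code and data that live in flash or AXI SRAM.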

MM..1
Chief III

Where does your calculation data come from? Maybe the buses on both MCUs run at similar speeds... FPU?

Compare all clock configurations.

AScha.3
Chief III

... and the optimizer setting!


What cache? I put the critical code in ITCM and the data in DTCM, and the difference is small. According to the manual, the core takes more than 1 cycle per data item, so I am not surprised that DTCM does not speed up the routine.


It is an integer multiply-accumulate. I do not use the FPU.

According to STM32CubeMX, I am certain that the MCU is running at 280 MHz, and the other clocks are already set to the maximum allowed.

I have already set the compiler to optimize for speed. Even if I had not, that could not account for a 2.8x difference.

Piranha
Chief II

For a performance measurement, measure at least tens or hundreds of thousands of iterations and disable interrupts during the measurement.

The Cortex-M7 in the STM32H7A3 should be about 5 times faster than the Cortex-M4 in the STM32F413, not 2.8 times.
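One way to follow this advice is a small harness that masks interrupts, reads a free-running cycle counter around many iterations, and reports the average. The sketch below keeps the hardware access behind hypothetical hooks (`bench_hooks` and the `fake_*` functions are invented for this example) so it can be tried on a host. On the target the hooks would presumably wrap `DWT->CYCCNT`, `__disable_irq()` and `__enable_irq()`, with the cycle counter enabled beforehand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hardware hooks, so the harness itself stays portable. */
typedef struct {
    uint32_t (*read_cycles)(void);  /* target: return DWT->CYCCNT */
    void (*irq_off)(void);          /* target: __disable_irq()    */
    void (*irq_on)(void);           /* target: __enable_irq()     */
} bench_hooks;

/* Average cycles per call of `kernel` over `iterations` runs with
 * interrupts masked for the whole measurement. */
uint32_t bench_avg_cycles(const bench_hooks *h, void (*kernel)(void),
                          uint32_t iterations)
{
    assert(h != 0 && kernel != 0 && iterations > 0);

    h->irq_off();
    uint32_t start = h->read_cycles();
    for (uint32_t i = 0; i < iterations; i++) {
        kernel();
    }
    /* Unsigned subtraction stays correct across a single counter wrap. */
    uint32_t elapsed = h->read_cycles() - start;
    h->irq_on();

    return elapsed / iterations;
}

/* Host-side fakes: the "kernel" simply advances a simulated counter by
 * 560 cycles, which is what 2 us at 280 MHz would amount to. */
static uint32_t fake_cycles;
uint32_t fake_read_cycles(void) { return fake_cycles; }
void fake_irq_off(void) { }
void fake_irq_on(void) { }
void fake_kernel(void) { fake_cycles += 560u; }
```

A 32-bit cycle counter at 280 MHz wraps roughly every 15 seconds, so the total run has to stay below that for the single-wrap assumption to hold.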

Thanks a lot. It works. The time is reduced to less than 2 µs, so the speed-up is more than 2.8x.