Associate II

Solved

I recently ported my STM32F413 application (running at 100 MHz) to an STM32H7A3 (running at 280 MHz), but my DSP routine does not run 2.8x faster. Does anyone have any idea why?

October 28, 2022
5 replies
2571 views

I measured the time taken to run a length-127 multiply-accumulate routine by toggling a GPIO pin before and after the arm_dot_prod_q15() call and measuring the pulse width on a scope. It was 7 us on the STM32F413 and 6 us on the STM32H7A3.
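
To compare the two scope readings on a common footing, it can help to convert each pulse width into core cycles per sample. The following is a minimal sketch; the function name `cycles_per_sample` is hypothetical, and the inputs are the figures quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a measured pulse width into CPU cycles per processed sample.
 *   pulse_us  : pulse width read from the scope, in microseconds
 *   core_mhz  : core clock in MHz (us * MHz = cycles)
 *   n_samples : vector length passed to the routine, must be > 0
 * The pulse also contains the GPIO toggle and call overhead, so the
 * result is an upper bound on the cost of the routine itself. */
static double cycles_per_sample(double pulse_us, double core_mhz,
                                uint32_t n_samples)
{
    assert(n_samples > 0);
    return (pulse_us * core_mhz) / (double)n_samples;
}
```

With the numbers above, 7 us at 100 MHz is 700 cycles, or about 5.5 cycles per sample, while 6 us at 280 MHz is 1680 cycles, or about 13.2 cycles per sample. That suggests the H7A3 was spending most of its time on something other than the multiply-accumulate itself.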

This topic has been closed for replies.

Best answer by AScha.3

7 -> 6 us is not what we would expect, but the test time is very short. To make sure that pin switching or something else is not affecting the result:

Can you put your test in a loop and run it 1000x? Then check on the scope: 7 -> ? ms.

And the cache setting: I would try this too (and also try using normal memory).
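
For the cache suggestion, a minimal sketch, assuming a CMSIS-based project for the STM32H7: `SCB_EnableICache()` and `SCB_EnableDCache()` are CMSIS-Core functions for the Cortex-M7, while the build flag `TARGET_CORTEX_M7` and the wrapper name `enable_cpu_caches` are hypothetical and only there so the snippet also compiles off-target.

```c
#include <assert.h>
#include <stdbool.h>

#ifdef TARGET_CORTEX_M7      /* hypothetical flag, defined only for the target build */
#include "stm32h7xx.h"       /* assumed ST device header; pulls in core_cm7.h */
#endif

/* Enable the Cortex-M7 instruction and data caches, typically early in main().
 * Returns true if the caches were enabled, false on a host build. */
static bool enable_cpu_caches(void)
{
#ifdef TARGET_CORTEX_M7
    SCB_EnableICache();      /* CMSIS-Core, core_cm7.h */
    SCB_EnableDCache();      /* CMSIS-Core, core_cm7.h */
    return true;
#else
    return false;            /* nothing to enable off-target */
#endif
}
```

The caches only affect code and data in cacheable regions such as flash or AXI SRAM; TCM accesses bypass them. If DMA buffers live in cacheable memory, the data cache also needs cache maintenance or a suitable MPU configuration.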

Uwe Bonnes

Chief

Check cache setting!

nNg.1Author

Associate II

What cache? I put the critical code in ITCM and the data in DTCM. The difference is small. According to the manual, the core takes more than 1 cycle per data item, so I am not surprised that DTCM does not speed up the routine.

MM..1

Chief III

Where does your calculation data come from? Maybe the buses on both MCUs run at a similar speed... FPU?

Compare the complete clock configuration.

nNg.1Author

Associate II

I put the critical code in ITCM and the data in DTCM. The difference is small. According to the manual, the core takes more than 1 cycle per data item, so I am not surprised that DTCM does not speed up the routine.

It is an integer multiply-accumulate. I do not use the FPU.

From STM32CubeMX, I am certain that the MCU is running at 280 MHz, and the other clocks are already set to the maximum allowed.

AScha.3

Super User

... and the optimizer setting!

nNg.1Author

Associate II

I have already set it to optimize for speed. Even if I had not, that could not account for a 2.8x difference.

nNg.1Author

Associate II

Thanks a lot. It works. The time is reduced to less than 2 us. The speedup is more than 2.8x.

Piranha

Principal III

For a performance measurement, run at least tens or hundreds of thousands of iterations and disable interrupts during the measurement.

The Cortex-M7 in the STM32H7A3 should be about 5 times faster than the Cortex-M4 in the STM32F413, not 2.8 times.
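
The measurement advice above can be sketched as follows. This is only an illustration under stated assumptions: the `BENCH_*` hooks are hypothetical no-ops here and would map to something like the CMSIS intrinsics `__disable_irq()` / `__enable_irq()` and a GPIO set/reset on the target, and `dot_q15_ref` is a plain-C stand-in, not the CMSIS-DSP `arm_dot_prod_q15()` implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks: no-ops on a host build. On a Cortex-M target they
 * could map to __disable_irq()/__enable_irq() and a GPIO set/reset. */
#ifndef BENCH_IRQ_OFF
#define BENCH_IRQ_OFF()  ((void)0)
#define BENCH_IRQ_ON()   ((void)0)
#define BENCH_PIN_HIGH() ((void)0)
#define BENCH_PIN_LOW()  ((void)0)
#endif

/* Plain-C Q15 dot product used as the kernel under test (a stand-in,
 * not the CMSIS-DSP routine). */
static int64_t dot_q15_ref(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Run the kernel `iterations` times inside a single pin pulse with
 * interrupts masked, and return the last result. */
static int64_t bench_dot_q15(const int16_t *a, const int16_t *b,
                             size_t n, uint32_t iterations)
{
    /* Reading the pointer through a volatile object on every pass stops
     * the compiler from hoisting the computation out of the loop. */
    const int16_t *volatile a_in = a;
    volatile int64_t sink = 0;

    assert(iterations > 0);
    BENCH_IRQ_OFF();
    BENCH_PIN_HIGH();
    for (uint32_t k = 0; k < iterations; k++) {
        sink = dot_q15_ref(a_in, b, n);
    }
    BENCH_PIN_LOW();
    BENCH_IRQ_ON();
    return sink;
}
```

The average time per call is the measured pulse width divided by `iterations`, so the one-off cost of toggling the pin becomes negligible. With 1000 iterations, for example, a 7 us call would show up as a pulse of roughly 7 ms.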

nNg.1Author

Associate II

Yes, indeed. It is more than 2.8x, and close to 5x.
