2022-10-28 02:56 AM
I measured the time taken to run a length-127 multiply-accumulate routine by toggling a GPIO pin before and after the arm_dot_prod_q15() call, then measuring the pulse width on a scope. It was 7 us on the STM32F413 and 6 us on the STM32H7A3.
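For reference, the operation being timed is a Q15 dot product: each pair of 16-bit samples is multiplied and the products are summed in a 64-bit accumulator. Below is a minimal portable sketch of that computation, useful for checking results on a host. `dot_prod_q15_ref` is a hypothetical name, and this is not the CMSIS-DSP implementation, which uses optimized instructions on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine, not the CMSIS-DSP source.
 * Multiplies 16-bit Q15 samples pairwise and accumulates the
 * 32-bit products in a 64-bit sum, so 127 taps cannot overflow. */
int64_t dot_prod_q15_ref(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* Widen before multiplying so the product is computed in 32 bits. */
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

The work per element is one multiply and one add, which is why the thread expects the run time to scale roughly with core clock.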
2022-10-28 07:23 AM
Check cache setting!
2022-10-28 07:26 AM
Where does your calculation data come from? Maybe the buses on both MCUs run at similar speeds... FPU?
Compare the complete clock configuration.
2022-10-28 07:33 AM
... and the optimizer setting!
2022-10-28 08:16 AM
What cache? I put the critical code in ITCM and the data in DTCM. The difference is small. According to the manual, the core takes more than 1 cycle per data item, so I am not surprised that DTCM does not speed up the routine.
2022-10-28 08:23 AM
I put the critical code in ITCM and the data in DTCM. The difference is small. According to the manual, the core takes more than 1 cycle per data item, so I am not surprised that DTCM does not speed up the routine.
It is an integer multiply-accumulate. I do not use the FPU.
According to STM32CubeMX, I am certain that the MCU is running at 280 MHz, and the other clocks are already set to the maximum allowed.
2022-10-28 08:24 AM
I have already set it to optimize for speed. Even if I had not, that could not account for a 2.8x difference.
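One way to sanity-check the numbers so far is to convert the scope readings into core cycles per element. The sketch below does that arithmetic. The 280 MHz figure comes from the posts above, while a 100 MHz clock for the STM32F413 is an assumption (it is not stated in the thread, only implied by the 2.8x ratio), and `cycles_per_element` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: microseconds times MHz gives core cycles,
 * which are then divided by the number of elements processed. */
double cycles_per_element(double time_us, double core_mhz, uint32_t n)
{
    assert(n > 0);
    return (time_us * core_mhz) / (double)n;
}
```

With 7 us at an assumed 100 MHz and 127 elements this gives roughly 5.5 cycles per element on the F413, and with 6 us at 280 MHz roughly 13 cycles per element on the H7A3. If the assumed clock is right, the M7 is spending more cycles per element than the M4, which points at stalls or measurement overhead rather than the arithmetic itself.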
2022-10-28 09:59 AM
7 -> 6 us: not what we would expect. But the test time is very short. To make sure that pin switching or something else is not affecting the result:
can you put your test in a loop and run it 1000x? Then see on the scope what the 7 ms becomes.
And for the cache setting, I would try this (also try using normal memory).
2022-10-28 04:28 PM
For a performance measurement, run at least tens or hundreds of thousands of iterations and disable interrupts during the measurement.
The Cortex-M7 in the STM32H7A3 should be about 5 times faster than the Cortex-M4 in the STM32F413, not 2.8 times.
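The measurement advice in this thread (repeat the routine many times inside one GPIO pulse, with interrupts masked) could be sketched roughly as below. All names are hypothetical. On the target, the hooks would wrap the GPIO write and the interrupt mask (for example the CMSIS-Core intrinsics `__disable_irq()` and `__enable_irq()`), and `work` would call arm_dot_prod_q15(). The stubs here only record the call order so the harness can be dry-run on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hooks: replace with real GPIO and IRQ-mask calls on the target. */
typedef struct {
    void (*pin_high)(void);
    void (*pin_low)(void);
    void (*irq_off)(void);
    void (*irq_on)(void);
} bench_hooks;

/* Runs work() reps times inside a single GPIO pulse with interrupts
 * masked, so the pulse width is reps calls plus loop overhead. */
void bench_burst(const bench_hooks *h, void (*work)(void), uint32_t reps)
{
    assert(h != 0 && work != 0 && reps > 0);
    h->irq_off();
    h->pin_high();
    for (uint32_t i = 0; i < reps; i++) {
        work();
    }
    h->pin_low();
    h->irq_on();
}

/* Host-side stubs for a dry run: they log one letter per event. */
char bench_trace[8];
unsigned bench_trace_len;
uint32_t bench_work_calls;

static void bench_log(char c)
{
    if (bench_trace_len < sizeof bench_trace - 1) {
        bench_trace[bench_trace_len++] = c;
    }
}

void stub_irq_off(void)  { bench_log('d'); }
void stub_pin_high(void) { bench_log('H'); }
void stub_pin_low(void)  { bench_log('L'); }
void stub_irq_on(void)   { bench_log('e'); }
void stub_work(void)     { bench_work_calls++; }
```

With 1000 repetitions the pulse should read in milliseconds instead of microseconds, and dividing the pulse width by the repetition count gives the per-call time with the pin-toggle overhead amortized.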
2022-10-28 04:40 PM
Thanks a lot. It works. The time is reduced to less than 2 us, so the speedup is more than 2.8x.