2022-12-04 10:50 PM
I'm using the STM32F769DI and STM32Cube for a project which will use the CMSIS DSP libraries in a computationally demanding application.
My C code is set for fast optimisation. __FPU_USED and FPU_PRESENT are set. ARM_MATH_CM7 is defined. I compare two blocks of simple float32 multiplication, one using the CMSIS library and one using a simple loop. I'm linking in the arm_cortexM7lfdp_math library from the GCC directory. I can check the timing with a scope on LED2.
// 711 us
BSP_LED_On(LED2);
arm_mult_f32(x, y, z, 10000);
BSP_LED_Off(LED2);
// 50 us
BSP_LED_On(LED2);
for (int k = 0; k < 10000; k++) {
    z[k] = x[k] * y[k];
}
BSP_LED_Off(LED2);
Does anyone have any insight into why the simple for loop should be so much faster than the SIMD-based CMSIS library? Is there some additional initialisation that I've missed? I've tried compiling the CMSIS library into my own library but got pretty much the same results.
2022-12-04 11:04 PM
What is faster? I see no values...
Edit: ah, the times are in the comments. A 1:10 speed ratio? Something seems wrong.
And most important: what is the optimizer setting? (Try -O2.)
Edit 2: aaah, 10000 fmul in 50 us? That is 5 ns per float mul 🙂
No, the optimizer kicked the loop out because it is "useless".
Declare the x, y, z arrays as "volatile",
then test again.
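A minimal sketch of the "volatile" suggestion, written so it also builds on a desktop compiler. The array size matches the original post; the names fill_inputs and multiply_loop are hypothetical helpers added for illustration, and the input values are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define N 10000  /* block size used in the original post */

/* volatile tells the compiler that every load and store is observable,
   so the multiply loop cannot be dropped as dead code. */
static volatile float x[N], y[N], z[N];

/* Hypothetical helper: fill the inputs with known values. */
static void fill_inputs(void)
{
    for (size_t k = 0; k < N; k++) {
        x[k] = (float)k;
        y[k] = 0.5f;
    }
}

/* The plain loop from the post, limited to the first n elements. */
static void multiply_loop(size_t n)
{
    assert(n <= N);
    for (size_t k = 0; k < n; k++) {
        z[k] = x[k] * y[k];
    }
}
```

On the board, the LED on/off calls would go around multiply_loop(N). Note that volatile also stops the compiler from keeping values in registers across iterations, so the plain loop is likely to time slower than it would in real code, and passing volatile arrays to a function that takes plain float pointers needs a cast.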
2022-12-05 02:18 AM
I just ran a test: fmul on an H743, at 200 MHz core clock, -O2, caches on.
I used the DWT cycle counter:
https://embeddedb.blogspot.com/2013/10/how-to-count-cycles-on-arm-cortex-m.html
1 x fmul -> 15 ticks, 75 ns per float mul (5 ticks, 25 ns with data in cache)
1 x double mul -> 14 ticks, 70 ns per double mul
both: 95 ns; in a x10 loop: 755 ns
So (with some help from the optimizer and cache) a float mul should take about 25...70 ns.
Note that the float mul itself is 1 tick; the rest of the time is load/store of the registers plus wait states.
Using the DSP lib seems useless for basic math: what should be faster than 1-clock execution?
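A rough sketch of how the tick counts above turn into nanoseconds, plus the usual way to switch on the DWT cycle counter. The helper names are hypothetical. The register part is only compiled when a CMSIS core header (pulled in through the device header) has already defined DWT and CoreDebug, which is an assumption about the project setup; some Cortex-M7 parts may also need the DWT lock register written before the counter runs.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to nanoseconds for a given core clock in Hz. */
static uint64_t cycles_to_ns(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz > 0);
    return (uint64_t)cycles * 1000000000ull / core_hz;
}

/* Elapsed cycles between two CYCCNT readings. Unsigned subtraction
   stays correct across a single 32-bit wrap of the counter. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

#if defined(DWT) && defined(CoreDebug)
/* Only built on target, where the CMSIS core header provides these names. */
static inline void cycle_counter_enable(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; /* enable trace block */
    DWT->CYCCNT = 0;                                /* reset the counter  */
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;            /* start counting     */
}
#endif
```

With these, cycles_to_ns(15, 200000000) gives the 75 ns quoted above, and reading DWT->CYCCNT before and after the code under test gives the two values for cycles_elapsed.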
2022-12-06 07:45 PM
Thanks for that, AScha. Once I made the arrays volatile I got more reasonable results: 1415 us for the for loop and 650 us for the ARM library function. These were just test loops to ensure that the library was working correctly. -Ofast gives slightly faster results than -O3, but only marginally.
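For a quick sanity check on the timings in this thread, the per-element cost can be worked out as below. The function names are hypothetical, and the only inputs are the total times and the 10000-element block size already quoted.

```c
#include <assert.h>
#include <stddef.h>

/* Average time per element in nanoseconds, from a total time in microseconds. */
static double per_element_ns(double total_us, size_t n)
{
    assert(n > 0);
    return total_us * 1000.0 / (double)n;
}

/* How many times faster the second measurement is than the first. */
static double speedup(double slow_us, double fast_us)
{
    assert(fast_us > 0.0);
    return slow_us / fast_us;
}
```

With the volatile arrays this gives 141.5 ns per element for the plain loop and 65 ns for arm_mult_f32, a ratio of roughly 2.2. The original 50 us reading works out to 5 ns per element, which is the figure that looked too good to be true.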