CMSIS DSP Library performance

IWhit.2
Associate II

I'm using the STM32F769DI and STM32Cube for a project which will be using the CMSIS DSP libraries in a computationally demanding application.

My C code is compiled with fast optimisation (-Ofast). __FPU_USED and __FPU_PRESENT are set, and ARM_MATH_CM7 is defined. I compare two blocks of simple float32 multiplication: one using the CMSIS library and one using a plain loop. I'm linking in the arm_cortexM7lfdp_math library from the GCC directory. I can check the timing with a scope on LED2.
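For context, that configuration boils down to roughly the following (a sketch only: the macro names are the standard CMSIS ones, and whether they are set in source, in the device header, or on the compiler command line depends on the project):

/* Sketch: standard CMSIS configuration for the Cortex-M7 DSP library. */
#define ARM_MATH_CM7                 /* select the Cortex-M7 build of CMSIS-DSP */
#include "arm_math.h"                /* CMSIS-DSP API, e.g. arm_mult_f32()      */

#if (__FPU_PRESENT != 1U) || (__FPU_USED != 1U)
#error "Hardware FPU not enabled - check -mfpu / -mfloat-abi and the device defines"
#endif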

// 711 us
BSP_LED_On(LED2);
arm_mult_f32(x, y, z, 10000);
BSP_LED_Off(LED2);

// 50 us
BSP_LED_On(LED2);
for (int k = 0; k < 10000; k++) {
    z[k] = x[k] * y[k];
}
BSP_LED_Off(LED2);

Does anyone have any insight into why the simple for loop should be so much faster than the SIMD-based CMSIS library? Is there some additional initialisation necessary that I've missed? I've tried compiling the CMSIS library into my own library, but got pretty much the same results.


3 REPLIES
AScha.3
Chief II

What's faster? I see no values...

Edit: ah, the times are in the comments. A 1:10 speed difference? Something seems wrong.

And most important: what optimizer setting? (Try -O2.)

Edit 2: aaah, 10000 fmuls in 50 us?? That would be 5 ns per float mul 🙂

No, the optimizer kicked it out as "useless".

Declare the x, y, z arrays as volatile.

Then test again.
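A minimal sketch of that change, reusing the names and buffer size from the original post (the casts are only there so the CMSIS call still matches its float32_t * prototype; BSP_LED_On/Off come from the board BSP as in the question):

#include "arm_math.h"

#define N 10000

/* volatile stops the optimizer from dropping the "useless" multiply loop */
volatile float32_t x[N], y[N], z[N];

void mult_benchmark(void)
{
    BSP_LED_On(LED2);
    arm_mult_f32((float32_t *)x, (float32_t *)y, (float32_t *)z, N);   /* library version */
    BSP_LED_Off(LED2);

    BSP_LED_On(LED2);
    for (int k = 0; k < N; k++) {                                      /* hand-written version */
        z[k] = x[k] * y[k];
    }
    BSP_LED_Off(LED2);
}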

AScha.3
Chief II

I just tried a test: fmul on an H743 at a 200 MHz core clock, -O2, caches on.

I used the DWT cycle counter:

https://embeddedb.blogspot.com/2013/10/how-to-count-cycles-on-arm-cortex-m.html
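For reference, a minimal sketch of the kind of DWT cycle-counter setup that link describes (CMSIS core register names; dwt_init and measure_fmul are illustrative helpers, not part of any library):

#include "stm32h7xx.h"                 /* device header providing CoreDebug / DWT */

/* Enable the Cortex-M DWT cycle counter. */
static inline void dwt_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the trace/DWT block    */
    DWT->LAR          = 0xC5ACCE55;                  /* unlock (needed on some M7s)   */
    DWT->CYCCNT       = 0;                           /* reset the cycle counter       */
    DWT->CTRL        |= DWT_CTRL_CYCCNTENA_Msk;      /* start counting core cycles    */
}

/* Time a single float multiply in core ticks. */
static uint32_t measure_fmul(volatile float a, volatile float b, volatile float *out)
{
    uint32_t start = DWT->CYCCNT;
    *out = a * b;
    return DWT->CYCCNT - start;
}

At a 200 MHz core clock one tick is 5 ns, which is how the tick counts below convert to nanoseconds.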

1 × fmul -> 15 ticks, i.e. 75 ns per float mul (5 ticks, 25 ns with the data in cache).

1 × double mul -> 14 ticks, 70 ns per double mul.

Both together: 95 ns; in a ×10 loop: 755 ns.

So (with some help from the optimizer and the cache) a float mul should take about 25...70 ns.

Also: the float mul itself is 1 tick; the rest of the time is load/store of the registers plus wait states.

Using the DSP lib for basic math like this seems pointless, what could be faster than single-cycle execution?

IWhit.2
Associate II

Thanks for that, AScha. Once I made the arrays volatile I got more reasonable results: 1415 us for the plain for loop and 650 us for the ARM library function. These were just test loops to ensure that the library was working correctly. -Ofast gives slightly faster results than -O3, but only marginally.