CMSIS DSP Library performance

IWhit.2
Associate II

I'm using the STM32F769DI and STM32Cube for a project which will be using the CMSIS DSP libraries in a computationally demanding application.

My C code is compiled with fast optimisation (-Ofast). __FPU_USED and __FPU_PRESENT are set, and ARM_MATH_CM7 is defined. I compare two blocks of simple float32 multiplication: one using the CMSIS library and one using a plain loop. I'm linking in the arm_cortexM7lfdp_math library from the GCC directory. I can check the timing with a scope on LED2.
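For context, that configuration boils down to roughly the following (a sketch only: the macro names are the standard CMSIS ones, and whether they are set in source, in the device header, or on the compiler command line depends on the project):

/* Sketch: standard CMSIS configuration for the Cortex-M7 DSP library. */
#define ARM_MATH_CM7                 /* select the Cortex-M7 build of CMSIS-DSP */
#include "arm_math.h"                /* CMSIS-DSP API, e.g. arm_mult_f32()      */

#if (__FPU_PRESENT != 1U) || (__FPU_USED != 1U)
#error "Hardware FPU not enabled - check -mfpu / -mfloat-abi and the device defines"
#endif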

// 711 us
BSP_LED_On(LED2);
arm_mult_f32(x, y, z, 10000);
BSP_LED_Off(LED2);

// 50 us
BSP_LED_On(LED2);
for (int k = 0; k < 10000; k++) {
    z[k] = x[k] * y[k];
}
BSP_LED_Off(LED2);

Does anyone have any insight into why the simple for loop should be so much faster than the SIMD-based CMSIS library? Is there some additional initialisation necessary that I've missed? I've tried compiling the CMSIS library into my own library, but got pretty much the same results.


3 REPLIES
AScha.3
Chief II

What's faster? I see no values...

Edit: ah, the times are in the comments. A 1:10 speed difference? Something seems wrong.

And most important: what optimizer setting? (Try -O2.)

Edit 2: aaah, 10000 fmuls in 50 us?? That would be 5 ns per float mul 🙂

No, the optimizer kicked it out as "useless".

Declare the x, y, z arrays as volatile.

Then test again.
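A minimal sketch of that change, reusing the names and buffer size from the original post (the casts are only there so the CMSIS call still matches its float32_t * prototype; BSP_LED_On/Off come from the board BSP as in the question):

#include "arm_math.h"

#define N 10000

/* volatile stops the optimizer from dropping the "useless" multiply loop */
volatile float32_t x[N], y[N], z[N];

void mult_benchmark(void)
{
    BSP_LED_On(LED2);
    arm_mult_f32((float32_t *)x, (float32_t *)y, (float32_t *)z, N);   /* library version */
    BSP_LED_Off(LED2);

    BSP_LED_On(LED2);
    for (int k = 0; k < N; k++) {                                      /* hand-written version */
        z[k] = x[k] * y[k];
    }
    BSP_LED_Off(LED2);
}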

AScha.3
Chief II

I just tried a test: fmul on an H743 at a 200 MHz core clock, -O2, caches on.

I used the DWT cycle counter:

https://embeddedb.blogspot.com/2013/10/how-to-count-cycles-on-arm-cortex-m.html
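For reference, a minimal sketch of the kind of DWT cycle-counter setup that link describes (CMSIS core register names; dwt_init and measure_fmul are illustrative helpers, not part of any library):

#include "stm32h7xx.h"                 /* device header providing CoreDebug / DWT */

/* Enable the Cortex-M DWT cycle counter. */
static inline void dwt_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the trace/DWT block    */
    DWT->LAR          = 0xC5ACCE55;                  /* unlock (needed on some M7s)   */
    DWT->CYCCNT       = 0;                           /* reset the cycle counter       */
    DWT->CTRL        |= DWT_CTRL_CYCCNTENA_Msk;      /* start counting core cycles    */
}

/* Time a single float multiply in core ticks. */
static uint32_t measure_fmul(volatile float a, volatile float b, volatile float *out)
{
    uint32_t start = DWT->CYCCNT;
    *out = a * b;
    return DWT->CYCCNT - start;
}

At a 200 MHz core clock one tick is 5 ns, which is how the tick counts below convert to nanoseconds.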

1 × fmul -> 15 ticks, i.e. 75 ns per float mul (5 ticks, 25 ns with the data in cache).

1 × double mul -> 14 ticks, 70 ns per double mul.

Both together: 95 ns; in a ×10 loop: 755 ns.

So (with some help from the optimizer and the cache) a float mul should take about 25...70 ns.

Also: the float mul itself is 1 tick; the rest of the time is load/store of the registers plus wait states.

Using the DSP lib for basic math like this seems pointless, what could be faster than single-cycle execution?

IWhit.2
Associate II

Thanks for that, AScha. Once I made the arrays volatile I got more reasonable results: 1415 us for the plain for loop and 650 us for the ARM library function. These were just test loops to ensure that the library was working correctly. -Ofast gives slightly faster results than -O3, but only marginally.