Why is CMSIS/DSP arm_fir_f32() using more than one clock cycle per tap? (not using VMLA instruction)

Mnemocron · ‎2019-12-11

I am currently performing timing and performance measurements for the CMSIS/DSP FIR filter functions on an STM32F412. I toggle a GPIO before and after executing the arm_fir_f32() function and record the signal on a scope.

My results are:

Details: Sampling rate: 48kHz, Stereo channels, Signal buffer size to process: 64 per channel, CPU clock frequency: 100MHz. So the delay consists of 2 FIR filters with NUM_TAPS coefficients each.

My assumption:

The FPU should be able to perform one multiply-accumulate per clock cycle. I therefore expect something in the range of 1 clock cycle per tap instead of 3-4.
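For reference, a minimal sketch of how a scope reading can be converted into cycles per tap, using the setup described above (2 channels, 64 samples per block, 100 MHz, 48 kHz). The helper names are hypothetical and not part of CMSIS/DSP, and the model counts only multiply-accumulates, ignoring loads and loop overhead.

```c
#include <stdint.h>

/* Hypothetical helper: multiply-accumulates needed for one processed block
 * (one MAC per tap, per sample, per channel). */
static uint32_t macs_per_block(uint32_t channels, uint32_t block_size,
                               uint32_t num_taps)
{
    return channels * block_size * num_taps;
}

/* Hypothetical helper: convert a duration measured on the scope (seconds)
 * into CPU cycles spent per tap. */
static double cycles_per_tap(double elapsed_s, double f_cpu_hz,
                             uint32_t channels, uint32_t block_size,
                             uint32_t num_taps)
{
    double cycles = elapsed_s * f_cpu_hz;
    return cycles / (double)macs_per_block(channels, block_size, num_taps);
}

/* Hypothetical helper: CPU cycles available before the next block of
 * samples arrives (block period times CPU clock). */
static double cycle_budget_per_block(double f_cpu_hz, double f_sample_hz,
                                     uint32_t block_size)
{
    return f_cpu_hz * (double)block_size / f_sample_hz;
}
```

As an illustration with an assumed NUM_TAPS of 128 (the actual value is not stated here): one block needs 2 * 64 * 128 = 16384 multiply-accumulates, so 1 cycle per tap would take about 164 us at 100 MHz, while 3-4 cycles per tap corresponds to roughly 490-655 us. The budget per block is 64 / 48 kHz = 1.33 ms, or about 133333 cycles.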

My suspicion:

Looking at the assembly code (using Keil uVision 5) I can see two assembly instructions:

0x080008E8 EE628A0D VMUL.F32 s17,s4,s26
0x080008EC EE766AA8 VADD.F32 s13,s13,s17

which correspond to the following Line 465 in arm_fir_f32.c

acc3 += x4 * c0;

which is different from a single multiply & accumulate (VMLA) instruction.

Can somebody explain this or give context as to why it is not coded with the Cortex-M4 instruction VMLA?

https://developer.arm.com/docs/dui0553/a/the-cortex-m4-instruction-set/floating-point-instructions

Mnemocron · ‎2019-12-11

According to the instruction set description, the VMLA.F32 instruction in fact takes 3 clock cycles, not 1!

Which would explain why the assembly code uses two separate (probably faster) instructions.

Source: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/BEHJADED.html

This would then also explain why the FIR filter uses more than 3 clock cycles per tap.
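A rough sketch of that comparison, for anyone who wants to plug in the numbers. The 3-cycle figure for VMLA.F32 is the one quoted above; the 1-cycle figures for VMUL.F32 and VADD.F32, and the extra cycle when a result is consumed by the very next instruction, are my reading of the same timing table and should be checked against it. The function names are made up for illustration, and loads and loop overhead are not modelled.

```c
/* Cycle counts per instruction. VMLA is the value quoted in this thread;
 * VMUL/VADD and the dependency penalty are assumptions to verify in the TRM. */
enum {
    CYCLES_VMUL_F32 = 1,
    CYCLES_VADD_F32 = 1,
    CYCLES_VMLA_F32 = 3,
    CYCLES_DEPENDENCY_STALL = 1 /* result used by the next instruction */
};

/* Cost of one multiply-accumulate as VMUL + VADD when the compiler
 * schedules an independent instruction between the two. */
static int mac_cycles_separate_interleaved(void)
{
    return CYCLES_VMUL_F32 + CYCLES_VADD_F32;
}

/* Cost of one multiply-accumulate as VMUL immediately followed by the
 * VADD that consumes its result. */
static int mac_cycles_separate_back_to_back(void)
{
    return CYCLES_VMUL_F32 + CYCLES_DEPENDENCY_STALL + CYCLES_VADD_F32;
}

/* Cost of one multiply-accumulate as a single VMLA.F32. */
static int mac_cycles_vmla(void)
{
    return CYCLES_VMLA_F32;
}
```

Under these assumptions the separate pair costs 2 cycles when interleaved with other work (which an unrolled loop with several accumulators makes possible) and 3 cycles back to back, so it is never slower than VMLA.F32. The operand loads for each tap come on top of that.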
