2019-12-11 08:16 AM
I am currently performing timing and performance measurements for the CMSIS/DSP FIR filter functions on an STM32F412. I toggle a GPIO before and after executing the arm_fir_f32() function and record the signal on a scope.
My results are:
Details: sampling rate 48 kHz, stereo (2 channels), block size 64 samples per channel, CPU clock frequency 100 MHz. The measured time therefore covers 2 FIR filter calls with NUM_TAPS coefficients each.
My assumption:
The FPU should be able to process one multiply-accumulate per clock cycle. Therefore I expect something in the range of 1 clock cycle per tap instead of 3-4.
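The cycles-per-tap figure can be derived from the scope measurement with a small helper. This is only a sketch of the arithmetic: the function names are made up for illustration, and any NUM_TAPS or duration you plug in has to come from your own measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Number of multiply-accumulate operations needed for one processed block. */
static uint32_t macs_per_block(uint32_t channels, uint32_t block_size,
                               uint32_t num_taps)
{
    return channels * block_size * num_taps;
}

/* Convert a scope-measured duration (in microseconds) into CPU cycles
   spent per filter tap. */
static double cycles_per_tap(double measured_us, double cpu_hz,
                             uint32_t channels, uint32_t block_size,
                             uint32_t num_taps)
{
    assert(channels > 0 && block_size > 0 && num_taps > 0);
    double cycles = measured_us * 1e-6 * cpu_hz;
    return cycles / (double)macs_per_block(channels, block_size, num_taps);
}

/* Time between two blocks, i.e. the real-time budget per block. */
static double block_period_us(double sample_rate_hz, uint32_t block_size)
{
    assert(sample_rate_hz > 0.0);
    return 1e6 * (double)block_size / sample_rate_hz;
}
```

As a purely hypothetical example, NUM_TAPS = 32 with 2 channels and 64 samples per channel gives 4096 multiply-accumulates per block, so a measured 163.84 us at 100 MHz would correspond to about 4 cycles per tap. The block period at 48 kHz with 64 samples is about 1333 us.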
My suspicion:
Looking at the assembly code (using Keil uVision 5) I can see two assembly instructions:
0x080008E8 EE628A0D VMUL.F32 s17,s4,s26
0x080008EC EE766AA8 VADD.F32 s13,s13,s17
which correspond to line 465 of arm_fir_f32.c:
acc3 += x4 * c0;
which is different from a single multiply-accumulate (VMLA) instruction.
Can somebody explain this or give context as to why it is not coded with the Cortex-M4 instruction VMLA?
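For reference, the statement in question has the same form as the inner step of a plain dot product: one multiply and one add per tap. The sketch below is not the CMSIS implementation (which unrolls the loop and keeps several accumulators); `fir_dot` is a hypothetical name used only to show the arithmetic. Whether a compiler turns the loop body into separate VMUL/VADD instructions or into a single multiply-accumulate instruction depends on the toolchain and its options.

```c
#include <assert.h>
#include <stddef.h>

/* One FIR output sample as a dot product of a sample window with the
   coefficient array. The caller is assumed to have arranged both arrays
   so that window[k] belongs to coeffs[k]. */
static float fir_dot(const float *window, const float *coeffs, size_t num_taps)
{
    assert(window != NULL && coeffs != NULL && num_taps > 0);
    float acc = 0.0f;
    for (size_t k = 0; k < num_taps; ++k) {
        /* Same shape as "acc3 += x4 * c0;": a multiply followed by an add. */
        acc += window[k] * coeffs[k];
    }
    return acc;
}
```

Compiling such a function on its own and inspecting the disassembly is one way to check which instructions a given compiler setting produces, without the unrolling in arm_fir_f32.c getting in the way.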
https://developer.arm.com/docs/dui0553/a/the-cortex-m4-instruction-set/floating-point-instructions
2019-12-11 11:53 AM
According to the instruction set description, the VMLA.F32 instruction in fact takes 3 clock cycles, not 1!
That would explain why the assembly code uses two separate (probably faster) instructions.
Source: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/BEHJADED.html
It would also explain why the FIR filter needs more than 3 clock cycles per tap.
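A rough way to compare the two instruction sequences is to turn per-tap cycle counts into a lower bound on the block time. In this sketch only the 3 cycles for VMLA.F32 come from the post above; the 1 cycle each for VMUL.F32 and VADD.F32 are assumptions that should be checked against the timing table in the linked manual. Loads, stores, loop overhead and stalls between dependent instructions are ignored, so real measurements will be higher.

```c
#include <assert.h>
#include <stdint.h>

/* Cycle counts per instruction. VMLA_CYCLES is the figure quoted above;
   the other two are assumptions, not verified values. */
enum {
    VMLA_CYCLES = 3,
    VMUL_CYCLES = 1, /* assumption */
    VADD_CYCLES = 1  /* assumption */
};

/* Lower bound on the block processing time in microseconds for a given
   arithmetic cost per tap. Memory accesses and loop overhead are not
   included. */
static double block_time_us(uint32_t cycles_per_tap, uint32_t channels,
                            uint32_t block_size, uint32_t num_taps,
                            double cpu_hz)
{
    assert(cpu_hz > 0.0);
    double cycles = (double)cycles_per_tap * channels * block_size * num_taps;
    return cycles / cpu_hz * 1e6;
}
```

Under these assumptions, calling `block_time_us(VMUL_CYCLES + VADD_CYCLES, ...)` and `block_time_us(VMLA_CYCLES, ...)` with the same filter parameters gives the arithmetic-only bounds for the two variants; the gap between either bound and the scope measurement is the part spent on loading samples and coefficients and on loop control.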