Reading http://www.iar.com/Global/Resources/Developers_Toolbox/C_Cplusplus_Programming/Improve_performance_of_digital_signal_processing_with_IAR_Embedded_Workbench_for_ARM.pdf

page 3 :

arm_sqrt_f32 : 52 cycles

sqrt : 752 cycles

The cycle counts look right, BUT sqrt works on 64-bit double-precision floating-point numbers.

Try sqrtf instead: armcc will use the VSQRT instruction, which takes 14 cycles (25-28 cycles including the function call) => sqrtf is about 2 times faster than the DSP iterative algorithm (arm_sqrt_f32).

They used sqrt() as the software floating-point case, because sqrtf() would likely use the FPU in any reasonably chosen library. Try timing a software sqrtf() implementation against an FPU-assisted one.