2018-08-01 12:14 AM
I have been testing the performance of some maths functions on an STM32F407 CPU running at 168 MHz. The documentation states that vmul.f32 (the hardware 32-bit floating-point multiply instruction) should take 1 clock cycle to execute. In my simple test program it appears to take 2 clock cycles. The same test code running a "nop" or an "add" (32-bit integer add) does appear to take 1 cycle per instruction, so the test program looks OK, and the CPU's clock and FLASH configuration seem OK.
I am using a GCC compiler with my own build environment which has been in use for many years.
2018-08-04 12:15 AM
OK, sorted this. I missed the note at the bottom of the floating-point instruction cycle table in the ARM M4 documentation, which says:
"Floating-point arithmetic data processing instructions, such as add, subtract, multiply, divide, square-root, all forms of multiply with accumulate, as well as conversions of all types take one cycle longer if their result is consumed by the following instruction."
My simple test code was executing "vmul.f32 s16,s16,s16" back to back, so every instruction consumed the result of the one before it, and that is why each one takes two cycles.
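For anyone who wants to sanity-check their own test sequence against that note, here is a rough host-side sketch in C. It is not a model of the real pipeline: it only applies the quoted rule (one base cycle, plus one more if the next instruction reads the result) to a straight-line list of instructions. The names fp_insn and fp_seq_cycles are made up for this example, and the 1-cycle base cost is an assumption that fits vmul.f32 but not, say, divide or square root.

```c
#include <assert.h>
#include <stddef.h>

/* One single-precision FPU instruction of the form: dst = src1 op src2.
 * Fields are S-register numbers, e.g. 16 for s16. (Hypothetical type.) */
typedef struct {
    int dst;
    int src1;
    int src2;
} fp_insn;

/* Estimate cycles for a straight-line sequence, assuming a 1-cycle base
 * cost per instruction (true for vmul.f32) and the rule from the note:
 * an instruction takes one cycle longer if the instruction that follows
 * it consumes its result. (Hypothetical helper, not a pipeline model.) */
unsigned fp_seq_cycles(const fp_insn *seq, size_t n)
{
    assert(seq != NULL || n == 0);

    unsigned cycles = 0;
    for (size_t i = 0; i < n; i++) {
        cycles += 1; /* assumed base cost */

        if (i + 1 < n) {
            const fp_insn *next = &seq[i + 1];
            /* Penalty: the following instruction reads this result. */
            if (next->src1 == seq[i].dst || next->src2 == seq[i].dst) {
                cycles += 1;
            }
        }
    }
    return cycles;
}
```

With four copies of {16, 16, 16} (my original "vmul.f32 s16,s16,s16" chain) this gives 7 cycles: two for each of the first three and one for the last, which has nothing following it. Alternating {16, 16, 16} and {17, 17, 17} gives 4 cycles, because no instruction reads the result of the one immediately before it. So a test that wants to see the 1-cycle figure should interleave independent registers.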