2024-04-16 11:52 AM
This should be simple, but I can't find documentation. I am running an STM32L4P5 at max clock speed (120 MHz), and I have moved all code to SRAM so it is running with zero wait states. I have an extensive calculation to do using the FPU -- I am optimizing retained constants and such in the FPU registers.
I am wondering if any of the FPU instructions take longer than a single clock tick to execute. Divide? Square root? Are there any inserted wait states?
If there is a solid answer and it is in the documentation, please point me to where it is documented. Thanks!
2024-04-16 12:05 PM
A clock interrupting at kHz rates is not suitable for measuring nanoseconds of elapsed time.
SysTick is a 24-bit Down Counter, typically at 1/8th the MCU clock.
For precise cycle counts use the DWT's CYCCNT instead. It's a full range 32-bit Up Counter so easy to delta the counts.
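A minimal sketch of the CYCCNT approach, in case it helps. The register addresses are the standard ARMv7-M debug addresses (DEMCR, DWT_CTRL, DWT_CYCCNT); the function names are made up for illustration. With a CMSIS device header you would normally write `CoreDebug->DEMCR`, `DWT->CTRL` and `DWT->CYCCNT` instead of the raw addresses.

```c
#include <stdint.h>

/* Cycles between two CYCCNT snapshots. Unsigned subtraction is modulo 2^32,
   so the result stays correct across a single counter rollover. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

#if defined(__ARM_ARCH_7EM__)   /* only built for Cortex-M4/M7 targets */
#define DEMCR       (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL    (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT  (*(volatile uint32_t *)0xE0001004u)

/* Hypothetical helper: enable trace, clear the counter, start counting. */
static inline void cyccnt_start(void)
{
    DEMCR     |= (1u << 24);   /* TRCENA: enable the DWT unit */
    DWT_CYCCNT = 0u;           /* reset the cycle counter */
    DWT_CTRL  |= (1u << 0);    /* CYCCNTENA: start counting */
}

/* Hypothetical helper: read the current cycle count. */
static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT;
}
#endif
```

Typical use would be to take `t0 = cyccnt_read()` before the code under test and call `cycles_elapsed(t0, cyccnt_read())` after it. Timing an empty region first gives the measurement overhead to subtract.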
I'm not sure I have a table of FPU cycle counts, but the FPU runs concurrently with the MCU and has its own pipeline, so it would be more helpful to look at throughput. You could probably build a test with a chained dependency if you want to measure the worst case.
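One way such a chained-dependency test could look, as a rough sketch only. Both function names are hypothetical. In the first function every divide needs the previous result, so no two of them can overlap; in the second the divides are independent of each other. The compiler is free to reorder or fold these, so check the disassembly before trusting any cycle count taken around them.

```c
/* Worst case: each divide depends on the result of the one before it. */
float chained_divides(float x, float d, int n)
{
    for (int i = 0; i < n; i++) {
        x = x / d;               /* cannot start until the previous x is ready */
    }
    return x;
}

/* For comparison: four divides with no data dependency between them. */
float independent_divides(float a, float b, float c, float e, float d)
{
    float qa = a / d;
    float qb = b / d;
    float qc = c / d;
    float qe = e / d;
    return (qa + qb) + (qc + qe);
}
```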
2024-04-16 12:41 PM - edited 2024-04-16 12:44 PM
Instruction timing on ARM cores is not so easy to find. :)
Example: on M7, double float, square root: 28...30 ticks
+ on M7, single float, square root: 14 ticks
see: https://www.quinapalus.com/cm7cycles.html
+ on M4, single float, square root: 14 ticks
see: https://www.engr.scu.edu/~dlewis/book3/docs/Cortex-M4%20Proc%20Tech%20Ref%20Manual.pdf
chapter 7.2.3.
+ Cortex-M
https://web.eecs.umich.edu/~prabal/teaching/eecs373-f10/readings/ARMv7-M_ARM.pdf
+ ...
2024-04-16 12:51 PM
https://developer.arm.com/documentation/ddi0439/b/BEHJADED
2024-04-16 02:54 PM
You guys are great. This is what I expected from my benchmarking on the scope. I suspected that the expression (A*B*C)/(D*E*F) is far more efficient when coded as:
ANS = A*B, ANS = ANS*C, S0 = D*E, S0 = S0*F, ANS = ANS/S0
than by
ANS = A*B, ANS = ANS*C, ANS = ANS/D, ANS = ANS/E, ANS = ANS/F
f32 multiplies take 1 clock tick, f32 divides take 14 clock ticks, so clearly the former is faster.
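For anyone following along, the two orderings written out in C as a sketch (the function names are invented for illustration). Using the figures above of 1 tick per f32 multiply and 14 per f32 divide, the first form costs roughly 4*1 + 14 = 18 ticks of FPU arithmetic and the second roughly 2*1 + 3*14 = 44. The two forms can differ in the last bits of the result, and the product D*E*F can overflow or underflow in cases where dividing one factor at a time would not.

```c
/* (A*B*C)/(D*E*F) coded with four multiplies and a single divide. */
float ratio_one_divide(float a, float b, float c, float d, float e, float f)
{
    float ans = a * b;
    ans = ans * c;
    float den = d * e;
    den = den * f;
    return ans / den;
}

/* The same expression coded with two multiplies and three divides. */
float ratio_three_divides(float a, float b, float c, float d, float e, float f)
{
    float ans = a * b;
    ans = ans * c;
    ans = ans / d;
    ans = ans / e;
    ans = ans / f;
    return ans;
}
```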
If either of you pass through Cedar City, Utah, I'll buy you a beer. Thanks! Jeff
2024-04-16 02:58 PM
Another question on the same topic. If the FPU f32 divide and square root instructions take 14 clock ticks, and the FPU is a true co-processor with a separate pipeline, does that mean that I can launch a divide (or square root) and do 14 ticks' worth of instructions without the CPU stalling, as long as none of those instructions use the FPU?
2024-04-16 03:02 PM
Never mind, the same documentation page answered that second question in the footnotes -- they DO proceed in parallel, apparently. I'll have to interleave FPU and non-FPU functions cleverly to get more speed.
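A rough sketch of what that interleaving could look like in C. The struct and function names are hypothetical. The divide is written first and its result is not used until after the integer loop, which gives the compiler the chance to issue the divide early and run the integer work while it completes. The compiler decides the final instruction order, so this is only a hint; the disassembly and a CYCCNT measurement are the real check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: one FPU quotient plus one integer checksum. */
typedef struct {
    float    quotient;
    uint32_t checksum;
} overlap_result;

overlap_result divide_with_integer_work(float num, float den,
                                        const uint8_t *buf, size_t len)
{
    overlap_result r;

    float q = num / den;          /* written first; result not needed yet */

    uint32_t sum = 0u;            /* integer-only work, no FPU instructions */
    for (size_t i = 0; i < len; i++) {
        sum += buf[i];
    }

    r.quotient = q;               /* first use of the divide result */
    r.checksum = sum;
    return r;
}
```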