Do all FPU operations proceed in a single SYSCLK tick?

JCase.1 · ‎2024-04-16

This should be simple, but I can't find documentation. I am running an STM32L4P5 at max clock speed (120 MHz), and I have moved all code to SRAM so it is running with zero wait states. I have an extensive calculation to do using the FPU -- I am optimizing retained constants and such in the FPU registers.

I am wondering if any of the FPU instructions take longer than a single clock tick to execute. Divide? Square root? Are there any inserted wait states?

If there is a solid answer and it is in the documentation, please point me to where it is documented. Thanks!

Tesla DeLorean · ‎2024-04-16

https://developer.arm.com/documentation/ddi0439/b/BEHJADED

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..


Tesla DeLorean · ‎2024-04-16

A clock interrupting at kHz rates is not suitable for measuring nanoseconds of elapsed time.

SysTick is a 24-bit Down Counter, typically at 1/8th the MCU clock.

For precise cycle counts use the DWT's CYCCNT instead. It's a full range 32-bit Up Counter so easy to delta the counts.
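A minimal sketch of the CYCCNT approach, assuming an ARMv7E-M part (Cortex-M4/M7) and raw architectural register addresses; the helper names `cyccnt_enable`, `cyccnt_read`, and `cycles_elapsed` are made up for illustration, and a CMSIS project would normally use `CoreDebug->DEMCR`, `DWT->CTRL`, and `DWT->CYCCNT` instead:

```c
#include <stdint.h>

/* Elapsed cycles between two CYCCNT samples. Unsigned subtraction is
   modulo 2^32, so a single wrap of the counter is handled automatically. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

#if defined(__ARM_ARCH_7EM__)   /* only compiled for Cortex-M4/M7 targets */
/* Architectural debug register addresses (assumed, not taken from the thread). */
#define DEMCR_REG       (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL_REG    (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT_REG  (*(volatile uint32_t *)0xE0001004u)

/* Hypothetical helper: power up the DWT and start the cycle counter. */
static inline void cyccnt_enable(void)
{
    DEMCR_REG      |= (1u << 24);  /* TRCENA: enable the DWT unit */
    DWT_CYCCNT_REG  = 0u;          /* start counting from zero */
    DWT_CTRL_REG   |= (1u << 0);   /* CYCCNTENA: enable the cycle counter */
}

/* Hypothetical helper: sample the free-running 32-bit up counter. */
static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT_REG;
}
#endif
```

Typical use would be to sample `cyccnt_read()` before and after the code under test and pass both samples to `cycles_elapsed()`; measuring an empty region first gives the read overhead to subtract.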

Not sure I have a table of FPU cycles, but the FPU runs concurrently with the MCU and has its own pipeline, so it would be more helpful to look at throughput. You could probably build a test with a chained dependency if you want to measure the worst case.
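One way to read the "chained dependency" idea, as a rough sketch only: every divide consumes the result of the previous one, so no two divides can overlap and the loop time approximates per-instruction latency rather than throughput. The function name `chained_divides` is hypothetical, and the compiler output should be checked to confirm that a hardware divide is actually emitted in the loop.

```c
/* Hypothetical worst-case kernel: a serial chain of float divides.
   Each iteration depends on the previous result, so the divides
   cannot be overlapped with one another. */
float chained_divides(float x, float d, unsigned n)
{
    for (unsigned i = 0; i < n; ++i) {
        x = x / d;   /* result feeds the next divide */
    }
    return x;
}
```

Timing this loop for a large `n` and dividing by `n` gives an estimate that still includes loop overhead, so it is an upper bound rather than an exact instruction latency.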


AScha.3 · ‎2024-04-16

Instruction timing on ARM cores ...not so easy to find. :)

Example: on M7, double float, square root: 28...30 ticks

+ on M7, single float, square root: 14 ticks

see: https://www.quinapalus.com/cm7cycles.html

+ on M4, single float, square root: 14 ticks

see: https://www.engr.scu.edu/~dlewis/book3/docs/Cortex-M4%20Proc%20Tech%20Ref%20Manual.pdf

chapter 7.2.3.

+ Cortex-M

https://web.eecs.umich.edu/~prabal/teaching/eecs373-f10/readings/ARMv7-M_ARM.pdf

+ ...

https://developer.arm.com/Architectures#aq=%40navigationhierarchiescategories%3D%3D%22Architecture%20products%22%20AND%20%40navigationhierarchiescontenttype%3D%3D%22Product%20Information%22&numberOfResults=48&f-navigationhierarchiesprocessortype=Instruction%20Set%20Architectures


JCase.1 · ‎2024-04-16

You guys are great. This is what I expected from my benchmarking on the scope. I suspected that the expression (A*B*C)/(D*E*F) is far more efficient when coded as:

ANS = A*B, ANS = ANS*C, S0 = D*E, S0 = S0*F, ANS = ANS/S0

than by

ANS = A*B, ANS = ANS*C, ANS = ANS/D, ANS = ANS/E, ANS = ANS/F

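f32 multiplies take 1 clock tick and f32 divides take 14 clock ticks, so the former (4 multiplies and 1 divide, roughly 18 ticks) is clearly faster than the latter (2 multiplies and 3 divides, roughly 44 ticks).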
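The two orderings can be sketched in C as follows. The function names are hypothetical, and `estimate_cycles` simply plugs in the 1-tick multiply and 14-tick divide figures quoted in this thread, ignoring pipeline effects and load/store overhead:

```c
/* Form 1: multiply out numerator and denominator, then divide once. */
float ratio_one_divide(float a, float b, float c, float d, float e, float f)
{
    float ans = a * b;
    ans = ans * c;
    float s0 = d * e;
    s0 = s0 * f;
    return ans / s0;          /* the only divide */
}

/* Form 2: divide by each denominator term in turn (three divides). */
float ratio_three_divides(float a, float b, float c, float d, float e, float f)
{
    float ans = a * b;
    ans = ans * c;
    ans = ans / d;
    ans = ans / e;
    ans = ans / f;
    return ans;
}

/* Back-of-envelope cost using the thread's figures (assumed, not measured). */
unsigned estimate_cycles(unsigned multiplies, unsigned divides)
{
    return multiplies * 1u + divides * 14u;
}
```

The two forms are not bit-identical in general: they round at different points, and the product D*E*F can overflow or underflow in cases where the step-by-step division would not.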

If either of you passes through Cedar City, Utah, I'll buy you a beer. Thanks! Jeff

JCase.1 · ‎2024-04-16

Another question on the same topic. If the FPU f32 divide and square root instructions take 14 clock ticks, and the FPU is a true co-processor with a separate pipeline, does that mean that I can launch a divide (or square root) and do 14 ticks' worth of instructions without the CPU stalling, as long as none of those instructions use the FPU?

JCase.1 · ‎2024-04-16

Never mind, the same documentation page answered that second question in the footnotes -- they DO proceed in parallel, apparently. I'll have to interleave FPU and non-FPU functions cleverly to get more speed.
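A rough sketch of that interleaving idea, under the assumption that integer-only instructions can proceed while a divide is in flight: start the divide, do integer-only work, and touch the quotient only afterwards. The function name `divide_with_overlap` and its parameters are hypothetical. A C compiler is free to reorder these statements, so the generated assembly needs checking (or the sequence needs writing in assembly) before relying on the overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: issue a float divide, then do integer-only work
   (summing an array) before the quotient is first used. */
float divide_with_overlap(float num, float den,
                          const uint32_t *data, size_t n,
                          uint32_t *sum_out)
{
    float q = num / den;            /* intended to start the long-latency divide */

    uint32_t sum = 0u;
    for (size_t i = 0; i < n; ++i) {
        sum += data[i];             /* integer-only work, no FPU instructions */
    }
    *sum_out = sum;

    return q;                       /* first use of the quotient */
}
```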