Do all FPU operations proceed in a single SYSCLK tick?

JCase.1
Associate III

This should be simple, but I can't find documentation. I am running an STM32L4P5 at max clock speed (120 MHz), and I have moved all code to SRAM so it is running with zero wait states. I have an extensive calculation to do using the FPU, and I am optimizing it by keeping retained constants and such in the FPU registers.

I am wondering if any of the FPU instructions take longer than a single clock tick to execute. Divide? Square root? Are there any inserted wait states?

If there is a solid answer and it is in the documentation, please point to where it is documented. Thanks!

Accepted Solutions

https://developer.arm.com/documentation/ddi0439/b/BEHJADED

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

6 REPLIES

A clock interrupting at kHz rates is not suitable for measuring nanoseconds of elapsed time.

SysTick is a 24-bit Down Counter, typically at 1/8th the MCU clock.

For precise cycle counts use the DWT's CYCCNT instead. It's a full-range 32-bit up counter, so it's easy to take the delta of two counts.
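A minimal sketch of the CYCCNT approach, assuming a Cortex-M4 target and using the architectural ARMv7-M register addresses directly (CMSIS exposes the same registers as `CoreDebug->DEMCR`, `DWT->CTRL` and `DWT->CYCCNT`). The helper names `cyccnt_start`, `cyccnt_read` and `cyccnt_delta` are made up for illustration, not part of any library.

```c
#include <stdint.h>

/* Architectural ARMv7-M debug register addresses. */
#define DEMCR_REG      (*(volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL_REG   (*(volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT_REG (*(volatile uint32_t *)(uintptr_t)0xE0001004u)

/* Enable the trace block and start the free-running cycle counter.
 * Only meaningful on the target; these addresses do not exist on a host PC. */
static inline void cyccnt_start(void)
{
    DEMCR_REG |= (1u << 24);   /* TRCENA: enable the DWT unit */
    DWT_CYCCNT_REG = 0u;       /* reset the count */
    DWT_CTRL_REG |= 1u;        /* CYCCNTENA: start counting core cycles */
}

/* Read the current cycle count. */
static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT_REG;
}

/* Elapsed cycles between two readings. Unsigned subtraction is modulo 2^32,
 * so a single counter wraparound between the readings is handled. */
static inline uint32_t cyccnt_delta(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On the target you would call `cyccnt_start()` once, read the counter before and after the code under test, and pass both readings to `cyccnt_delta()`. Timing an empty pair of reads first gives the measurement overhead to subtract.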

Not sure I have a table of FPU cycles, but the FPU runs concurrently with the MCU and has its own pipeline, so it would be more helpful to look at throughput. Although you could probably construct something with a chained dependency if you want to measure the worst case.

AScha.3
Chief II

Instruction timing on ARM cores ... not so easy to find. 🙂

Example: on M7, double float, square root: 28...30 ticks

+ on M7, single float, square root: 14 ticks

see: https://www.quinapalus.com/cm7cycles.html

+ on M4, single float, square root: 14 ticks

see: https://www.engr.scu.edu/~dlewis/book3/docs/Cortex-M4%20Proc%20Tech%20Ref%20Manual.pdf, chapter 7.2.3.

+ Cortex-M

https://web.eecs.umich.edu/~prabal/teaching/eecs373-f10/readings/ARMv7-M_ARM.pdf

+ ...

https://developer.arm.com/Architectures#aq=%40navigationhierarchiescategories%3D%3D%22Architecture%20products%22%20AND%20%40navigationhierarchiescontenttype%3D%3D%22Product%20Information%22&numberOfResults=48&f-navigationhierarchiesprocessortype=Instruction%20Set%20Architectures

 

https://developer.arm.com/documentation/ddi0439/b/BEHJADED

JCase.1
Associate III

You guys are great. This is what I expected from my benchmarking on the scope. I suspected that the expression (A*B*C)/(D*E*F) is far more efficient when coded as:

              ANS = A*B,  ANS = ANS*C,  S0 = D*E,  S0 = S0*F,  ANS = ANS/S0

than as

              ANS = A*B,  ANS = ANS*C,  ANS = ANS/D,  ANS = ANS/E,  ANS = ANS/F

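f32 multiplies take 1 clock tick and f32 divides take 14 clock ticks, so clearly the former is faster: 4 multiplies plus 1 divide is about 18 ticks, versus 2 multiplies plus 3 divides at about 44 ticks.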
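The two codings can be sketched in C as below, with a rough tick estimate built from the 1-tick multiply and 14-tick divide figures quoted in this thread. The function names are made up for illustration, and the estimate is an assumption that ignores loads, stores and pipeline hazards, so treat it as a back-of-envelope number and not a measurement.

```c
/* Per-instruction costs as quoted in this thread for f32 on the M4 FPU. */
enum { MUL_TICKS = 1, DIV_TICKS = 14 };

/* (A*B*C)/(D*E*F) coded with a single divide: 4 multiplies + 1 divide. */
static float ratio_one_divide(float a, float b, float c,
                              float d, float e, float f)
{
    float num = a * b;
    num = num * c;
    float den = d * e;
    den = den * f;
    return num / den;
}

/* Same expression coded with three divides: 2 multiplies + 3 divides. */
static float ratio_three_divides(float a, float b, float c,
                                 float d, float e, float f)
{
    float ans = a * b;
    ans = ans * c;
    ans = ans / d;
    ans = ans / e;
    ans = ans / f;
    return ans;
}

/* Rough cost model: operation counts times the quoted per-operation ticks. */
static int estimated_ticks(int n_mul, int n_div)
{
    return n_mul * MUL_TICKS + n_div * DIV_TICKS;
}
```

One caveat: the two forms are not bit-identical in general. They round at different points, and the product D*E*F can overflow or underflow in cases where dividing one factor at a time would not, so it is worth checking the value ranges before switching.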

If either of you pass through Cedar City, Utah, I'll buy you a beer. Thanks! Jeff

 

          

JCase.1
Associate III

Another question on the same topic. If the FPU f32 divide and square root instructions take 14 clock ticks, and the FPU is a true co-processor with a separate pipeline, does that mean that I can launch a divide (or square root) and do 14 ticks' worth of instructions without the CPU stalling, as long as none of those instructions use the FPU?

JCase.1
Associate III

Never mind, the same documentation page answered that second question in the footnotes: they DO proceed in parallel, apparently. I'll have to interleave FPU and non-FPU functions cleverly to get more speed.
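A sketch of what that interleaving might look like at the C level: issue the divide first, do integer-only work that does not touch the FPU, and only then consume the quotient. The function name `divide_while_summing` is made up for illustration. Whether the compiler actually emits the divide before the integer loop is an assumption, so check the disassembly and confirm with a cycle count on the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Start an f32 divide, then do integer-only work before the quotient is used.
 * The intent is that the integer loop runs while the divide is in flight. */
float divide_while_summing(float num, float den,
                           const uint32_t *data, size_t n,
                           uint32_t *sum_out)
{
    float q = num / den;          /* divide is issued here, ideally */

    uint32_t sum = 0u;            /* integer-only work, no FPU registers */
    for (size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    *sum_out = sum;

    return q;                     /* first use of the divide result */
}
```

If the compiler moves the divide after the loop, the overlap is lost. Writing the hot section in assembly or inspecting the optimized output are the usual ways to pin the ordering down.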