2010-04-21 11:34 AM
Duration of FLOAT operations
#stm32
2011-05-17 04:48 AM
It depends on the operations, the numbers (0, +/-INF, NaN, etc.), and the frequency of the CPU.
You should benchmark your own code, using values typical for your application. Use the cycle counter in the Cortex-M3 trace unit, available to you in the STM32. On a CPU running at 72 MHz a cycle is about 14 ns.

    unsigned int cyc[2];   // unsigned, to match the 32-bit counter register
    double x;

    // Joseph Yiu's method
    // From http://forums.arm.com/index.php?showtopic=13949
    volatile unsigned int *DWT_CYCCNT  = (volatile unsigned int *)0xE0001004; // address of the register
    volatile unsigned int *DWT_CONTROL = (volatile unsigned int *)0xE0001000; // address of the register
    volatile unsigned int *SCB_DEMCR   = (volatile unsigned int *)0xE000EDFC; // address of the register

    *SCB_DEMCR = *SCB_DEMCR | 0x01000000;   // set TRCENA to enable the trace unit
    *DWT_CYCCNT = 0;                        // reset the counter
    *DWT_CONTROL = *DWT_CONTROL | 1;        // enable the counter

    #define STOPWATCH_START { cyc[0] = *DWT_CYCCNT; }
    #define STOPWATCH_STOP  { cyc[1] = *DWT_CYCCNT; cyc[1] = cyc[1] - cyc[0]; }

    x = 10.0;
    STOPWATCH_START
    x *= 10.0;
    x *= 10.0;
    x *= 10.0;
    x *= 10.0;
    x *= 10.0;
    STOPWATCH_STOP
    printf("%u cycles for 5 double multiplies\n", cyc[1]);
2011-05-17 04:48 AM
Thank you, Clive, for the example.
However, I thought that the limits (min, max) for particular operations were known. I am afraid that a CPU cycle effectively takes longer when code is executed from flash with some memory latency. Of course, I know I can toggle a GPIO pin and measure the time using a scope.
2011-05-17 04:48 AM
Unless I'm mistaken, the cycle count is tied to the system clock. If the flash is adding wait state(s), they are burning cycles which are still accounted for in this register.
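Because the counter ticks with the system clock, turning two readings into a time is a subtraction and a division by the clock frequency. A minimal sketch, assuming a free-running 32-bit counter; the helper names elapsed_cycles and cycles_to_ns are made up for illustration and are not part of any ST or ARM library:

```c
#include <stdint.h>

/* Hypothetical helper: cycles between two DWT_CYCCNT readings.
   Unsigned subtraction stays correct across one wrap of the 32-bit counter. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t stop)
{
    return stop - start;
}

/* Hypothetical helper: convert a cycle count to nanoseconds for a given
   core clock in Hz. The 64-bit intermediate avoids overflow. */
static uint64_t cycles_to_ns(uint32_t cycles, uint32_t clock_hz)
{
    return ((uint64_t)cycles * 1000000000ull) / clock_hz;
}
```

At 72 MHz, cycles_to_ns(72, 72000000u) gives 1000, i.e. 72 cycles per microsecond. The counter wraps roughly once a minute at that clock, so this only works for intervals shorter than that.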
Contact IAR support if you want to know how their float library works.
2011-05-17 04:48 AM
What dolezal.ivan is asking is how fast IAR's FP library is. IAR floats are 32 bits, perhaps using the hidden bit approach as done in the 8087 math coprocessor now integrated into modern x86 processors (24 bits of precision, decimal exponent range roughly 10^(-44) to 10^38). I assume your accuracy / dynamic range requirements are met by this format.
dolezal.ivan and clive1: Yes, the cycle count is in system clock counts. At least on 72 MHz Cortex-M3s (two wait states required) there is a prefetch that allows running code to execute without being impacted by wait states. Successful (taken) branches are delayed a little. Data fetches from flash must also wait, but most code does not do that very often.
Joseph Yiu's method should work, but I never seem to have a device that can accept printf. Using the debugger and looking at registers or RAM works for me.
More realistic code between start and stop might be performing a DTMF conversion using 100 points sampled at 8 kHz. Be sure to run the second harmonics to weed out non-DTMF sounds. As I remember, when I thought I was going to get paid for such code, a 200-point DTMF took 10 milliseconds. Thus 100 points should come in at about 6 milliseconds. (6 > 10/2 because there is a common tail after the accumulations.)
Even better would be your own FP application code between the start / stop pair. Note that addition may take longer than multiplication.
2011-05-17 04:48 AM
Yeah, I'd probably time something longer and more useful. It is always better to avoid division or square roots. Like I said, time something useful/relevant to you; chances are there are algorithmic changes that are more critical to improving timing.
The flash buffering on the STM32 is pretty effective; most SoC flash clocks in at around 30-35 ns to read whatever bit-width the memory is implemented with. Without a printf I'd just use a hex output, or a simple decimal decode in assembler; it isn't that hard. The GPIO method is also good, I just find it easier to instrument the code.
2011-05-17 04:48 AM
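The hex output mentioned in the post above needs only a shift, a mask and a table lookup per digit. A minimal sketch in C rather than assembler; the name u32_to_hex is hypothetical, and how the resulting characters reach a UART or a RAM buffer is left to the application:

```c
#include <stdint.h>

/* Hypothetical helper: write a 32-bit value as 8 uppercase hex digits
   plus a terminating NUL into out, which must hold at least 9 chars. */
static void u32_to_hex(uint32_t value, char out[9])
{
    static const char digits[] = "0123456789ABCDEF";

    for (int i = 7; i >= 0; i--) {
        out[i] = digits[value & 0xFu];  /* lowest nibble goes rightmost */
        value >>= 4;
    }
    out[8] = '\0';
}
```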
With h/w integer divide, s/w FP mul and div come out about the same, save for the few extra clocks taken by the h/w integer divide.
Sqrt uses 4 loops of Newton's method after making the power-of-2 exponent even. (The mantissa will be in 0.5 .. 2.0.) Time sqrt if it is needed in your application.
2011-05-17 04:48 AM
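The square-root approach described in the post above can be sketched in portable C. This is only an illustration of the idea, not IAR's implementation: the starting guess, the use of frexpf/ldexpf, and the name sqrt_newton are all assumptions, and NaN and infinity are not handled.

```c
#include <math.h>

/* Sketch only, not IAR's code: square root by Newton's method.
   Inputs <= 0 return 0 here to keep the example short. */
static float sqrt_newton(float x)
{
    int e;
    float m, y;

    if (x <= 0.0f)
        return 0.0f;

    m = frexpf(x, &e);          /* x = m * 2^e with m in [0.5, 1) */
    if (e % 2 != 0) {           /* make the exponent even */
        m *= 2.0f;              /* m is now in [1, 2) */
        e -= 1;
    }

    y = 0.5f * (1.0f + m);      /* assumed starting guess */
    for (int i = 0; i < 4; i++)
        y = 0.5f * (y + m / y); /* Newton step for y*y = m */

    return ldexpf(y, e / 2);    /* sqrt(m) * 2^(e/2) */
}
```

Each Newton step in this form contains a division, which is one reason a software sqrt is worth timing before relying on it in an inner loop.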
Division operations are generally slower, hardware assisted (~12 cycles) or not. The suggestion is to multiply with reciprocals where possible. The compiler may do this for you if it is obvious, but you know your algorithm better.
i.e. instead of (x / 3) use (x * (1.0f / 3.0f)), so that the reciprocal is a constant and only a multiply remains at run time. Unit conversions are a typical case.
One assumes the author(s) of the float library are familiar with various square root methods and wrote them efficiently. I'm familiar with, and have implemented, a few. The suggestion was to avoid them when you can, i.e. if X > Y then it follows that sqrt(X) > sqrt(Y) for positive values, so equality and magnitude tests can be done without performing the operation(s) (one or both) when the result isn't consequential or propagated. Distance to the closest neighbour is an example.
Here are some speed tables that purport to show the speed of an IAR floating point library.
http://www.smxrtos.com/ussw/gofast/gofast_thumb2_iar.htm
2011-05-17 04:48 AM
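The closest-neighbour case from the post above can be sketched as follows: since sqrt is monotonic for positive values, comparing squared distances picks the same point without ever calling sqrt. The point2 type and the function name nearest_index are hypothetical, introduced only for this example:

```c
#include <stddef.h>

typedef struct { float x, y; } point2;   /* hypothetical 2-D point type */

/* Return the index of the point in pts closest to q, comparing squared
   distances so no square root is needed. Returns 0 when n is 0, so the
   caller should check n first. */
static size_t nearest_index(const point2 *pts, size_t n, point2 q)
{
    size_t best = 0;
    float best_d2 = 0.0f;

    for (size_t i = 0; i < n; i++) {
        float dx = pts[i].x - q.x;
        float dy = pts[i].y - q.y;
        float d2 = dx * dx + dy * dy;    /* squared distance */

        if (i == 0 || d2 < best_d2) {
            best = i;
            best_d2 = d2;
        }
    }
    return best;
}
```

If the actual distance is needed afterwards, a single sqrt on the winning squared distance replaces one sqrt per candidate.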
https://my.st.com/public/STe2ecommunities/mcu/Lists/ARM%20CortexM3%20STM32/Flat.aspx?RootFolder=https://my.st.com/public/STe2ecommunities/mcu/Lists/ARM CortexM3 STM32/Compilling with float&FolderCTID=0x01200200770978C69A1141439FE559EB459D758000626BE2B829C32145B9EB5739142DC17E&currentviews=4600
with interesting benchmarks done two years ago.
Cheers, STOne-
2011-05-17 04:48 AM
Thanks to all of you for the information and links.
I remembered that EWARM has a simulator, so I simulated the code below:

    float z, x = 1.23, y = 5.67E-3, u = 89.12E2, v = 3.45678;

    float oper(float a, float b, float c, float d)
    {
        return (a + b*c/d);
    }

    int main(void)
    {
        z = oper(x,y,u,v);
        return 0;
    }

I did not find a stopwatch in the simulator, but the Function Profiler shows:

    addition        30 cycles
    multiplication  26 cycles
    division        37 cycles