Duration of FLOAT operations

ivan239955_stm1_stmicro_com · ‎2010-04-21

Posted on April 21, 2010 at 20:34

#stm32

Tesla DeLorean · ‎2011-05-17

Posted on May 17, 2011 at 13:48

It depends on the operations, the operand values (0, +/-INF, NaN, etc.), and the frequency of the CPU.

You should benchmark your own code, using values typical for your application.

Use the cycle counter in the Cortex-M3 trace unit, available to you in the STM32. On a CPU running at 72 MHz a cycle is about 14 ns.

int cyc[2];

double x;

// Joseph Yiu's method

// From http://forums.arm.com/index.php?showtopic=13949

volatile unsigned int *DWT_CYCCNT = (volatile unsigned int *)0xE0001004; // DWT cycle count register

volatile unsigned int *DWT_CONTROL = (volatile unsigned int *)0xE0001000; // DWT control register

volatile unsigned int *SCB_DEMCR = (volatile unsigned int *)0xE000EDFC; // debug exception and monitor control register

*SCB_DEMCR = *SCB_DEMCR | 0x01000000; // set TRCENA (bit 24) to enable the DWT unit

*DWT_CYCCNT = 0; // reset the counter

*DWT_CONTROL = *DWT_CONTROL | 1 ; // set CYCCNTENA (bit 0) to enable the counter

#define STOPWATCH_START { cyc[0] = *DWT_CYCCNT;}

#define STOPWATCH_STOP { cyc[1] = *DWT_CYCCNT; cyc[1] = cyc[1] - cyc[0]; }

x = 10.0;

STOPWATCH_START

x *= 10.0;

STOPWATCH_STOP

printf("%d cycles for 1 double multiply\n", cyc[1]);
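One detail worth handling when reusing the stopwatch above: DWT_CYCCNT is a free-running 32-bit counter, so at 72 MHz it wraps roughly every 59.6 seconds. A minimal sketch, assuming the two readings are kept as unsigned 32-bit values; the helper names `elapsed_cycles` and `net_cycles` are made up for illustration and are not part of the original method:

```c
#include <stdint.h>

/* Elapsed cycles between two DWT_CYCCNT readings.
   Unsigned subtraction is modulo 2^32, so one counter wraparound
   between the two readings still gives the right difference. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t stop)
{
    return (uint32_t)(stop - start);
}

/* Remove the cost of an empty START/STOP pair, which has to be
   measured once on the target. Clamps at zero. */
static uint32_t net_cycles(uint32_t measured, uint32_t overhead)
{
    return (measured > overhead) ? (measured - overhead) : 0u;
}
```

On the target you would call it as `elapsed_cycles(start, *DWT_CYCCNT)`. This only covers a single wrap; intervals longer than one counter period need a different timer.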

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

ivan239955_stm1_stmicro_com · ‎2011-05-17

Posted on May 17, 2011 at 13:48

Thank you, Clive, for the example.

However, I thought that the limits (min, max) for particular operations were known.

I am afraid that execution takes longer when the code runs from flash with some memory latency.

Of course, I know I can toggle a GPIO pin and measure the time with a scope.

Tesla DeLorean · ‎2011-05-17

Posted on May 17, 2011 at 13:48

Unless I'm mistaken, the cycle count is tied to the system clock. If the flash is adding wait state(s), they burn cycles which are still accounted for in this register.

Contact IAR support if you want to know how their float library works.


picguy2 · ‎2011-05-17

Posted on May 17, 2011 at 13:48

What dolezal.ivan is asking is how fast IAR's FP library is. IAR floats are 32 bits, perhaps using the hidden-bit approach as done in the 8087 math coprocessor, now integrated into modern x86 processors (24 bits of precision, decimal exponent range roughly 10^(-44) to 10^38). I assume your accuracy and dynamic range requirements are met by this format.

dolezal.ivan and clive1: Yes, the cycle count is system clock counts. At least on 72 MHz Cortex-M3s (two wait states required) there is a prefetch that allows running code to execute without being impacted by wait states. Taken branches are delayed a little. Data fetches from flash must also wait, but most code does not do that very often.

Joseph Yiu's method should work, but I never seem to have a device that can accept printf. Using the debugger to look at registers or RAM works for me.

More realistic code between start and stop might be performing a DTMF conversion using 100 points sampled at 8 kHz. Be sure to run the second harmonics to weed out non-DTMF sounds. As I remember, when I thought I was going to get paid for such code, a 200-point DTMF took 10 milliseconds. Thus 100 points should come in at about 6 milliseconds. (6 > 10/2 because there is a common tail after the accumulations.)

Even better would be your FP application code between the start / stop pair. Note that addition may take longer than multiply.

Tesla DeLorean · ‎2011-05-17

Posted on May 17, 2011 at 13:48

Yeah, I'd probably time something longer and more useful. It is always better to avoid division or square roots. Like I said, time something useful/relevant to you; chances are there are algorithmic changes that matter more for timing.

The flash buffering on the STM32 is pretty effective. Most SoC flash clocks in at around 30-35 ns to read whatever bit-width the memory is implemented with.

Without a printf I'd just use a hex output, or a simple decimal decode in assembler; it isn't that hard.

The GPIO method is also good, I just find it easier to instrument the code.


picguy2 · ‎2011-05-17

Posted on May 17, 2011 at 13:48

With h/w integer divide, s/w FP mul and div come out the same, save for the few extra clocks taken by the h/w integer divide.

Sqrt uses 4 iterations of Newton's method after making the power-of-2 exponent even. (The mantissa will be in 0.5 .. 2.0.) Time sqrt if it is needed in your application.
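The approach described above can be sketched as follows. This is a hypothetical illustration of the idea only, not IAR's actual library code: the function name `newton_sqrtf`, the initial guess and the use of `frexpf`/`ldexpf` for exponent handling are all assumptions, and only finite positive inputs are handled.

```c
#include <math.h>

/* Hypothetical sketch: square root by halving an even exponent and
   running Newton iterations on the mantissa.
   Assumes x is finite and > 0; zero, negatives, INF and NaN are not handled. */
static float newton_sqrtf(float x)
{
    int e;
    float m = frexpf(x, &e);        /* x = m * 2^e with m in [0.5, 1) */
    if (e % 2 != 0) {               /* make the exponent even */
        m *= 2.0f;                  /* m is now in [1, 2) */
        e -= 1;
    }
    float y = 0.5f * (1.0f + m);    /* crude initial guess for sqrt(m) */
    for (int i = 0; i < 4; ++i) {
        y = 0.5f * (y + m / y);     /* Newton step for y * y = m */
    }
    return ldexpf(y, e / 2);        /* sqrt(x) = sqrt(m) * 2^(e/2) */
}
```

Each Newton step costs one divide, which is one reason a software sqrt tends to be expensive and worth timing separately.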

Tesla DeLorean · ‎2011-05-17

Posted on May 17, 2011 at 13:48

Division operations are generally slower, hardware assisted (~12 cycles) or not. The suggestion is to multiply with reciprocals where possible. The compiler may do this for you if it is obvious, but you know your algorithm better.

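i.e. instead of (x / 3) use (x * (1.0f / 3.0f)), so that the reciprocal is a single constant. Written literally as x * 1/3, C evaluates (x * 1) / 3 and still divides. Unit conversions are a typical case.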
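A minimal sketch of the reciprocal idea for a unit conversion. The millimetre-to-inch example and all names here are made up for illustration. Note that the two versions can differ in the last bit, because the reciprocal itself is rounded:

```c
/* Hypothetical unit conversion: millimetres to inches. */
#define MM_PER_INCH 25.4f

/* Reciprocal computed once, at compile time, and kept as a single factor. */
static const float INCH_PER_MM = 1.0f / MM_PER_INCH;

/* One floating point divide per call. */
static float mm_to_inch_div(float mm)
{
    return mm / MM_PER_INCH;
}

/* One floating point multiply per call. */
static float mm_to_inch_mul(float mm)
{
    return mm * INCH_PER_MM;
}
```

Whether the multiply is actually faster with a given software float library is something to confirm with the cycle counter rather than assume.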

One assumes the author(s) of the float library are familiar with various square root methods and wrote them efficiently. I'm familiar with, and have implemented a few. The suggestion was to avoid them when you can.

i.e. if X > Y then it follows that sqrt(X) > sqrt(Y) for positive values, so equality and magnitude tests can be done without performing the operation(s) (one or both) when the result isn't consequential or propagated. Example: distance to the closest neighbour.
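The closest-neighbour case might look like the sketch below, which compares squared distances and never calls sqrt. The `point2f` type and the function names are hypothetical, chosen only for this example:

```c
#include <stddef.h>

typedef struct { float x, y; } point2f;   /* hypothetical 2D point type */

/* Squared Euclidean distance; ordering matches the true distance. */
static float dist_sq(point2f a, point2f b)
{
    float dx = a.x - b.x;
    float dy = a.y - b.y;
    return dx * dx + dy * dy;
}

/* Index of the point in pts closest to q. Requires n >= 1. */
static size_t nearest_index(const point2f *pts, size_t n, point2f q)
{
    size_t best = 0;
    float best_d2 = dist_sq(pts[0], q);
    for (size_t i = 1; i < n; ++i) {
        float d2 = dist_sq(pts[i], q);
        if (d2 < best_d2) {         /* compare squares, no sqrt needed */
            best_d2 = d2;
            best = i;
        }
    }
    return best;
}
```

If the actual distance is needed at the end, a single sqrt on the winning squared distance is enough.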

Here are some speed tables that purport to show the speed of an IAR floating point library. http://www.smxrtos.com/ussw/gofast/gofast_thumb2_iar.htm


Nickname12657_O · ‎2011-05-17

Posted on May 17, 2011 at 13:48

Hi,

You can refer to this

https://my.st.com/public/STe2ecommunities/mcu/Lists/ARM%20CortexM3%20STM32/Flat.aspx?RootFolder=https://my.st.com/public/STe2ecommunities/mcu/Lists/ARM CortexM3 STM32/Compilling with float&FolderCTID=0x01200200770978C69A1141439FE559EB459D758000626BE2B829C32145B9EB5739142DC17E&currentviews=4600

with interesting benchmarks done, two years ago.

Cheers,

STOne-

ivan239955_stm1_stmicro_com · ‎2011-05-17

Posted on May 17, 2011 at 13:48

Thank you all for the information and links.

I remembered the EWARM simulator, so I simulated the code below:

float z, x = 1.23, y = 5.67E-3, u = 89.12E2, v = 3.45678;

float oper(float a, float b, float c, float d)

{

return (a + b*c/d);

}

int main(void)

{

z = oper(x,y,u,v);

return 0;

}

I did not find a stopwatch in the simulator, but its Function Profiler shows:

addition 30 cycles

multiplication 26 cycles

division 37 cycles
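To put these counts in time units, a small helper can convert cycles to nanoseconds. This is only a sketch: the name `cycles_to_ns` is made up, and the 72 MHz figure is the clock mentioned earlier in the thread, assumed here rather than taken from the simulator setup.

```c
#include <stdint.h>

/* Convert a cycle count to nanoseconds for a given core clock in Hz. */
static double cycles_to_ns(uint32_t cycles, uint32_t core_hz)
{
    return (double)cycles * 1e9 / (double)core_hz;
}
```

At an assumed 72 MHz, `cycles_to_ns(30, 72000000u)` is about 417 ns for the addition, 26 cycles is about 361 ns and 37 cycles is about 514 ns. Real hardware may differ from the simulator, so these are estimates only.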