Fast floating point math

linas2 · ‎2013-04-03

Posted on April 03, 2013 at 11:27

Hello, I ran into a problem where I need to do math with single-precision floating point.

It looks like the if(...) condition and converting the floating point value to uint16_t are very slow.

So I have two questions:

First, what is the fastest way to convert a float to an integer? (The float will be between -10 and +10, and that should correspond to 0x0000 and 0xFFFF in the integer.)

Second, how can I read the binary data of a float variable into a u32 variable, so I can make the if(...) condition faster? (I need to set the float to zero if it is more than 10.0 or less than -10.0.)

(I know that the floating point variable is in memory at address 0x2000000C, so all I need to do is copy the raw binary data to a u32 variable and check the mantissa and exponent to see if it goes off the limit. But I don't know how to copy 32-bit data to a 32-bit unsigned integer: if I try to do that with pointers I get an error, or the compiler does the conversion for me and slows my program.)
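
Editor's sketch for the two questions above, not code from the thread. It assumes IEEE-754 single precision (which the Cortex-M4 FPU uses) and that the input to the scaling step is already within -10..+10. The function names are made up for illustration. memcpy is the portable way to copy the raw bits without a value conversion, and compilers normally reduce it to a single move.

```c
#include <stdint.h>
#include <string.h>

/* Copy the raw IEEE-754 bits of a float into a uint32_t.
   No value conversion takes place, unlike (uint32_t)f. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Return 0.0f if |f| > 10.0f, using one integer compare.
   Clearing the sign bit gives |f|; for non-negative IEEE-754 floats
   the bit patterns sort in the same order as the values.
   0x41200000 is the bit pattern of 10.0f. NaN also maps to 0.0f. */
static float zero_if_out_of_range(float f)
{
    uint32_t mag = float_bits(f) & 0x7FFFFFFFu;
    return (mag > 0x41200000u) ? 0.0f : f;
}

/* Map -10.0..+10.0 linearly to 0x0000..0xFFFF, rounding to nearest.
   Assumes f is already within range. */
static uint16_t float_to_u16(float f)
{
    return (uint16_t)((f + 10.0f) * (65535.0f / 20.0f) + 0.5f);
}
```

Whether the integer compare actually beats a plain float compare on the FPU would need to be measured on the target.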

frankmeyer9 · ‎2013-04-04

Posted on April 04, 2013 at 09:03

what else can I use? It's only one byte longer than I need

Because anything shorter than int (32 bits) is less efficient, requiring additional saturation/sign-extension instructions. Better to use uint32_t instead.

I know what a DFT/DCT is, and I know the difference between a PI and PID controller.

But I still don't get what exactly you need the phase of the (which ?) modulation frequency for.

Which doesn't mean your idea isn't workable ...

At least the code looks like it is using the 'hard' ABI, so no time is spent in function calls.

linas2 · ‎2013-04-04

Posted on April 04, 2013 at 12:35

It is a laser stabilization loop, and it works very well (just tested).

u16 and u32 are slower, only u8 is fast.

I don't know about the STM32F4 core, but some DSPs can issue several instructions in one cycle (e.g. a 128-bit instruction word), so fewer bits for the counter means more work per cycle.

Now I am running at 58 kHz, and the theoretical FIFO read speed alone is 62 kHz, so I have quite good and efficient code. It still needs to calculate the angle, check the phase, run that through the PI engine and write it to the DAC.

frankmeyer9 · ‎2013-04-04

Posted on April 04, 2013 at 12:50

u16 and u32 are slower, only u8 is fast.

Normally not.

In normal cases, non-register-sized operands require additional saturation and sign-extension instructions before being used in operations.

Your 'odd' instruction timing might be caused by cache misses (ART) and pipeline stalls/flushes.

I can't give you any definitive hint, other than to experiment.

BTW, I remember the Goertzel algorithm, which calculates just one frequency point out of the whole spectrum and could be more efficient in your case. But I just can't find any reference at the moment ...
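
Editor's sketch of the single-bin idea mentioned above, not code from the thread. It uses the standard Goertzel recurrence for DFT bin k of an n-point block, with w = 2*pi*k/n. The function name and parameter layout are assumptions. Because the bin is fixed, cos(w) and sin(w) are passed in as precomputed constants, so the loop needs one multiply and two additions per sample.

```c
#include <stddef.h>

/* Goertzel: real and imaginary part of one DFT bin,
   X[k] = sum over i of x[i] * exp(-j*w*i), with w = 2*pi*k/n.
   cos_w and sin_w are cos(w) and sin(w), computed once by the caller. */
static void goertzel_bin(const float *x, size_t n,
                         float cos_w, float sin_w,
                         float *re, float *im)
{
    const float coeff = 2.0f * cos_w;
    float s1 = 0.0f;            /* s[i-1] */
    float s2 = 0.0f;            /* s[i-2] */

    for (size_t i = 0; i < n; i++) {
        float s0 = x[i] + coeff * s1 - s2;
        s2 = s1;
        s1 = s0;
    }

    /* X[k] = exp(j*w) * s[n-1] - s[n-2] */
    *re = s1 * cos_w - s2;
    *im = s1 * sin_w;
}
```

The phase then follows from atan2f(im, re). Single precision accumulates rounding error over long blocks, so the result should be checked against a reference DFT for the block length in use.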

Tesla DeLorean · ‎2013-04-04

Posted on April 04, 2013 at 13:37

u16 and u32 are slower, only u8 is fast.

This isn't true for ARM; there are extra instructions in your core loop to handle this. On x86 this causes huge problems tracking fragments of registers. Use the native register size whenever possible.

The compiler is also doing a hopeless job of keeping values in registers: it's putting all your variables in memory and doing a lot of unnecessary, and slow, reads/writes.

Not sure you need to use EOR to toggle the clock; you can use both edges of the clock and unroll the loop once.

You also need to get that spin loop out of there; you want the FPGA to buffer enough data for a burst. At the very least you should be able to place a clock edge and read data back without spinning.
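
Editor's sketch of the point about register size, not the original poster's loop. Both functions are hypothetical and compute the same sum; they differ only in the counter type. With a counter narrower than a register the compiler may have to keep the value wrapped to 8 bits (on ARM typically a UXTB in the loop), while a 32-bit counter needs no such fix-up. Whether the extra instruction appears depends on the compiler and optimization level, so compare the disassembly of both.

```c
#include <stdint.h>

/* 8-bit counter: the increment may need an extra zero-extend
   so that i keeps wrapping at 256. */
uint32_t sum_u8_counter(const uint16_t *buf, uint8_t n)
{
    uint32_t acc = 0;
    for (uint8_t i = 0; i < n; i++)
        acc += buf[i];
    return acc;
}

/* Native 32-bit counter: matches the register width,
   so no extension instruction is needed. */
uint32_t sum_u32_counter(const uint16_t *buf, uint32_t n)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc += buf[i];
    return acc;
}
```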


dthedens23 · ‎2013-04-04

Posted on April 04, 2013 at 18:08

Do you think that this could be improved?

while(GPIOA->IDR < 32766);

First, it is obfuscated. Second, would a single bit-band address for the bit be a faster access? Not that it matters much, because one is just waiting for the bit to change.
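
Editor's sketch of the bit-band idea, not code from the thread. On Cortex-M3/M4 each bit in the first 1 MB of the peripheral region (from 0x40000000) has a word-sized alias starting at 0x42000000. The macro and function names are made up, and bit 15 is only an assumption about which pin the loop waits on. The address 0x40020010 is GPIOA->IDR on the STM32F4.

```c
#include <stdint.h>

#define PERIPH_REGION_ADDR  0x40000000u  /* start of peripheral bit-band region */
#define PERIPH_ALIAS_ADDR   0x42000000u  /* start of peripheral bit-band alias  */

/* Alias address for one bit of a peripheral register:
   alias = alias_base + byte_offset * 32 + bit * 4 */
static uint32_t bitband_alias(uint32_t reg_addr, uint32_t bit)
{
    return PERIPH_ALIAS_ADDR + (reg_addr - PERIPH_REGION_ADDR) * 32u + bit * 4u;
}

/* On the target, a wait on GPIOA->IDR bit 15 (assumed pin) could then read:
 *
 *   volatile uint32_t *pin =
 *       (volatile uint32_t *)bitband_alias(0x40020010u, 15u);
 *   while (*pin == 0u) { }
 *
 * The alias reads back as 0 or 1, so no mask or compare constant is needed.
 */
```

This mainly makes the intent readable. Whether it polls any faster than reading IDR directly would need measuring.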

frankmeyer9 · ‎2013-04-04

Posted on April 04, 2013 at 18:46

And I'm still not convinced that you need float calculations.

Even 12-bit ADC data and a 12-bit sine/cosine table will not cause an overflow for a 128-point DFT with int32. And I'm sure you can reduce the resolution without compromising accuracy for your ''phase<10'' comparison.
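
Editor's sketch backing the overflow claim above, not code from the thread. The worst case is 4095 * 4095 * 128 = 2146435200, which is just below INT32_MAX (2147483647). The names are hypothetical, and the table is assumed to hold signed values with a magnitude of at most 12 bits.

```c
#include <stddef.h>
#include <stdint.h>

#define ADC_MAX     4095  /* largest 12-bit ADC sample          */
#define TABLE_MAX   4095  /* largest 12-bit table magnitude     */
#define DFT_POINTS  128   /* block length from the post         */

/* Worst-case accumulator magnitude, computed in 64 bits. */
static int64_t dft_worst_case_sum(void)
{
    return (int64_t)ADC_MAX * TABLE_MAX * DFT_POINTS;
}

/* One DFT bin with int32 accumulators. cos_tab and sin_tab hold
   the bin's scaled cosine and sine values for each sample index. */
static void dft_bin_int32(const uint16_t *adc,
                          const int16_t *cos_tab, const int16_t *sin_tab,
                          size_t n, int32_t *re, int32_t *im)
{
    int32_t acc_re = 0;
    int32_t acc_im = 0;

    for (size_t i = 0; i < n; i++) {
        acc_re += (int32_t)adc[i] * cos_tab[i];
        acc_im -= (int32_t)adc[i] * sin_tab[i];  /* exp(-j*w*i) convention */
    }
    *re = acc_re;
    *im = acc_im;
}
```

The margin is small, so a longer block or a wider table would need 64-bit accumulators or reduced resolution.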

And integer instructions take fewer cycles than FPU instructions.