Fast floating point math

linas2 · ‎2013-04-03

Posted on April 03, 2013 at 11:27

Hello, I have run into a problem where I need to do math with single-precision floating point.

It looks like the if(...) condition and converting the floating-point value to uint16_t are very slow.

So I have two questions:

First, what is the fastest way to convert a float to an integer? (The float will be between -10 and +10, and that should correspond to 0x0000 and 0xFFFF in the integer.)

Second, how can I read the binary data of a float variable into a u32 variable, so I can make the if(...) condition faster? (I need to set the float to zero if it is more than 10.0 or less than -10.0.)

(I know that the floating-point variable is in memory at address 0x2000000C, so all I need to do is copy the raw binary data to a u32 variable and check the mantissa and exponent to see whether it is out of limits. But I don't know how to copy 32 bits of data into a 32-bit unsigned integer: if I try to do that with pointers I get an error, or the compiler does a conversion for me and slows my program down.)
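
One way to approach both questions, as a minimal sketch rather than a tuned implementation: memcpy copies the raw bits of a float into a uint32_t without any numeric conversion (compilers typically reduce it to a plain register move), and a multiply-add followed by a cast maps -10..+10 onto 0x0000..0xFFFF. The function names below are illustrative and do not come from any library.

```c
#include <stdint.h>
#include <string.h>

/* Copy the raw IEEE-754 bit pattern of a float into a 32-bit integer.
   memcpy avoids both the numeric conversion and the aliasing problems
   of casting a float pointer to an integer pointer. */
static inline uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Return nonzero if |x| > 10.0f, using only integer operations.
   Clearing the sign bit leaves the magnitude, and IEEE-754 magnitudes
   order the same way as their bit patterns.
   0x41200000 is the bit pattern of 10.0f. NaN also reports nonzero. */
static inline int exceeds_10(float x)
{
    return (float_bits(x) & 0x7FFFFFFFu) > 0x41200000u;
}

/* Map -10.0f..+10.0f linearly onto 0x0000..0xFFFF.
   Out-of-range input (and NaN) is replaced by 0.0f first, as the
   question asks. */
static inline uint16_t float_to_u16(float x)
{
    if (!(x >= -10.0f && x <= 10.0f)) {
        x = 0.0f;
    }
    /* 65535 / 20 = 3276.75 counts per unit; +0.5f rounds to nearest. */
    return (uint16_t)((x + 10.0f) * 3276.75f + 0.5f);
}
```

With an FPU, a plain fabsf(x) > 10.0f is already only a few instructions, so the integer version is worth measuring rather than assuming it is faster.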

dthedens23 · ‎2013-04-03

Posted on April 03, 2013 at 18:14

You still have double precision floating point constants and magic numbers everywhere.

Why calculate both cosine and sine? Isn't there a relationship so that if I know one, I know the other?

But more important, what you want to do can be done with 32-bit math, lookup tables for sine/cosine, and attention to detail.

Or, if you insist on floating point, get an STM32F4 CPU.

linas2 · ‎2013-04-03

Posted on April 03, 2013 at 18:40

From: rocketdawg
You still have double precision floating point constants and magic numbers everywhere.
why calculate both cosine and sine? isn't there a relationship so that if I know one, I know the other?
but more important, what you want to do can be done with 32bit math, lookup tables for sine/cosine, and attention to detail.

I am using lookup tables for cosf and sinf, and I need both because this is code for a single-point DFT. And yes, I am using an STM32F407VG with the FPU enabled. But you are right about the constants: I added f to all constants and replaced 0 with 0.0f, and now the code runs at 53KHz. The theoretical limit for this loop is ~61KHz, so I guess that is enough. Changing the loop counter variable i to u8 also gained 5KHz.
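
The effect of the f suffix can be checked directly. As a small illustration (the function names are made up for this example): in C an unsuffixed constant such as 0.2 has type double, so x * 0.2 is computed in double precision even when x is a float. The Cortex-M4 FPU handles single precision only, so double arithmetic falls back to software routines.

```c
#include <stddef.h>

/* Size in bytes of the type the compiler uses for x * 0.2:
   the float operand is converted, so the product is a double. */
static size_t product_size_double_constant(void)
{
    float x = 1.0f;
    return sizeof(x * 0.2);
}

/* Same expression with an f suffix: the product stays a float. */
static size_t product_size_float_constant(void)
{
    float x = 1.0f;
    return sizeof(x * 0.2f);
}
```

The first function returns sizeof(double) and the second sizeof(float), which is a quick way to spot a constant that silently drags an expression into double precision.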

dthedens23 · ‎2013-04-03

Posted on April 03, 2013 at 19:43

Which compiler?

Do you actually see the FPU instructions in the disassembly?

linas2 · ‎2013-04-03

Posted on April 03, 2013 at 20:16

From: rocketdawg
which compiler?

I am using IAR 6.40; now I will try downloading the new one, 6.5, which claims to be faster :) The only place I can save time now is in the main loop. The load clock for the FIFO is 8MHz and the unload clock is 7.28MHz, so I need something faster than software polling to do the job (I removed all unnecessary instructions from that loop so it runs at maximum speed).

while(GPIOC->IDR < 32766);

I know that the timer examples include one with a trigger. Can I use it somehow, and if yes, how? (I have already offloaded all the timing work to the FPGA to gain speed; only a few things are left for the STM32F4.)

Tesla DeLorean · ‎2013-04-03

Posted on April 03, 2013 at 20:34

Stop using the C compiler as a crutch, code the critical loops in assembler. At the very least look at what the compiler is generating and tweak that, heck, even attach it to a post.

I still don't understand why if you've got an FPGA you need to fiddle around with the GPIO ports, wait on multiple clock edges, generate bit banged clocks going high and low. Surely this can be achieved with the FSMC bus accessing peripheral registers on the FPGA, on a bus that should be able to yield 30-60 MBps burst rates.

I can't believe that doing a bit test is slower than seeing if something is 0x8000 (32768) or not for bit 15 of the bus.
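
To make the comparison concrete, here is a hedged sketch of the two polling conditions. It assumes the flag being waited on is bit 15 of the port, which the thread suggests but does not state outright. The names are illustrative, and the register is passed as a pointer so the code can be tried off-target.

```c
#include <stdint.h>

/* Polling as written in the thread: spin while the whole port value
   is below 32766. */
static inline void wait_compare(const volatile uint32_t *idr)
{
    while (*idr < 32766u) { }
}

/* Polling on bit 15 only: spin while that single bit is clear.
   The other 15 pins no longer affect the result. */
static inline void wait_bit15(const volatile uint32_t *idr)
{
    while ((*idr & 0x8000u) == 0u) { }
}

/* The two exit conditions as pure functions, for comparison. */
static inline int ready_compare(uint32_t v) { return v >= 32766u; }
static inline int ready_bit15(uint32_t v)   { return (v & 0x8000u) != 0u; }
```

The two tests agree whenever bit 15 is set, but differ for 32766 and 32767 (0x7FFE and 0x7FFF), where the comparison reports ready although bit 15 is clear.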

Why not post the assembler generated now?

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

frankmeyer9 · ‎2013-04-03

Posted on April 03, 2013 at 20:36

I believe rocketdawg is right - you still have a lot of double constants.

This makes a difference unless you have a compiler flag like ''treat doubles as floats'' set.

Why use a DFT in the first place?

Have you thought about reducing the sampling frequency, i.e. reducing the number of loops?

linas2 · ‎2013-04-03

Posted on April 03, 2013 at 21:00

I have a linear 128-point photodiode array (128 pixels) running at an 8MHz pixel read speed. I also have an ADC board controlled by an FPGA to do A-to-D conversion at 16-bit resolution.

The ADC board is connected to a FIFO, again controlled by the FPGA, which takes care of the pointer, the ADC's 7-cycle delay, and things like that. The STM32F4 only says ''I want to start reading the detector'' by setting one bit on GPIO. The FPGA generates the right signals for the detector and the ADC, resets the FIFO pointer, and holds the FIFO buffer in reset until the right pixel data is coming out of the ADC; then it enables the FIFO and the FPGA's job is done. The STM32F4 reads the FIFO buffer and computes a single-point DFT while still reading data from the FIFO. That is the fastest way, because I can use signal decomposition for the spectrum calculation; this is DSP stuff. Then I calculate atan2 to get the phase of the detector modulation frequency, and do simple PI controller stuff to stabilise that modulation (like PID, just without the D). C code:

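/* Build the cosine/sine lookup tables once, before the main loop. */
while(i<128)
{
    cosinusas[i]=(float32_t)cosf((6.28318531f*i*POINT)/N);
    sinusas[i] =(float32_t)sinf((6.28318531f*i*POINT)/N);
    i++;
}
while(1)
{
    while(GPIOA->IDR < 32766);
    GPIOD->BSRRL= GPIO_Pin_13;      /* set PD13 */
    while(GPIOA->IDR > 32766);
    while(GPIOA->IDR < 32766);
    GPIOD->BSRRH= GPIO_Pin_13;      /* clear PD13 */
    imag=real=i=0;
    /* Read 128 samples and accumulate the single-point DFT. */
    while(i<128)
    {
        CLK_LOW;
        while(GPIOC->IDR < 32766);  /* software polling on GPIOC */
        CLK_HIGH;
        k=GPIOB->IDR;               /* 16-bit sample */
        real=real+k*cosinusas[i];
        imag=imag-k*sinusas[i];
        i++;
    }
    /* Integrate the phase, reset it if out of range, write the DAC. */
    faze=faze+(0.2f*Faze(imag,real));
    if(fabsf(faze)>9.372f)
        faze=0.0f;
    DAC_DATA=(int)(faze*375.0f+3270.0f);
}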
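
Faze() is not shown in the thread. Judging only from the source lines interleaved in the disassembly posted here (angle = coeff_1 - coeff_1 * r; and similar), it is probably a common fast atan2 approximation. The sketch below is a reconstruction under that assumption: the name fast_atan2f, the constants, and the small epsilon are guesses, not the poster's code.

```c
/* Fast atan2 approximation (linear in r), a reconstruction of what
   Faze() might look like. Returns an angle in radians in roughly
   -pi..+pi; the error of this kind of approximation is on the order
   of a few hundredths of a radian. */
static float fast_atan2f(float y, float x)
{
    const float coeff_1 = 0.78539816f;   /* pi/4   */
    const float coeff_2 = 2.35619449f;   /* 3*pi/4 */
    /* |y| plus a tiny epsilon (assumed value) to avoid 0/0. */
    float abs_y = (y > 0.0f ? y : -y) + 1e-10f;
    float r;
    float angle;

    if (x >= 0.0f) {
        r = (x - abs_y) / (x + abs_y);
        angle = coeff_1 - coeff_1 * r;
    } else {
        r = (x + abs_y) / (abs_y - x);
        angle = coeff_2 - coeff_1 * r;
    }
    /* Lower half-plane: mirror the angle. */
    return (y < 0.0f) ? -angle : angle;
}
```

Each call costs one division (VDIV.F32 in the listing), which is much cheaper than a library atan2f, at the price of reduced accuracy.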

The asm code looks like this, if someone understands it:

??main_0:
0x8000424: 0x4958 LDR.N R1, ??DataTable3_15 ; GPIOA_IDR
0x8000426: 0xf647 0x70fe MOVW R0, #32766 ; 0x7ffe
0x800042a: 0x680a LDR R2, [R1]
0x800042c: 0x4282 CMP R2, R0
0x800042e: 0xd3f9 BCC.N ??main_0 ; 0x8000424
GPIOD->BSRRL= GPIO_Pin_13; 
0x8000430: 0xf44f 0x5200 MOV.W R2, #8192 ; 0x2000
0x8000434: 0xf8a1 0x2c08 STRH.W R2, [R1, #0xc08]
0x8000438: 0xf647 0x73ff MOVW R3, #32767 ; 0x7fff
while(GPIOA->IDR > 32766);
??main_2:
0x800043c: 0x680d LDR R5, [R1]
0x800043e: 0x429d CMP R5, R3
0x8000440: 0xd2fc BCS.N ??main_2 ; 0x800043c
while(GPIOA->IDR < 32766);
??main_3:
0x8000442: 0x680b LDR R3, [R1]
0x8000444: 0x4283 CMP R3, R0
0x8000446: 0xd3fc BCC.N ??main_3 ; 0x8000442
GPIOD->BSRRH= GPIO_Pin_13;
0x8000448: 0xf8a1 0x2c0a STRH.W R2, [R1, #0xc0a]
imag=real=i=0; 
0x800044c: 0x2200 MOVS R2, #0
0x800044e: 0x7022 STRB R2, [R4]
0x8000450: 0x6062 STR R2, [R4, #0x4]
0x8000452: 0x60a2 STR R2, [R4, #0x8]
0x8000454: 0xf504 0x7204 ADD.W R2, R4, #528 ; 0x210
0x8000458: 0xf104 0x0310 ADD.W R3, R4, #16 ; 0x10
0x800045c: 0xf44f 0x6500 MOV.W R5, #2048 ; 0x800
CLK_LOW;
??main_4:
0x8000460: 0xf8a1 0x5c0a STRH.W R5, [R1, #0xc0a]
while(GPIOC->IDR < 32766);
??main_5:
0x8000464: 0xf8d1 0x6800 LDR.W R6, [R1, #0x800]
0x8000468: 0x4286 CMP R6, R0
0x800046a: 0xd3fb BCC.N ??main_5 ; 0x8000464
CLK_HIGH;
0x800046c: 0xf8a1 0x5c08 STRH.W R5, [R1, #0xc08]
k=GPIOB->IDR;
0x8000470: 0xf8d1 0x6400 LDR.W R6, [R1, #0x400]
0x8000474: 0xedd4 0x0a01 VLDR S1, [R4, #4]
0x8000478: 0x8066 STRH R6, [R4, #0x2]
0x800047a: 0x8866 LDRH R6, [R4, #0x2]
0x800047c: 0xed92 0x1a00 VLDR S2, [R2]
0x8000480: 0xee00 0x6a10 VMOV S0, R6
0x8000484: 0xeeb8 0x0a40 VCVT.FU32 S0, S0
0x8000488: 0xee40 0x0a01 VMLA.F32 S1, S0, S2
i++;
0x800048c: 0x7826 LDRB R6, [R4]
0x800048e: 0xed94 0x1a02 VLDR S2, [R4, #8]
0x8000492: 0xedc4 0x0a01 VSTR S1, [R4, #4]
0x8000496: 0x1c76 ADDS R6, R6, #1
0x8000498: 0xedd3 0x1a00 VLDR S3, [R3]
0x800049c: 0x7026 STRB R6, [R4]
0x800049e: 0xee00 0x1a61 VMLS.F32 S2, S0, S3
0x80004a2: 0x1d1b ADDS R3, R3, #4
0x80004a4: 0x1d12 ADDS R2, R2, #4
while(i<128)
0x80004a6: 0xb2f6 UXTB R6, R6
0x80004a8: 0xed84 0x1a02 VSTR S2, [R4, #8]
0x80004ac: 0x2e80 CMP R6, #128 ; 0x80
0x80004ae: 0xd3d7 BCC.N ??main_4 ; 0x8000460
if(y>0.0f)
0x80004b0: 0xeeb5 0x1a40 VCMP.F32 S2, #0.0
0x80004b4: 0xed9f 0x0a28 VLDR S0, ??DataTable3_3 ; [0x8000558]
0x80004b8: 0xeef1 0xfa10 VMRS APSR_nzcv, FPSCR
0x80004bc: 0xbfce ITEE GT
0x80004be: 0xee31 0x0a00 VADDGT.F32 S0, S2, S0
0x80004c2: 0xeeff 0x1a00 VMOVLE.F32 S3, #-1
0x80004c6: 0xee01 0x0a21 VMLALE.F32 S0, S2, S3
if (x>=0.0f)
0x80004ca: 0xeef5 0x0a40 VCMP.F32 S1, #0.0
0x80004ce: 0xee70 0x1a80 VADD.F32 S3, S1, S0
0x80004d2: 0xed9f 0x2a22 VLDR S4, ??DataTable3_4 ; [0x800055c]
0x80004d6: 0xeef1 0xfa10 VMRS APSR_nzcv, FPSCR
0x80004da: 0xdb06 BLT.N ??main_6 ; 0x80004ea
angle = coeff_1 - coeff_1 * r;
0x80004dc: 0xee30 0x0ac0 VSUB.F32 S0, S1, S0
0x80004e0: 0xeec0 0x0a21 VDIV.F32 S1, S0, S3
0x80004e4: 0xeeb0 0x0a42 VMOV.F32 S0, S4
0x80004e8: 0xe005 B.N ??main_7 ; 0x80004f6
angle = coeff_2 - coeff_1 * r;
??main_6:
0x80004ea: 0xee30 0x0a60 VSUB.F32 S0, S0, S1
0x80004ee: 0xeec1 0x0a80 VDIV.F32 S1, S3, S0
0x80004f2: 0xed9f 0x0a16 VLDR S0, ??DataTable3 ; [0x800054c]
if (y < 0.0f)
??main_7:
0x80004f6: 0xeeb5 0x1a40 VCMP.F32 S2, #0.0
0x80004fa: 0xee00 0x0ac2 VMLS.F32 S0, S1, S4
0x80004fe: 0xeef1 0xfa10 VMRS APSR_nzcv, FPSCR
0x8000502: 0xbf48 IT MI
0x8000504: 0xeeb1 0x0a40 VNEGMI.F32 S0, S0
0x8000508: 0xed9f 0x1a15 VLDR S2, ??DataTable3_5 ; [0x8000560]
0x800050c: 0xedd4 0x0a03 VLDR S1, [R4, #12]
0x8000510: 0xee40 0x0a01 VMLA.F32 S1, S0, S2
0x8000514: 0xedc4 0x0a03 VSTR S1, [R4, #12]
if(fabsf(faze)>9.372f)
0x8000518: 0xeeb0 0x0ae0 VABS.F32 S0, S1
0x800051c: 0xeddf 0x0a11 VLDR S1, ??DataTable3_6 ; [0x8000564]
0x8000520: 0xeeb4 0x0a60 VCMP.F32 S0, S1
0x8000524: 0xeef1 0xfa10 VMRS APSR_nzcv, FPSCR
0x8000528: 0xbfa4 ITT GE
0x800052a: 0x2000 MOVGE R0, #0
0x800052c: 0x60e0 STRGE R0, [R4, #0xc]
DAC_DATA=(int)(faze*375.0f+3270.0f);
0x800052e: 0xed94 0x0a03 VLDR S0, [R4, #12]
0x8000532: 0xeddf 0x0a0d VLDR S1, ??DataTable3_7 ; [0x8000568]
0x8000536: 0xed9f 0x1a0d VLDR S2, ??DataTable3_8 ; [0x800056c]
0x800053a: 0xee00 0x1a20 VMLA.F32 S2, S0, S1
0x800053e: 0xeebd 0x0ac1 VCVT.SF32 S0, S2
0x8000542: 0x4912 LDR.N R1, ??DataTable3_16 ; 0x60020000 (1610743808)
0x8000544: 0xee10 0x0a10 VMOV R0, S0
0x8000548: 0x8008 STRH R0, [R1]
0x800054a: 0xe76b B.N ??main_0 ; 0x8000424
If I unwind the while(i<128) loop by hand (without i, just plain text), the speed goes from 53 to 4KHz.
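
One thing visible in the listing: inside the loop, i, k, real, and imag are loaded from and stored back to memory through R4 on every pass (VLDR/VSTR [R4, #4], STRH/LDRH [R4, #0x2], LDRB/STRB [R4]), which suggests they are globals, possibly volatile. Here is a hedged sketch of the same accumulation with local variables, which lets the compiler keep them in registers. The function name and signature are illustrative, and the samples come from an array rather than GPIOB->IDR so the code can run off-target.

```c
#include <stddef.h>
#include <stdint.h>

/* Single-point DFT over n samples using precomputed tables.
   Accumulators are locals, so they can stay in FPU registers for
   the whole loop and are written to memory only once at the end. */
static void single_bin_dft(const uint16_t *samples,
                           const float *cos_tab,
                           const float *sin_tab,
                           size_t n,
                           float *re_out, float *im_out)
{
    float re = 0.0f;
    float im = 0.0f;

    for (size_t i = 0; i < n; i++) {
        float k = (float)samples[i];   /* unsigned 16-bit sample */
        re += k * cos_tab[i];
        im -= k * sin_tab[i];
    }
    *re_out = re;
    *im_out = im;
}
```

In the real loop the sample would still be read from the port after the FIFO handshake; only the bookkeeping variables change.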

linas2 · ‎2013-04-03

Posted on April 03, 2013 at 22:58

Last trick:

I divided the main while(i<128) loop into 8 blocks (8*16=128) and the speed goes to 5537KHz,

while(i<128)
{
//TODO
i++;
//TODO
i++;
//TODO
i++;
//TODO
i++;
/*--------------*/
//TODO
i++;
}

And with that I am more than happy. (Dividing into 16x16 starts to slow things down.) Now only the software EMPTY flag checking is left, and if that can be solved, I should gain 3-5KHz. Thanks to everyone who helped!
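
The partial unrolling described above can be sketched as follows. This is an illustration with an unroll factor of 4 and array input instead of the FIFO port; the macro and function names are assumptions. The trade-off is fewer compare-and-branch instructions per sample against larger code, which fits the observation that unrolling too far slows things down again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One DFT step; the macro advances i itself, like the //TODO; i++;
   pairs in the post. */
#define DFT_STEP()                        \
    do {                                  \
        float k = (float)samples[i];      \
        re += k * cos_tab[i];             \
        im -= k * sin_tab[i];             \
        i++;                              \
    } while (0)

/* Single-point DFT with the loop body repeated 4 times per pass.
   n must be a multiple of 4. */
static void single_bin_dft_unrolled4(const uint16_t *samples,
                                     const float *cos_tab,
                                     const float *sin_tab,
                                     size_t n,
                                     float *re_out, float *im_out)
{
    float re = 0.0f;
    float im = 0.0f;
    size_t i = 0;

    assert(n % 4 == 0);
    while (i < n) {
        DFT_STEP();
        DFT_STEP();
        DFT_STEP();
        DFT_STEP();
    }
    *re_out = re;
    *im_out = im;
}
```

Changing the unroll factor only means repeating DFT_STEP() more or fewer times, so the sweet spot can be found by measurement.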

Tesla DeLorean · ‎2013-04-03

Posted on April 04, 2013 at 00:10

You have 'i' as an 8-bit unsigned integer don't you? Why?

Consider also why you need to strobe the clock high and low instead of just toggling it.


linas2 · ‎2013-04-03

Posted on April 04, 2013 at 05:45

From: clive1
You have 'i' as an 8-bit unsigned integer don't you? Why?

What else can I use? It's only one bit longer than I need. Also, a toggle takes one instruction more: it needs an EOR.
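
On the counter width: the disassembly earlier in the thread contains UXTB R6, R6 just before CMP R6, #128, which is the instruction that truncates a register to 8 bits for the u8 counter. Registers on Cortex-M are 32 bits wide, so a 32-bit counter needs no such step. A small sketch (names are illustrative) showing that both counter types give the same result for 128 iterations, so the choice can be made purely on measured speed:

```c
#include <stdint.h>

/* Loop with an 8-bit counter: the compiler must keep the value
   wrapped to 0..255, which can cost an extra instruction. */
static uint32_t count_with_u8(void)
{
    uint32_t passes = 0;
    for (uint8_t i = 0; i < 128; i++) {
        passes++;
    }
    return passes;
}

/* Same loop with a counter as wide as a CPU register. */
static uint32_t count_with_u32(void)
{
    uint32_t passes = 0;
    for (uint32_t i = 0; i < 128; i++) {
        passes++;
    }
    return passes;
}
```

Whether the wider counter is actually faster here depends on the compiler and on where i is stored, so it is worth checking the disassembly again after the change.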