How to convert old code to use the FPU?

SSRIN.1 · ‎2022-01-20

Hi, I am trying to optimise the gait generation code for my robot. I use a lot of float operations and want to use the FPU to accelerate the computation.

As an experiment I am using the following task:

TickType_t start = xTaskGetTickCount();
while (i < 100000000)
{
    float a = sin(1.7) + sin(0.58) + cos(0.66) / (2 * (atan2(3, 5) + sin(0.11)));
    i++;
}
TickType_t end = xTaskGetTickCount();
sprintf(msg, "%lu \r\n", end - start);
HAL_UART_Transmit(&huart3, (uint8_t *)msg, strlen(msg), 15);
i = 0;

I also set up the flags

But the time taken to execute the task is the same as before setting the above flags.

I was under the assumption that once the flags are set, all legacy C math operations would be compiled to use the FPU. Am I wrong in that assumption?

Please advise!

Thank You

Yours Sincerely,

S.Shyam

thundertronics.com · ‎2022-01-21

I think the FPU is used regardless of the __FPU_PRESENT symbol, because it is already defined in

"\Drivers\CMSIS\Device\ST\STM32H7xx\Include\stm32h750xx.h" (this path for STM32H750):

 #define __FPU_PRESENT       1    /*!< FPU present */

I think you can evaluate non-FPU performance by setting Floating-point unit to "None":

This may also need some additional changes to the HAL driver defines.

For a performance boost, use an integer implementation with lookup tables, CORDIC, polynomials, etc.

For SIN/COS I use a simple lookup table with a 32-bit phase accumulator.
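To make the lookup-table idea concrete, here is a minimal sketch, not the poster's actual code. It assumes a deliberately tiny 16-entry table in Q15 format (a real table would normally be much larger), and the names sine_lut, sine_lut_lookup and cosine_lut_lookup are made up for illustration. The full 32-bit range of the phase maps to one period, so the phase wraps around for free on overflow.

```c
#include <stdint.h>

#define LUT_BITS 4u                      /* index width: 2^4 = 16 entries (assumed size) */

/* One full sine period in Q15 (32767 represents +1.0), 22.5 degrees per entry. */
static const int16_t sine_lut[1u << LUT_BITS] = {
         0,  12539,  23170,  30273,  32767,  30273,  23170,  12539,
         0, -12539, -23170, -30273, -32767, -30273, -23170, -12539
};

/* phase: 0 .. 2^32 corresponds to 0 .. 2*pi.
 * The top LUT_BITS bits of the phase select the table entry. */
int16_t sine_lut_lookup(uint32_t phase)
{
    return sine_lut[phase >> (32u - LUT_BITS)];
}

/* Cosine is sine advanced by a quarter period (2^32 / 4).
 * The unsigned addition wraps around, which is exactly what we want. */
int16_t cosine_lut_lookup(uint32_t phase)
{
    return sine_lut_lookup(phase + 0x40000000u);
}
```

In use, the accumulator is just a uint32_t that gets a fixed increment added on every step (phase += increment), and the increment sets the frequency. Accuracy can be improved with a larger table or by interpolating between neighbouring entries.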

For ATAN2 I use a polynomial ATAN from "Efficient Approximations for the Arctangent Function" and am considering replacing it with a lookup table.
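A polynomial arctangent could look roughly like the sketch below. It assumes the second-order form atan(x) ≈ (π/4)x + 0.273x(1 - |x|) for |x| <= 1, which is the form usually quoted from that paper (please check the coefficient against the paper before relying on it), with an error of a few milliradians. It is written in float rather than integer arithmetic to keep it short, and the names approx_atan_unit and approx_atan2 are hypothetical.

```c
/* Approximate atan(x) for -1 <= x <= 1 (assumed second-order form). */
static float approx_atan_unit(float x)
{
    const float quarter_pi = 0.78539816f;
    float ax = (x < 0.0f) ? -x : x;
    return quarter_pi * x + 0.273f * x * (1.0f - ax);
}

/* Extend to all four quadrants by always dividing the smaller
 * magnitude by the larger one, so the ratio stays within [-1, 1]. */
float approx_atan2(float y, float x)
{
    const float pi = 3.14159265f;
    const float half_pi = 1.57079633f;
    float ax = (x < 0.0f) ? -x : x;
    float ay = (y < 0.0f) ? -y : y;

    if (ax == 0.0f && ay == 0.0f) {
        return 0.0f;                      /* undefined input, return 0 by convention */
    }
    if (ax >= ay) {
        float r = approx_atan_unit(y / x);
        if (x < 0.0f) {
            r += (y >= 0.0f) ? pi : -pi;  /* left half-plane correction */
        }
        return r;
    }
    /* |y| > |x|: use atan2(y, x) = +/- pi/2 - atan(x / y) */
    return ((y > 0.0f) ? half_pi : -half_pi) - approx_atan_unit(x / y);
}
```

For the atan2(3,5) term in the original test loop this gives about 0.537 against a true value of about 0.540, so whether it is good enough depends on how much angle error the gait can tolerate.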

SSRIN.1 · ‎2022-01-21

One person told me that I have to use DSP in the CubeF7 driver to work with FPU.

https://github.com/STMicroelectronics/STM32CubeF7/tree/master/Drivers/CMSIS/DSP/Source

They are not related, right? The DSP library can be used regardless of whether the FPU is used, correct?

SSRIN.1 · ‎2022-01-22

What's the difference between FPU and Floating point ABI?

TDK · ‎2022-01-24

DSP and FPU are independent. Use of one does not require or imply use of another.

The "Floating point ABI" is the correct place to set whether the compiler creates FPU instructions or emulated floating point support.
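One way to double-check which setting actually reached the compiler is to look at its predefined macros. The sketch below assumes a GCC or Clang toolchain targeting 32-bit ARM, where the setting corresponds to the -mfloat-abi option (soft, softfp or hard). The function name float_abi_name is made up for illustration.

```c
/* Report the floating-point ABI the code was compiled for, based on
 * macros that GCC and Clang predefine for 32-bit ARM targets. */
const char *float_abi_name(void)
{
#if defined(__ARM_PCS_VFP)
    return "hard";    /* FPU instructions, float arguments passed in FPU registers */
#elif defined(__SOFTFP__)
    return "soft";    /* no FPU instructions, floating point emulated in software */
#elif defined(__ARM_FP)
    return "softfp";  /* FPU instructions, float arguments passed in core registers */
#else
    return "unknown (not a 32-bit ARM build, or a different compiler)";
#endif
}
```

Printing this string once over the UART at start-up shows whether the build being timed is really the one you think it is.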

Bob S · ‎2022-01-25

Line 4 in your sample code is "invariant" (the answer never changes) and the result (stored in "a") is never used. Likewise, the resulting value of "i" is never used outside the loop. Depending on your optimization setting, that line, and in fact the entire while() loop, may be optimized out of the executable code.

SSRIN.1 · ‎2022-01-29

After experiments, these are my observations.

I used the DWT cycle counter to measure the cycles consumed.
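For anyone who wants to repeat this, here is a rough sketch of a DWT cycle counter setup. It is not the exact code used for the measurements above. It assumes a Cortex-M4/M7 and writes the architectural register addresses directly; with the CMSIS headers the same registers are CoreDebug->DEMCR, DWT->CTRL and DWT->CYCCNT. The unlock write to the lock access register is an assumption that is only needed on some Cortex-M7 parts. The function names are hypothetical, and the #else branch is only a stand-in so the code also builds on a PC.

```c
#include <stdint.h>

#if defined(__arm__) && defined(__ARM_ARCH_7EM__)
/* Cortex-M4/M7 debug registers (ARMv7-M architectural addresses). */
#define DEMCR_REG       (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL_REG    (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT_REG  (*(volatile uint32_t *)0xE0001004u)
#define DWT_LAR_REG     (*(volatile uint32_t *)0xE0001FB0u)

void cycles_init(void)
{
    DEMCR_REG |= 1u << 24;        /* TRCENA: enable the DWT unit */
    DWT_LAR_REG = 0xC5ACCE55u;    /* unlock (assumed to be needed on some M7 parts) */
    DWT_CYCCNT_REG = 0u;          /* reset the counter */
    DWT_CTRL_REG |= 1u;           /* CYCCNTENA: start counting core clock cycles */
}

uint32_t cycles_now(void)
{
    return DWT_CYCCNT_REG;
}
#else
/* Off-target stand-in: uses clock() ticks instead of core cycles. */
#include <time.h>

void cycles_init(void)
{
}

uint32_t cycles_now(void)
{
    return (uint32_t)clock();
}
#endif

/* Unsigned subtraction stays correct across one 32-bit counter wrap. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

Call cycles_init() once, then read cycles_now() immediately before and after the code under test and pass both values to cycles_elapsed().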

1. In STM32Cube, the compiler is set to use the FPU by default. Interestingly, the STM32Cube version of FreeRTOS requires the FPU flag in order to compile.
2. Compared to software emulation, the hardware FPU is at least 10 times faster. Adding the DSP library compiled for the FPU makes it about 20 times faster.
3. The cycles consumed increase as the magnitude of the operand increases (12.45f > 2.45f).
4. Apart from this, compiler optimization also seems to play a role.

Measuring the clock cycles of a repeated calculation in a for loop is more consistent and stable than writing the line n times one after the other (some repetitions of the same operation consume more cycles).
Declarations are optimized in a weird way. Say I use a variable (t) in the calculation.

I do two types of declaration:

//Type 1

float t = 2.45f;

//Type 2

float t = 2.45f;

float32_t t2 = t;

Of the two declarations of t, Type 2 seems to be 20-30 cycles faster, even though I am not even using t2.

Thank you everyone for the advice !

KnarfB · ‎2022-01-30

I'm late here, but anyway: Check the code disassembly to better understand what happened and what you are measuring. Are FPU instructions generated or function calls to SW emulations? Does the assembler code correspond 1:1 to the source code?

In your above loop, chances are high that the compiler optimizes away calculations with unused results.

Declare variables volatile to ensure that the calculations are not optimized away.

The compiler may reorder instructions statically (at compile time) for better performance.

On a Cortex-M7, the CPU may reorder memory accesses dynamically (at execution time) for better performance. This, together with cache behaviour, makes it quite complex to get accurate timing. See __DMB, memory barriers, etc.

For example, I doubt that in general

> 3. The cycles consumed increase as the magnitude of the operand increases (12.45f > 2.45f).

hth

KnarfB