Problems with FPU STM32F4

florianaugustin9
Associate II
Posted on August 21, 2012 at 12:13

Dear STM32 experts,

I want to use the FPU of my STM32F4Discovery, so I set the following compiler options:

-Arm Architecture: v7EM

-Arm Core Type: Cortex-M4

-Arm FPU Type: FPv4-SP-D16

-GCC Target: arm-unknown-eabi

After this, CP10 and CP11 are both 0b11, which should be good.

But I found some test code on the net (http://www.mikrocontroller.net/topic/261021):

#define CORE_SysTickEn()    (*((u32*)0xE0001000)) = 0x40000001

#define CORE_SysTickDis()   (*((u32*)0xE0001000)) = 0x40000000

#define CORE_GetSysTick()   (*((u32*)0xE0001004))

float f = 1.01f;

CORE_SysTickEn();

vu32 it = CORE_GetSysTick();

float f2 = f * 2.29f;

vu32 it2 = CORE_GetSysTick() - it;

He measures 11 cycles for this, of which 6 are overhead from reading the counter.

So he needs 5 cycles for the multiplication and assignment.

But I need 18 cycles total which means 12 for the multiplication and assignment :( :(

He also uses the STM32F4Discovery.

Any ideas what the reason could be? 12 cycles are definitely too many for this multiplication and assignment...

My disassembly code for this is:

float f2 = f * 2.29f;

   ED977A03    vldr s14, [r7, #12]

   EDDF7A0B    vldr s15, 0x080003F0 <__text_start__+0x5C>

   EE677A27    vmul.f32 s15, s14, s15

   EDC77A02    vstr s15, [r7, #8]

Thank you for responses!

Florian

#stm32f4discovery-fpu-cp10-cp11
8 REPLIES
frankmeyer9
Associate II
Posted on August 21, 2012 at 13:05

But I need 18 cycles total which means 12 for the multiplication and assignment :( :(

 

How do you measure cycles ?

florianaugustin9
Associate II
Posted on August 21, 2012 at 13:25

With these three lines:

vu32 it = CORE_GetSysTick();

float f2 = f * 2.29f;

vu32 it2 = CORE_GetSysTick() - it;

Then I set a breakpoint after this line and check the value of it2. That is the cycle count!
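A note on the macros used for this measurement: despite the SysTick names, the addresses 0xE0001000 and 0xE0001004 match the DWT control register and the DWT cycle counter (DWT_CYCCNT) of the Cortex-M4, not the SysTick timer. The DWT counter only runs when the TRCENA bit in DEMCR (0xE000EDFC) is set, which a debugger often does already. The sketch below shows one way to start the counter explicitly. The `cycle_regs` type and the function names are hypothetical helpers, not part of the thread's code; the registers are passed as pointers so the logic can also be exercised against plain variables on a PC.

```c
#include <stdint.h>

/* Hypothetical helper type: pointers to the three registers involved. */
typedef struct {
    volatile uint32_t *demcr;   /* DEMCR,      0xE000EDFC on the target */
    volatile uint32_t *ctrl;    /* DWT_CTRL,   0xE0001000 on the target */
    volatile uint32_t *cyccnt;  /* DWT_CYCCNT, 0xE0001004 on the target */
} cycle_regs;

#define DEMCR_TRCENA   (1u << 24)  /* enables the DWT unit */
#define DWT_CYCCNTENA  (1u << 0)   /* enables the cycle counter */

/* Enable tracing, clear the counter, then let it run. */
static void cycle_counter_start(const cycle_regs *r)
{
    *r->demcr |= DEMCR_TRCENA;
    *r->cyccnt = 0;
    *r->ctrl  |= DWT_CYCCNTENA;
}

/* Read the current cycle count. */
static uint32_t cycle_counter_read(const cycle_regs *r)
{
    return *r->cyccnt;
}

/* On the target the struct would be filled like this (assumption, untested here):
 *   cycle_regs hw = { (volatile uint32_t *)0xE000EDFC,
 *                     (volatile uint32_t *)0xE0001000,
 *                     (volatile uint32_t *)0xE0001004 };
 *   cycle_counter_start(&hw);
 */
```

Using read-modify-write on DWT_CTRL leaves the other control bits untouched, which is the main difference from writing a fixed constant to the register.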

frankmeyer9
Associate II
Posted on August 21, 2012 at 13:42

Without having looked up this function, I suggest subtracting the cost of an empty measurement:

vu32 it  = CORE_GetSysTick();

vu32 it2 = CORE_GetSysTick();

vu32 diff = it2 - it;

i.e. subtract diff from the result you get.
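The calibration idea above can be sketched as follows. To keep the example runnable on a PC, the cycle counter is replaced by a simulated one: `fake_cycles`, `read_cycles`, `work` and `no_work` are hypothetical stand-ins, and the costs of 6 and 5 cycles are simply the figures quoted earlier in the thread, not measurements.

```c
#include <stdint.h>

/* Simulated cycle counter (assumption: stands in for the hardware counter). */
static uint32_t fake_cycles = 0;

/* Each counter read is pretended to cost 6 cycles. */
static uint32_t read_cycles(void)
{
    fake_cycles += 6;
    return fake_cycles;
}

/* Code under test, pretended to cost 5 cycles. */
static void work(void)
{
    fake_cycles += 5;
}

/* Empty measurement used for calibration. */
static void no_work(void)
{
}

/* Raw cycles between two counter reads placed around fn(). */
static uint32_t raw_cycles(void (*fn)(void))
{
    uint32_t start = read_cycles();
    fn();
    return read_cycles() - start;  /* unsigned subtraction survives a counter wrap */
}

/* Cycles attributable to fn() alone: raw result minus the empty measurement. */
static uint32_t net_cycles(void (*fn)(void))
{
    uint32_t overhead = raw_cycles(no_work);
    return raw_cycles(fn) - overhead;
}
```

With these simulated costs, `raw_cycles(work)` gives 11 and `net_cycles(work)` gives 5. On real hardware the overhead should be measured with the same compiler settings as the code under test, since it changes with optimization level.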

Posted on August 21, 2012 at 14:50

Executing from RAM or FLASH?

Figure a random read from FLASH is on the order of 35-42 ns.

Consider pipelining as well: it is better to time multiple linear instantiations and compute THROUGHPUT.

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..
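One possible reading of "multiple linear instantiations" is to write the operation out several times in a row, time the whole block once, and divide. The sketch below assumes that reading; `mul8` and `cycles_per_op` are hypothetical names, and the start and end values would come from whatever cycle counter is in use.

```c
#include <stdint.h>

/* Eight independent multiplies written out one after another (no loop),
 * so no loop-counter or branch cycles are mixed into the measurement.
 * volatile keeps the compiler from folding the work away. */
static void mul8(const volatile float in[8], volatile float out[8], float k)
{
    out[0] = in[0] * k;
    out[1] = in[1] * k;
    out[2] = in[2] * k;
    out[3] = in[3] * k;
    out[4] = in[4] * k;
    out[5] = in[5] * k;
    out[6] = in[6] * k;
    out[7] = in[7] * k;
}

/* Average cycles per operation for one timed run of n back-to-back operations.
 * overhead is the cost of an empty measurement. */
static double cycles_per_op(uint32_t start, uint32_t end,
                            uint32_t overhead, uint32_t n)
{
    uint32_t raw = end - start;  /* wrap-safe for a 32-bit counter */
    return (double)(raw - overhead) / (double)n;
}
```

For example, a block of 8 operations that reads 22 raw cycles with 6 cycles of overhead averages 2 cycles per operation. The average reflects throughput, so it can be lower than the latency seen when a single operation is timed in isolation.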
florianaugustin9
Associate II
Posted on August 21, 2012 at 20:28

Oh, that's a really nice idea. I'm running at 168 MHz, so I have 5 wait states for the flash.

But unfortunately I have no idea whether I'm running from flash or RAM.

I used the standard settings from the CrossWorks IDE. When I compile the project, 4.8 kB of flash and 0.2 kB of RAM are used. I suppose it runs from flash?

And I would be really glad if you could explain the pipeline thing to me. I basically know what a pipeline is from university, but I didn't get it.

Thank you for your help!
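The numbers mentioned so far (168 MHz, 5 wait states, 35-42 ns for a random flash read) can be related with a little arithmetic. The sketch below assumes that a flash access with N wait states takes N + 1 core clock cycles; the function names are hypothetical.

```c
/* Duration of one core clock cycle in nanoseconds. */
static double cycle_time_ns(double core_hz)
{
    return 1e9 / core_hz;
}

/* How many core clock cycles fit into a given time span. */
static double ns_to_cycles(double ns, double core_hz)
{
    return ns * core_hz / 1e9;
}

/* Time for one flash access, assuming wait_states + 1 cycles per access. */
static double flash_access_ns(unsigned wait_states, double core_hz)
{
    return (double)(wait_states + 1u) * cycle_time_ns(core_hz);
}
```

At 168 MHz one cycle is about 5.95 ns, so 35-42 ns corresponds to roughly 5.9 to 7.1 cycles, and 5 wait states give about 35.7 ns per access under the stated assumption. A single uncached flash access therefore costs about as much as the extra cycles observed in the measurement.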

Posted on August 21, 2012 at 21:06

Pipelining is the mechanism where multiple instructions are in-flight at a given time, at various stages of execution.

You break down the execution into multiple stages, each of which can complete within the single clock tick. Minimizing the work at each stage allows the clock rate to be increased.

The apparent THROUGHPUT of a RISC processor may appear as a single cycle, i.e. the rate at which instructions are taken up. The LATENCY can be much longer, i.e. the time it takes to actually arrive at the answer.

Bubbles or hazards can occur in the pipeline, if the answer from the prior instruction is not ready when the current instruction needs it. If these are handled in hardware the processor performs an empty cycle, or has methods to forward data between stages in an expedited fashion. Clever optimizing compilers will try to avoid dependency chains. Dumb silicon relies on compilers to remove dependency hazards (Itanium/VLIW)

Multiplies, and more critically Divisions, can take a significant number of cycles to complete. Loads and Stores will also consume multiple cycles, and will depend on the bus(es) the transaction occurs across, and if write buffer(s) are available/busy.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337e/CACCIFED.html
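The difference between throughput and latency described above can be illustrated with a deliberately simplified model: one instruction can issue per cycle, and each result becomes available `latency` cycles after issue. This is an idealized sketch with hypothetical function names, not the actual Cortex-M4 timing.

```c
/* n independent instructions: a new one issues every cycle,
 * and the last result arrives latency cycles after its issue. */
static unsigned cycles_independent(unsigned n, unsigned latency)
{
    if (n == 0u) {
        return 0u;
    }
    return n + latency - 1u;
}

/* n instructions in a dependency chain: each one must wait
 * for the full latency of its predecessor. */
static unsigned cycles_dependent(unsigned n, unsigned latency)
{
    return n * latency;
}
```

With a latency of 3, four independent instructions finish in 6 cycles while four chained ones need 12. Timing a single multiply followed directly by a store of its result measures something closer to the dependent case.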

frankmeyer9
Associate II
Posted on August 22, 2012 at 06:58

Oh, that's a really nice idea. I'm running at 168 MHz, so I have 5 wait states for the flash.

But unfortunately I have no idea whether I'm running from flash or RAM.

I used the standard settings from the CrossWorks IDE. When I compile the project, 4.8 kB of flash and 0.2 kB of RAM are used. I suppose it runs from flash?

 

You are definitely running from Flash.

Unless you make a special effort to move routines into RAM, you always run from Flash.

I'm running at 168 MHz, so I have 5 wait states for the flash.

That refers to the internal interface to the Flash. These wait states are partially masked by the ART accelerator, so code execution is faster.

florianaugustin9
Associate II
Posted on August 22, 2012 at 14:15

Thank you for that information!

I tried to run from RAM (Section Placement: Flash copy to RAM). I hope the code will be copied from flash to RAM at every power-on.

It compiles, but when I start debugging I get "unknow function at 0x20000042".

I selected the flash_to_ram_placement.xml document as the section placement file.

So I don't know what to do :D

Perhaps someone has an idea?