2012-08-21 03:13 AM
Dear STM32 experts,
I want to use the FPU of my STM32F4Discovery, so I set the following compiler options:
- Arm Architecture: v7EM
- Arm Core Type: Cortex-M4
- Arm FPU Type: FPv4-SP-D16
- GCC Target: arm-unknown-eabi
After this, CP10 and CP11 are both 0b11, which should be good. But I found some test code on the net (http://www.mikrocontroller.net/topic/261021):
#define CORE_SysTickEn()  (*((u32*)0xE0001000)) = 0x40000001
#define CORE_SysTickDis() (*((u32*)0xE0001000)) = 0x40000000
#define CORE_GetSysTick() (*((u32*)0xE0001004))
float f = 1.01f;
CORE_SysTickEn();
vu32 it = CORE_GetSysTick();
float f2 = f * 2.29f;
vu32 it2 = CORE_GetSysTick() - it;
He needs 11 cycles for this, but 6 of them are for the clock calculation, so he needs 5 cycles for the multiplication and assignment.
But I need 18 cycles in total, which means 12 for the multiplication and assignment :(
He also uses the STM32F4Discovery.
Any ideas what could be the reason for that? 12 cycles are definitely too many for this multiplication and assignment...
My disassembly for this line is:
float f2 = f * 2.29f;
ED977A03 vldr s14, [r7, #12]
EDDF7A0B vldr s15, 0x080003F0 <__text_start__+0x5C>
EE677A27 vmul.f32 s15, s14, s15
EDC77A02 vstr s15, [r7, #8]
Thank you for your responses!
Florian
#stm32f4discovery-fpu-cp10-cp11
2012-08-21 04:05 AM
But I need 18 cycles total which means 12 for the multiplication and assignment :( :(
How do you measure the cycles?
2012-08-21 04:25 AM
With these three lines:
vu32 it = CORE_GetSysTick();
float f2 = f * 2.29f;
vu32 it2 = CORE_GetSysTick() - it;
Then I set a breakpoint after the last line and check the value of it2. That is the cycle count!
2012-08-21 04:42 AM
Without looking up this function, I suggest subtracting the difference measured for an empty call:
vu32 it = CORE_GetSysTick();
vu32 it2 = CORE_GetSysTick();
vu32 diff = it2 - it;
i.e. subtract diff from the result you get.
2012-08-21 05:50 AM
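The calibration suggested in the 04:42 AM reply (time two back-to-back counter reads, then subtract that figure from the real measurement) could be sketched in C roughly as follows. This is a minimal sketch under assumptions, not code from the thread: cycle_reader_t, measure_overhead, net_cycles and the host-side fake_read_cycles are hypothetical names. On the target, the reader would simply return the value at 0xE0001004, which on Cortex-M3/M4 is the DWT cycle counter (DWT_CYCCNT) rather than SysTick, despite the macro names used above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reader type: returns the current free-running cycle count. */
typedef uint32_t (*cycle_reader_t)(void);

/* Cost of two back-to-back counter reads with nothing in between. */
static uint32_t measure_overhead(cycle_reader_t read_cycles)
{
    assert(read_cycles != 0);
    uint32_t start = read_cycles();
    uint32_t stop  = read_cycles();
    return stop - start;   /* unsigned subtraction stays correct across wrap-around */
}

/* Remove the calibration cost from a raw measurement, clamping at zero. */
static uint32_t net_cycles(uint32_t raw, uint32_t overhead)
{
    return (raw > overhead) ? (raw - overhead) : 0u;
}

/* Host-side stand-in (assumption) so the sketch runs off-target:
 * every read advances the simulated counter by 6 counts. */
static uint32_t fake_counter;
static uint32_t fake_read_cycles(void)
{
    fake_counter += 6u;
    return fake_counter;
}
```

With the numbers from the first post, a raw reading of 18 and a measured overhead of 6 give net_cycles(18, 6) == 12, which is the figure the original poster arrived at.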
Executing from RAM or FLASH?
Figure a random read from FLASH is on the order of 35-42 ns. Consider pipelining as well: it is better to time multiple linear instantiations and compute THROUGHPUT.
2012-08-21 11:28 AM
Oh, that's a really nice idea. I'm running at 168 MHz, so I have 5 wait states for the flash.
But unfortunately I have no idea whether I'm running from flash or RAM. I used the standard settings of the CrossWorks IDE. When I compile the project, 4.8 kB of flash and 0.2 kB of RAM are used. I suppose it runs from flash?
And I would be really glad if you could explain the pipeline thing to me. I basically know what a pipeline is from university, but I didn't get it.
Thank you for your help!
2012-08-21 12:06 PM
Pipelining is the mechanism where multiple instructions are in-flight at a given time, at various stages of execution.
You break down the execution into multiple stages, each of which can complete within a single clock tick. Minimizing the work at each stage allows the clock rate to be increased.
The apparent THROUGHPUT of a RISC processor may appear as a single cycle, i.e. the rate at which instructions are taken up. The LATENCY can be much longer, i.e. the time it takes to actually arrive at the answer.
Bubbles or hazards can occur in the pipeline if the answer from the prior instruction is not ready when the current instruction needs it. If these are handled in hardware, the processor performs an empty cycle, or has methods to forward data between stages in an expedited fashion. Clever optimizing compilers will try to avoid dependency chains. Dumb silicon relies on compilers to remove dependency hazards (Itanium/VLIW).
Multiplies, and more critically divisions, can take a significant number of cycles to complete. Loads and stores will also consume multiple cycles, depending on the bus(es) the transaction occurs across and on whether write buffer(s) are available or busy.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337e/CACCIFED.html
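The difference between throughput and latency described above can be illustrated with a deliberately idealized model: one instruction issues per cycle, and each result becomes available a fixed number of cycles after issue. The function names and the latency value used in the note below are assumptions for illustration only, not Cortex-M4 figures.

```c
#include <assert.h>
#include <stdint.h>

/* Idealized pipeline model (assumption): one instruction can issue per
 * cycle, and a result is ready `latency` cycles after its instruction issues. */

/* n independent operations overlap in the pipeline, so only the
 * latency of the last one is visible on top of the issue slots. */
static uint32_t cycles_independent(uint32_t n, uint32_t latency)
{
    assert(n > 0u && latency > 0u);
    return latency + (n - 1u);
}

/* n operations where each one needs the previous result cannot overlap:
 * every operation waits out the full latency of its predecessor. */
static uint32_t cycles_dependent_chain(uint32_t n, uint32_t latency)
{
    assert(n > 0u && latency > 0u);
    return n * latency;
}
```

For 8 operations with an assumed latency of 3, the model gives 10 cycles when the operations are independent and 24 when they form a dependency chain. Timing a single isolated multiply therefore mostly shows latency plus measurement overhead, not throughput.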
2012-08-21 09:58 PM
Oh, that's a really nice idea. I'm running at 168 MHz, so I have 5 wait states for the flash.
But unfortunately I have no idea whether I'm running from flash or RAM.
I used the standard settings of the CrossWorks IDE. When I compile the project, 4.8 kB of flash and 0.2 kB of RAM are used. I suppose it runs from flash?
You are definitely running from Flash. Unless you make a special effort to move routines into RAM, you always run from Flash.
I'm running at 168 MHz, so I have 5 wait states for the flash.
That refers to the internal interface to the Flash. These wait states are partially masked by the ART, so code execution is faster.
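As a rough cross-check of the numbers in this thread, the cost of one flash read that is not hidden by the ART can be estimated as (wait states + 1) CPU clock periods. The sketch below assumes exactly that relation and ignores prefetching and caching entirely; flash_access_ps is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated time for one unmasked flash read, in picoseconds,
 * assuming it costs (wait_states + 1) CPU clock periods. */
static uint32_t flash_access_ps(uint32_t cpu_mhz, uint32_t wait_states)
{
    assert(cpu_mhz > 0u);
    /* One clock period is 1,000,000 ps / MHz; multiply before dividing
     * to avoid losing precision in integer arithmetic. */
    return (uint32_t)(((uint64_t)(wait_states + 1u) * 1000000u) / cpu_mhz);
}
```

At 168 MHz with 5 wait states this gives 35714 ps, about 35.7 ns, which is in line with the 35-42 ns figure quoted earlier for a random read from flash.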
2012-08-22 05:15 AM
Thank you for that information!
I tried to run from RAM (Section Placement: Flash copy to RAM). I hope the code will be copied from flash to RAM at every power-on.
It compiles, but when I start debugging I get ''unknown function at 0x20000042''.
I selected the flash_to_ram_placement.xml document as the section placement file.
So I don't know what to do :D
Perhaps someone has an idea?