2012-09-10 09:49 AM
Dear STM-experts,
I wanted to use the FPU of my STM32F4 (Cortex-M4). To see if it's working properly, I compared with this page:
He is using exactly the same processor and toolchain (with the GCC compiler).
Here is how long it takes with my settings:
                   // Reference   // My controller running from Flash   // My controller running from SRAM
long lX, lY, lZ;
lX = 123L;         // 2 cycles    // 2 cycles                           // 5 cycles
lY = 456L;         // 2 cycles    // 3 cycles                           // 3 cycles
lZ = lX*lY;        // 5 cycles    // 7 cycles                           // 9 cycles
fX = 123.456;      // 3 cycles    // 5 cycles                           // 4 cycles
fY = 9.99;         // 3 cycles    // 5 cycles                           // 4 cycles
fZ = fX * fY;      // 6 cycles    // 10 cycles                          // 10 cycles
fZ = sqrt(fY);     // 20 cycles   // 2742 cycles                        // 3405 cycles
fZ = sin(1.23);    // 124 cycles  // 1918 cycles                        // 2552 cycles
The settings are:
Arm architecture: v7EM
Arm core type: Cortex-M4
Arm FP ABI type: Soft-FP (or Hard, doesn't make a huge difference)
Arm FPU type: FPv4-SP-D16
GCC target: arm-unknown-eabi
So not only is the floating-point arithmetic running slower, but the integer arithmetic as well! And sin and sqrt are horrible!!
The offset of my cycle measurement is deducted.
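For readers following along, deducting the measurement offset could look roughly like the sketch below. It assumes a free-running 32-bit cycle counter (on a Cortex-M4 this would typically be DWT->CYCCNT, which is an assumption here, since the post does not say how the cycles were read). The helper names elapsed_cycles and net_cycles are made up for illustration.

```c
#include <stdint.h>

/* Elapsed cycles between two readings of a free-running 32-bit counter.
   Unsigned subtraction stays correct across a single counter wrap-around. */
uint32_t elapsed_cycles(uint32_t start, uint32_t stop)
{
    return stop - start;
}

/* Net cost of the measured code: raw reading minus the offset obtained by
   measuring an empty region. Clamped at 0 so jitter cannot underflow. */
uint32_t net_cycles(uint32_t raw, uint32_t offset)
{
    return (raw > offset) ? (raw - offset) : 0u;
}

/* On the target (assumption, CMSIS naming):
     uint32_t t0 = DWT->CYCCNT;
     ... code under test ...
     uint32_t t1 = DWT->CYCCNT;
     uint32_t cost = net_cycles(elapsed_cycles(t0, t1), offset);            */
```

The offset itself would be obtained the same way, with nothing between the two counter reads.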
CP10 and CP11 in the CPACR register are both set to 0b11, so the FPU should be enabled properly.
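As a cross-check of that setting, here is a small sketch of the bit arithmetic involved. The CP10 and CP11 access fields sit in bits 20 to 23 of CPACR (address 0xE000ED88 on the Cortex-M4), and 0b11 in both fields means full access. The function and macro names are hypothetical, and the functions work on a plain value so the logic can also be checked on a host.

```c
#include <stdint.h>

/* CP10 = bits 21:20, CP11 = bits 23:22; 0b11 in each field is "full access". */
#define CPACR_CP10_CP11_FULL (0xFu << 20)

/* Returns the CPACR value with both coprocessor fields set to 0b11. */
uint32_t cpacr_with_fpu_enabled(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_CP11_FULL;
}

/* Returns 1 only if both CP10 and CP11 read back as 0b11. */
int cpacr_fpu_fully_enabled(uint32_t cpacr)
{
    return (cpacr & CPACR_CP10_CP11_FULL) == CPACR_CP10_CP11_FULL;
}

/* On the target (assumption, CMSIS naming):
     SCB->CPACR = cpacr_with_fpu_enabled(SCB->CPACR);
     __DSB(); __ISB();                                  */
```

Note that an enabled FPU only means the instructions are allowed to execute. Whether the compiler and the math library actually emit them is a separate question.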
Do you have any idea what is wrong with my settings or my toolchain or whatever??
Thank you so much for your efforts!
Florian
2012-09-10 11:01 AM
How about sqrtf() and sinf(), to ensure you're working with floats?
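To illustrate why the suffix matters: in C, sqrt() and sin() take and return double, and an unsuffixed literal such as 9.99 is a double as well. The FPv4-SP-D16 unit is single precision only, so double math has to be done in software, which would be consistent with the four-digit cycle counts. The sketch below only reports result types via sizeof; the function names are hypothetical.

```c
#include <math.h>
#include <stddef.h>

/* sizeof does not evaluate its operand, so nothing is computed here;
   each function only reports the size of the expression's result type. */
size_t size_of_sqrt_result(float x)        { return sizeof(sqrt(x));   } /* double routine   */
size_t size_of_sqrtf_result(float x)       { return sizeof(sqrtf(x));  } /* float routine    */
size_t size_of_mul_double_literal(float x) { return sizeof(x * 9.99);  } /* promoted: double */
size_t size_of_mul_float_literal(float x)  { return sizeof(x * 9.99f); } /* stays float      */
```

So in the listing above, fZ = sqrt(fY) converts fY to double, calls the double routine and converts back, while sqrtf(fY) stays in float throughout.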
Here with Keil, timing 1000 iterations, and subtracting null loop time.
1765.1 cycles sqrt
4608.1 cycles sin
  42.0 cycles sqrtf
  96.1 cycles sinf
Without FPU
1746.1 cycles sqrt
4251.2 cycles sin
 358.0 cycles sqrtf
 924.3 cycles sinf
2012-09-10 11:39 AM
2012-09-10 01:08 PM
I think if you're timing down at the cycle level you need to look at the code generated, and ideally use assembler. The compiler/code-generator tends to muddy the water, and optimization might cloud it further. For example, at what point does a local/automatic get allocated on the stack vs. held in a register?
The RAM speeds seem to imply a 1 cycle penalty per instruction (25-33% slower). This might come from how the memories are attached to the core and which of them are optimized for instruction fetch, or from interaction with the pipelining. I don't know enough about the core, or ST's implementation options.
FLASH
   1.0 cycles lX = 123
   2.0 cycles lY = 456
   2.0 cycles lX * lY, 56088
   2.0 cycles fX = 123.456001
   3.0 cycles fY = 9.990000
   3.0 cycles fX * fY, 1233.325439
1765.1 cycles sqrt
4575.1 cycles sin
  42.0 cycles sqrtf
  97.1 cycles sinf
SRAM (0x20000000)
   0.0 cycles lX = 123 (constant)
   2.0 cycles lY = 456 (literal)
   2.0 cycles lX * lY, 56088
   2.0 cycles fX = 123.456001
   2.0 cycles fY = 9.990000
   4.0 cycles fX * fY, 1233.325439
2960.0 cycles sqrt
5907.0 cycles sin
  54.0 cycles sqrtf
 130.0 cycles sinf
2012-09-10 11:01 PM
How about your optimization settings?
The difference between -O0 and -O3 is often significant. As an example, the runtime of the DSP_Lib fft routine (float32) dropped for me from 10ms to 3ms with full optimization (-O0 vs. -O3).
2012-09-11 10:41 AM
2012-09-11 11:48 AM
If these STOPWATCH_XXX macros are your measurements, you need to make at least the cyc[] variable volatile, too. In general, the volatile keyword is the weapon of choice to keep the compiler from removing things. A compiler usually has no concept of I/O space and peripherals - everything is just memory. If the compiler only sees a write access and no read, or a read access and no write, it will ruthlessly optimize it away. With volatile, you tell it that hardware or external code interferes, and it must not remove or reorder your accesses to that variable.
2012-09-11 12:49 PM
Thank you for that helpful tip!
When the optimization level is set to 3, I save 1 cycle with the integer operations:
lX = 123L;      // 2 cycles
lY = 456L;      // 2 cycles
lZ = lX*lY;     // 5 cycles
--> Now 12 compared to the 9 from the website.
But I lose 4 cycles compared to optimization level 0 with the float operations:
fX = 123.456;   // 3 cycles
fY = 9.99;      // 3 cycles
fZ = fX * fY;   // 6 cycles
--> Now 18/22 compared to the 12 from the website.
This is sooo confusing!!! Heieieieiei.... :D
2012-09-11 11:43 PM
Perhaps they used another compiler. One piece of code does not necessarily translate into the same sequence of machine instructions with different compilers, even with similar settings.
You can compare the asm/machine instructions, if you deem this important. It's not a thing I am concerned about, though ...