Why is my Cortex-M4 taking too many cycles?

florianaugustin9 · ‎2012-09-10

Posted on September 10, 2012 at 18:49

Dear STM-experts,

I wanted to use the FPU of my STM32F4 (Cortex-M4). To see whether it is working properly, I compared my timings with this page:

http://www.micromouseonline.com/2011/10/26/stm32f4-the-first-taste-of-speed/?doing_wp_cron=1347294891.0981290340423583984375

The author is using exactly the same processor and toolchain (with the GCC compiler).

Here is how long it takes with my settings:

Columns: reference // my controller running from flash // my controller running from SRAM

long lX, lY, lZ;
lX = 123L;       // 2 cycles   // 2 cycles    // 5 cycles
lY = 456L;       // 2 cycles   // 3 cycles    // 3 cycles
lZ = lX*lY;      // 5 cycles   // 7 cycles    // 9 cycles
fX = 123.456;    // 3 cycles   // 5 cycles    // 4 cycles
fY = 9.99;       // 3 cycles   // 5 cycles    // 4 cycles
fZ = fX * fY;    // 6 cycles   // 10 cycles   // 10 cycles
fZ = sqrt(fY);   // 20 cycles  // 2742 cycles // 3405 cycles
fZ = sin(1.23);  // 124 cycles // 1918 cycles // 2552 cycles

The settings are:

Arm architecture: v7EM

Arm core type: Cortex-M4

Arm FP ABI type: Soft-FP (or Hard, it doesn't make a huge difference)

Arm FPU type: FPv4-SP-D16

GCC target: arm-unknown-eabi

So not only is the floating-point arithmetic running slower, but the integer arithmetic is too! And sin and sqrt are horrible!!

The overhead of the cycle measurement itself has already been subtracted.

CP10 and CP11 are both set to 0b11, so the FPU should be enabled properly.

Do you have any idea what is wrong with my settings or my toolchain or whatever??

Thank you so much for your efforts!

Florian
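For anyone who wants to repeat the CP10/CP11 check mentioned above: on the Cortex-M4 those two fields sit in bits 20 to 23 of the Coprocessor Access Control Register (CPACR, address 0xE000ED88), and 0b11 in each field means full access. Below is a minimal sketch of that check. The helper name fpu_access_enabled is made up for illustration, and the register value is passed in as a parameter so the snippet can also be tried on a PC.

```c
#include <stdbool.h>
#include <stdint.h>

/* CP10 occupies CPACR bits 21:20 and CP11 bits 23:22; 0b11 = full access. */
#define CPACR_CP10_FULL (UINT32_C(3) << 20)
#define CPACR_CP11_FULL (UINT32_C(3) << 22)
#define CPACR_FPU_FULL  (CPACR_CP10_FULL | CPACR_CP11_FULL)

/* Hypothetical helper: true when both fields read 0b11.
 * On target, pass *(volatile uint32_t *)0xE000ED88 as the argument. */
static bool fpu_access_enabled(uint32_t cpacr)
{
    return (cpacr & CPACR_FPU_FULL) == CPACR_FPU_FULL;
}
```

Note that this only tells you the FPU is accessible. Whether the compiler actually emits FPU instructions depends on the build settings and on the types used in the code.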

Tesla DeLorean · ‎2012-09-10

Posted on September 10, 2012 at 20:01

How about sqrtf() and sinf(), to ensure you're working with floats?
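Why this matters: the FPv4-SP-D16 unit is single precision only, so anything computed in double ends up in software routines. In C, sqrt() and sin() take and return double, and an unsuffixed constant such as 1.23 is a double as well. The small sketch below makes the types visible. sizeof does not evaluate its operand, so nothing is actually computed. The function names are invented for this example.

```c
#include <math.h>
#include <stddef.h>

/* Each function reports the size of the expression's result type.
 * sizeof leaves the call unevaluated, so no math routine runs here. */
static size_t size_of_sqrt_result(float f)  { return sizeof(sqrt(f));  } /* double */
static size_t size_of_sqrtf_result(float f) { return sizeof(sqrtf(f)); } /* float  */
static size_t size_of_plain_literal(void)   { return sizeof(1.23);     } /* double */
static size_t size_of_f_literal(void)       { return sizeof(1.23f);    } /* float  */
```

So fZ = sqrt(fY) converts fY to double, calls the double routine, and converts the result back to float, whereas sqrtf(fY) and sinf(1.23f) stay in single precision throughout.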

Here with Keil, timing 1000 iterations, and subtracting null loop time.

1765.1 cycles sqrt

4608.1 cycles sin

42.0 cycles sqrtf

96.1 cycles sinf

Without FPU

1746.1 cycles sqrt

4251.2 cycles sin

358.0 cycles sqrtf

924.3 cycles sinf
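The measurement method described above (time many iterations, then subtract the time of an empty loop) could look roughly like the sketch below. This is a guess at the approach, not the actual Keil test code. All names are hypothetical, and a fake counter stands in for the hardware so the snippet runs on a PC. On target, the reader function would return DWT_CYCCNT instead.

```c
#include <stdint.h>

typedef uint32_t (*cycle_reader_t)(void);  /* returns the current cycle count */
typedef void (*work_fn_t)(void);           /* code under test */

static void null_work(void) { }

/* Cycles spent in `iterations` calls of fn, including loop and call overhead. */
static uint32_t time_loop(cycle_reader_t now, work_fn_t fn, uint32_t iterations)
{
    uint32_t start = now();
    for (uint32_t i = 0; i < iterations; ++i) {
        fn();
    }
    return now() - start;  /* unsigned subtraction survives one counter wrap */
}

/* Average cycles per call of fn, with the empty-loop time subtracted. */
static double average_cycles(cycle_reader_t now, work_fn_t fn, uint32_t iterations)
{
    uint32_t with_work = time_loop(now, fn, iterations);
    uint32_t overhead  = time_loop(now, null_work, iterations);
    return (double)(with_work - overhead) / (double)iterations;
}

/* Host stand-in (assumption): a fake counter instead of DWT_CYCCNT. */
static uint32_t fake_cycles;
static uint32_t fake_now(void) { return fake_cycles; }
static void fake_work(void)    { fake_cycles += 42u; }  /* pretend 42 cycles */
```

Averaging over many iterations is also what produces fractional results such as 1765.1 cycles. On real hardware with optimization enabled, the compiler may shrink or remove the empty loop, so it is worth checking the generated code for both loops.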


florianaugustin9 · ‎2012-09-10

Posted on September 10, 2012 at 20:39

Hey :)

Here are my results:

sinf(1.23); // Flash: 217 cycles // SRAM: 170 cycles

fZ = sqrt(fY); // Flash: 29 cycles // SRAM: 33 cycles

Somehow this is a bit confusing....

I don't understand how integer arithmetic can take a different number of CPU cycles with the same compiler and the same CPU...

Thank you for replies!

Tesla DeLorean · ‎2012-09-10

Posted on September 10, 2012 at 22:08

I think if you're timing down at the cycle level, you need to look at the code generated, and ideally use assembler. The compiler/code generator tends to muddy the water, and optimization might cloud it further. For example, at what point does a local/automatic variable get allocated on the stack rather than held in a register?

The RAM speeds seem to imply a 1 cycle penalty per instruction (25-33% slower). This might come from how the memories are attached to the core, which of them are optimized for instruction fetch, and how that interacts with the pipeline. I don't know enough about the core, or ST's implementation options.

FLASH

1.0 cycles lX = 123

2.0 cycles lY = 456

2.0 cycles lX * lY, 56088

2.0 cycles fX = 123.456001

3.0 cycles fY = 9.990000

3.0 cycles fX * fY, 1233.325439

1765.1 cycles sqrt

4575.1 cycles sin

42.0 cycles sqrtf

97.1 cycles sinf

SRAM (0x20000000)

0.0 cycles lX = 123 (constant)

2.0 cycles lY = 456 (literal)

2.0 cycles lX * lY, 56088

2.0 cycles fX = 123.456001

2.0 cycles fY = 9.990000

4.0 cycles fX * fY, 1233.325439

2960.0 cycles sqrt

5907.0 cycles sin

54.0 cycles sqrtf

130.0 cycles sinf


frankmeyer9 · ‎2012-09-10

Posted on September 11, 2012 at 08:01

How about your optimization settings?

The difference between -O0 and -O3 is often significant.

As an example, the runtime of the DSP_Lib fft routine (float32) dropped for me from 10ms to 3ms with full optimization (-O0 vs. -O3).

florianaugustin9 · ‎2012-09-11

Posted on September 11, 2012 at 19:41

The optimization level is zero. I would like to try another value, but then the compiler optimizes my measurement away. My measurement routine:

int cyc[2],offset;

float x;

volatile unsigned int *DWT_CYCCNT = (volatile unsigned int *)0xE0001004; //address of the register

volatile unsigned int *DWT_CONTROL = (volatile unsigned int *)0xE0001000; //address of the register

volatile unsigned int *SCB_DEMCR = (volatile unsigned int *)0xE000EDFC; //address of the register

#define STOPWATCH_START { cyc[0] = *DWT_CYCCNT;}

#define STOPWATCH_STOP { cyc[1] = *DWT_CYCCNT; cyc[1] = cyc[1] - cyc[0]-offset; }

STOPWATCH_START

__asm volatile("nop");

cyc[1] = *DWT_CYCCNT; cyc[1] = cyc[1] - cyc[0];

offset = cyc[1] - 1;

STOPWATCH_START

lX = 123L; // 2 cycle

lY = 456L; // 2 cycle

lZ = lX*lY; // 5 cycles

STOPWATCH_STOP

After this, cyc[1] holds the cycle count with the offset already subtracted.

Any idea how I can get this working with an optimization level higher than zero??

Thank you!
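The fragment above declares DWT_CONTROL and SCB_DEMCR but does not show how they are used. For completeness, here is a sketch of the usual sequence for switching the cycle counter on: set TRCENA (bit 24) in DEMCR, clear CYCCNT, then set CYCCNTENA (bit 0) in DWT_CTRL. The struct and function names are hypothetical, and the registers are passed in by pointer purely so the same code can be exercised against ordinary variables on a PC.

```c
#include <stdint.h>

#define DEMCR_TRCENA       (UINT32_C(1) << 24)  /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA (UINT32_C(1) << 0)   /* starts the cycle counter */

/* Hypothetical register bundle; addresses in the comments are for the target. */
typedef struct {
    volatile uint32_t *demcr;     /* 0xE000EDFC */
    volatile uint32_t *dwt_ctrl;  /* 0xE0001000 */
    volatile uint32_t *cyccnt;    /* 0xE0001004 */
} dwt_regs_t;

static void cycle_counter_enable(const dwt_regs_t *r)
{
    *r->demcr    |= DEMCR_TRCENA;        /* trace enable has to come first */
    *r->cyccnt    = 0u;                  /* start counting from zero */
    *r->dwt_ctrl |= DWT_CTRL_CYCCNTENA;  /* other control bits are preserved */
}
```

On target, the three pointers would simply be the addresses already used in the post.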

frankmeyer9 · ‎2012-09-11

Posted on September 11, 2012 at 20:48

If these STOPWATCH_XXX macros are your measurements, you need to make at least the cyc[] variable volatile, too.

In general, the volatile keyword is the weapon of choice to keep the compiler from removing things.

A compiler usually has no concept of I/O space and peripherals; to it, everything is just memory. If the compiler sees only a write access and no read, or a read access and no write, it will ruthlessly optimize it away. With volatile, you tell it that hardware or external code interferes, so it must not remove those accesses or reorder them relative to other volatile accesses.
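As a rough illustration of that advice, the stopwatch from the earlier post could be written with volatile storage along these lines. This is only a sketch: stopwatch_t and its functions are hypothetical names, the counter is reached through a pointer so the code can be tried on a PC, and unsigned arithmetic is used (instead of the int array in the original) so the subtraction still works when the counter wraps.

```c
#include <stdint.h>

typedef struct {
    volatile uint32_t *cyccnt;   /* DWT_CYCCNT, 0xE0001004 on target */
    volatile uint32_t  start;    /* counter value at start */
    volatile uint32_t  elapsed;  /* result of the last measurement */
} stopwatch_t;

static inline void stopwatch_start(stopwatch_t *sw)
{
    sw->start = *sw->cyccnt;
}

/* offset: measured cost of an empty start/stop pair, as in the original macros */
static inline void stopwatch_stop(stopwatch_t *sw, uint32_t offset)
{
    sw->elapsed = *sw->cyccnt - sw->start - offset;
}
```

The variables inside the measured region (lX, lY, lZ and so on) need the same care. If they are not volatile or otherwise observable, the compiler may fold 123 * 456 into the constant 56088 at -O3 and leave nothing to measure.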

florianaugustin9 · ‎2012-09-11

Posted on September 11, 2012 at 21:49

Thank you for that helpful tip!

When the optimization level is set to 3, I save 1 cycle on the integer operations:

lX = 123L; // 2 cycle

lY = 456L; // 2 cycle

lZ = lX*lY; // 5 cycles

--> Now 12 compared to the 9 from the website

But I lose 4 cycles compared to optimization level 0 on the float operations:

fX = 123.456; // 3 cycles

fY = 9.99; // 3 cycles

fZ = fX * fY; // 6 cycles

--> Now 18/22 compared to the 12 from the website.

This is sooo confusing!!! Heieieieiei.... :D

frankmeyer9 · ‎2012-09-11

Posted on September 12, 2012 at 08:43

Perhaps they used another compiler. One piece of code does not necessarily translate into the same sequence of machine instructions with different compilers, even with similar settings.

You can compare the asm/machine instructions, if you deem this important.

It's not a thing I am concerned about, though ...