Why is my Cortex-M4 taking too many cycles?

florianaugustin9
Associate II
Posted on September 10, 2012 at 18:49

Dear STM-experts,

I wanted to use the FPU of my STM32F4 (Cortex-M4). To see if it's working properly, I compared my results with this page:

http://www.micromouseonline.com/2011/10/26/stm32f4-the-first-taste-of-speed/?doing_wp_cron=1347294891.0981290340423583984375

He is using exactly the same processor and toolchain (with the GCC compiler).

Here is how long it takes with my settings:

long lX, lY, lZ;

Statement          Reference    My controller (Flash)    My controller (SRAM)
lX = 123L;         2 cycles     2 cycles                 5 cycles
lY = 456L;         2 cycles     3 cycles                 3 cycles
lZ = lX*lY;        5 cycles     7 cycles                 9 cycles
fX = 123.456;      3 cycles     5 cycles                 4 cycles
fY = 9.99;         3 cycles     5 cycles                 4 cycles
fZ = fX * fY;      6 cycles     10 cycles                10 cycles
fZ = sqrt(fY);     20 cycles    2742 cycles              3405 cycles
fZ = sin(1.23);    124 cycles   1918 cycles              2552 cycles

The settings are:

   Arm architecture: v7EM
   Arm core type: Cortex-M4
   Arm FP ABI type: Soft-FP (or Hard, doesn't make a huge difference)
   Arm FPU type: FPv4-SP-D16
   GCC target: arm-unknown-eabi

So not only is the floating-point arithmetic running slower, but the integer arithmetic as well! And sin and sqrt are horrible!!

The offset of my cycle measurement is deducted.

CP10 and CP11 are both set to 0b11, so the FPU should be enabled properly.
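For reference, the CP10 and CP11 fields mentioned above sit in bits 20-23 of the Cortex-M4 Coprocessor Access Control Register (CPACR, at 0xE000ED88). A minimal sketch of the check, assuming the register value has already been read into a plain integer; the helper names are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* CPACR field positions on the Cortex-M4: CP10 = bits [21:20], CP11 = bits [23:22]. */
#define CPACR_CP10_SHIFT  20u
#define CPACR_CP11_SHIFT  22u
#define CPACR_FIELD_MASK  0x3u  /* each field is two bits wide */
#define CPACR_FULL_ACCESS 0x3u  /* 0b11 = full access */

/* Hypothetical helper: returns the value with both FPU fields set to 0b11. */
static uint32_t cpacr_with_fpu_enabled(uint32_t cpacr)
{
    return cpacr | (CPACR_FULL_ACCESS << CPACR_CP10_SHIFT)
                 | (CPACR_FULL_ACCESS << CPACR_CP11_SHIFT);
}

/* Hypothetical helper: true only if CP10 and CP11 both read back as 0b11. */
static bool cpacr_fpu_has_full_access(uint32_t cpacr)
{
    uint32_t cp10 = (cpacr >> CPACR_CP10_SHIFT) & CPACR_FIELD_MASK;
    uint32_t cp11 = (cpacr >> CPACR_CP11_SHIFT) & CPACR_FIELD_MASK;
    return cp10 == CPACR_FULL_ACCESS && cp11 == CPACR_FULL_ACCESS;
}
```

On the target, the value would come from reading the register at 0xE000ED88. An enabled FPU is only half of the story, though: the compiler also has to be told to emit FPU instructions.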

Do you have any idea what is wrong with my settings or my toolchain or whatever??

Thank you so much for your efforts!

Florian

8 REPLIES 8
Posted on September 10, 2012 at 20:01

How about sqrtf() and sinf(), to ensure you're working with floats?
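The reason this matters: with plain math.h, sqrt() and sin() take and return double, so a float argument is promoted and the work is done in double precision, which a single-precision FPU such as the FPv4-SP cannot do in hardware. A small sketch that only inspects the result types (sizeof does not evaluate its operand); the function names are made up for illustration:

```c
#include <math.h>
#include <stddef.h>

/* sizeof never evaluates its operand, so no math routine is actually called here. */

/* Size of the result of sqrt() applied to a float: the argument is promoted to double. */
static size_t result_size_sqrt(void)
{
    float fY = 9.99f;
    return sizeof(sqrt(fY));
}

/* Size of the result of sqrtf(): the computation stays in single precision. */
static size_t result_size_sqrtf(void)
{
    float fY = 9.99f;
    return sizeof(sqrtf(fY));
}

/* Unsuffixed literals such as 1.23 are double; 1.23f is float. */
static size_t literal_size_double(void) { return sizeof(1.23); }
static size_t literal_size_float(void)  { return sizeof(1.23f); }
```

The same applies to the constants in the original test: fX = 123.456 assigns a double literal to a float, while 123.456f would keep everything in single precision.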

Here with Keil, timing 1000 iterations, and subtracting null loop time.

1765.1 cycles sqrt                                                              

4608.1 cycles sin                                                               

  42.0 cycles sqrtf                                                             

  96.1 cycles sinf                                                              

                                                                                

Without FPU

1746.1 cycles sqrt                                                              

4251.2 cycles sin                                                               

 358.0 cycles sqrtf                                                             

 924.3 cycles sinf                                                              
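The harness behind these numbers is not shown. One way such per-call figures could be derived, assuming the total cycle counts of a timed loop and of an otherwise identical empty loop have already been captured (the function and parameter names are hypothetical):

```c
#include <stdint.h>

/*
 * Average cycles per call: take the cycles of the loop containing the call,
 * subtract the cycles of the same loop with an empty body, then divide by
 * the number of iterations.
 */
static double cycles_per_call(uint32_t loop_cycles, uint32_t null_loop_cycles,
                              uint32_t iterations)
{
    if (iterations == 0u) {
        return 0.0; /* nothing was measured */
    }
    return (double)(uint32_t)(loop_cycles - null_loop_cycles) / (double)iterations;
}
```

For example, cycles_per_call(47000u, 5000u, 1000u) gives 42.0. Averaging over many iterations is presumably also where fractional results such as 96.1 come from.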

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..
florianaugustin9
Associate II
Posted on September 10, 2012 at 20:39

Hey :)

Here are my results:

sinf(1.23); Flash: 217 cycles // SRAM: 170 cycles

fZ = sqrt(fY); Flash: 29 cycles // SRAM: 33 cycles

Somehow this is a bit confusing....

I don't understand how integer arithmetic can take a different number of CPU cycles with the same compiler and CPU...

Thank you for replies!

Posted on September 10, 2012 at 22:08

I think if you're timing down at the cycle level you need to look at the code generated, and ideally use assembler. The compiler/code-generator tends to muddy the water, and optimization might cloud it further. For example, at what point does a local/automatic get allocated on the stack vs. held in a register?

The RAM speeds seem to imply a 1-cycle penalty per instruction (25-33% slower). This might come from how the memories are attached to the core, which of them are optimized for instruction fetch, and how that interacts with the pipelining. I don't know enough about the core, or ST's implementation options.

FLASH

   1.0 cycles lX = 123

   2.0 cycles lY = 456

   2.0 cycles lX * lY, 56088

   2.0 cycles fX = 123.456001

   3.0 cycles fY = 9.990000

   3.0 cycles fX * fY, 1233.325439

1765.1 cycles sqrt

4575.1 cycles sin

  42.0 cycles sqrtf

  97.1 cycles sinf

SRAM (0x20000000)

   0.0 cycles lX = 123  (constant)                                                       

   2.0 cycles lY = 456  (literal)

   2.0 cycles lX * lY, 56088

   2.0 cycles fX = 123.456001

   2.0 cycles fY = 9.990000

   4.0 cycles fX * fY, 1233.325439

2960.0 cycles sqrt

5907.0 cycles sin

  54.0 cycles sqrtf

 130.0 cycles sinf

 
frankmeyer9
Associate II
Posted on September 11, 2012 at 08:01

How about your optimization settings?

The difference between -O0 and -O3 is often significant.

As an example, the runtime of the DSP_Lib FFT routine (float32) dropped for me from 10 ms to 3 ms with full optimization (-O0 vs. -O3).

florianaugustin9
Associate II
Posted on September 11, 2012 at 19:41

Optimization level is zero. I would like to try another value, but then the compiler optimizes my measurement away. My routine for measurement:

  int cyc[2], offset;
  float x;

  volatile unsigned int *DWT_CYCCNT = (volatile unsigned int *)0xE0001004; // address of the register
  volatile unsigned int *DWT_CONTROL = (volatile unsigned int *)0xE0001000; // address of the register
  volatile unsigned int *SCB_DEMCR = (volatile unsigned int *)0xE000EDFC; // address of the register

  #define STOPWATCH_START { cyc[0] = *DWT_CYCCNT; }
  #define STOPWATCH_STOP { cyc[1] = *DWT_CYCCNT; cyc[1] = cyc[1] - cyc[0] - offset; }

  STOPWATCH_START
  __asm volatile("nop");
  cyc[1] = *DWT_CYCCNT; cyc[1] = cyc[1] - cyc[0];
  offset = cyc[1] - 1;

  STOPWATCH_START
     lX = 123L; // 2 cycle
     lY = 456L; // 2 cycle
     lZ = lX*lY; // 5 cycles
  STOPWATCH_STOP

After this, cyc[1] holds the measured cycle count with the offset already subtracted.
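One thing the snippet does not show is the counter being switched on: SCB_DEMCR and DWT_CONTROL are declared but never written. A minimal sketch of that step, assuming the standard Cortex-M bit positions (TRCENA is bit 24 of DEMCR, CYCCNTENA is bit 0 of DWT_CTRL). The function name is made up, and the registers are passed as pointers so the same code can be exercised off-target:

```c
#include <stdint.h>

#define DEMCR_TRCENA       (1u << 24) /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA (1u << 0)  /* starts the cycle counter */

/* Hypothetical helper: turn on trace, clear the counter, then start it. */
static void cycle_counter_enable(volatile uint32_t *demcr,
                                 volatile uint32_t *dwt_ctrl,
                                 volatile uint32_t *dwt_cyccnt)
{
    *demcr |= DEMCR_TRCENA;          /* DWT registers are inactive until TRCENA is set */
    *dwt_cyccnt = 0u;                /* start counting from a known value */
    *dwt_ctrl |= DWT_CTRL_CYCCNTENA; /* other control bits are left untouched */
}
```

On the STM32F4 the three arguments would be the register addresses from the snippet (0xE000EDFC, 0xE0001000 and 0xE0001004), cast to volatile uint32_t *.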

Any idea how I can get this working with an optimization level higher than zero??

Thank you!

frankmeyer9
Associate II
Posted on September 11, 2012 at 20:48

If these STOPWATCH_XXX macros are your measurements, you need to make at least the cyc[] variable volatile, too.

In general, the volatile keyword is the weapon of choice to keep the compiler from removing things.

A compiler usually has no concept of I/O space and peripherals - everything is just memory. If the compiler only sees a write access and no read, or a read access and no write, it will ruthlessly optimize it away. With volatile, you tell it that hardware or external code interferes, and that it must not remove those accesses or reorder them relative to each other.
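Applied to the measurement code from the previous post, that could look like the following sketch. It assumes a free-running 32-bit counter such as DWT_CYCCNT; the struct and function names are made up for illustration, and the counter is passed as a pointer so the logic can be tried off-target:

```c
#include <stdint.h>

/* Both fields are volatile so the compiler must keep every store to them. */
typedef struct {
    volatile uint32_t start;   /* counter value when the measurement began */
    volatile uint32_t elapsed; /* cycles between start and stop, minus the offset */
} stopwatch_t;

/* Latch the current counter value. */
static inline void stopwatch_start(stopwatch_t *sw, const volatile uint32_t *counter)
{
    sw->start = *counter;
}

/* Read the counter once, then subtract the start value and the measurement offset. */
static inline void stopwatch_stop(stopwatch_t *sw, const volatile uint32_t *counter,
                                  uint32_t offset)
{
    uint32_t now = *counter;
    /* Unsigned arithmetic wraps modulo 2^32, so a counter overflow between
       start and stop still yields the right difference. */
    sw->elapsed = now - sw->start - offset;
}
```

Note that volatile only protects the counter reads and the stored results. At -O3 the compiler may still fold something like lX*lY into a constant, so the variables being measured may need to be volatile as well.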

florianaugustin9
Associate II
Posted on September 11, 2012 at 21:49

Thank you for that helpful tip!

When the optimization level is set to 3, I save 1 cycle with the integer operations:

lX = 123L; // 2 cycle

 lY = 456L; // 2 cycle

 lZ = lX*lY; // 5 cycles

--> Now 12 compared to the 9 from the website

But I lose 4 cycles compared to optimization level 0 with the float operations:

fX = 123.456; // 3 cycles

 fY = 9.99; // 3 cycles

 fZ = fX * fY; // 6 cycles

--> Now 18/22 compared to the 12 from the website.

This is sooo confusing!!! Heieieieiei.... :D

frankmeyer9
Associate II
Posted on September 12, 2012 at 08:43

Perhaps they used another compiler. One piece of code does not necessarily translate into the same sequence of machine instructions with different compilers, even with similar settings.

You can compare the asm/machine instructions, if you deem this important.

It's not a thing I am concerned about, though ...