STM32F4 function call overhead

andreas239955 · ‎2014-04-08

Posted on April 08, 2014 at 17:15

Hello,

I'm fairly new to the STM32 processors, but got to like them quickly. Currently working with the STM32F401RE on a NUCLEO board, I made an observation I can't explain. Maybe somebody else can (easily).

I let STM32CubeMX generate the startup code. I configured it to use the HSI source (HSI RC, f = 16 MHz) as SYSCLK and left the AHB prescaler at its default of 1, so HCLK is 16 MHz too.

Then I configured the SysTick to 100 us. I know this is a bit shorter than usual, but for this project it's fine. I wrote a test that lets me check with my watch that it really runs at 10 kHz (e.g. waiting 50000 times for the tick to occur takes 5 s). That works fine.

Now to the problem. The time for calling two very short functions seems far too long to me. Pseudocode:

unsigned long long diff_us, time1_us, time2_us;

drvSysTick_getUptime(&time1_us); // Time driver function, see below
drvSysTick_getUptime(&time2_us);
diff_us = time2_us - time1_us;   // diff_us = 118us

My time driver functions, which are based on the SysTick, return values for diff like 118 us. Using the latest Keil MDK, changing the optimization level doesn't reduce the value.

118us at 16MHz correspond to 1888 CPU cycles. I have no idea what could take that long. drvSysTick_getUptime() looks exactly like this:

int drvSysTick_getUptime(unsigned long long* uptime)
{
    unsigned int sysTickValue;   // System Tick counter counts down!

    DISABLE_INTERRUPTS();        // __disable_irq()
    sysTickValue = SysTick->VAL;
    *uptime = sInstDscr.uptime;  // Updated by ISR, value in us
    ENABLE_INTERRUPTS();         // __enable_irq()

    sysTickValue = SysTick->LOAD - sysTickValue;
    *uptime += (unsigned long long)sysTickValue * 1000000 / SystemCoreClock;

    return R_SUCCESS;
}
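The cycle estimate quoted above (118 us at 16 MHz) can be checked with a few lines of C. This is only a sketch of the arithmetic; the helper name us_to_cycles is made up for illustration and is not part of the driver.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original driver):
 * number of CPU cycles that elapse in `us` microseconds
 * at a core clock of `core_clock_hz`. */
uint64_t us_to_cycles(uint32_t us, uint32_t core_clock_hz)
{
    /* Widen before multiplying so the product cannot overflow 32 bits. */
    return (uint64_t)us * core_clock_hz / 1000000u;
}
```

With us = 118 and core_clock_hz = 16000000 this gives 1888 cycles, matching the figure in the post.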

Any hint appreciated.

Kind regards,

Res

stm322399 · ‎2014-04-08

Posted on April 08, 2014 at 17:29

The division is a serious candidate for eating CPU cycles.

I don't know your environment, check for hard-float support.

andreas239955 · ‎2014-04-08

Posted on April 08, 2014 at 20:04

I checked the disassembly window. I see that for the multiplication the UMULL instruction is used. For the division, the __aeabi_uldivmod() function is called, which is longer than I expected. I'll have a look at it, thanks.
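To make the disassembly observation easier to reproduce, the conversion expression can be isolated in a small function. This is a minimal sketch; the name ticks_to_us_64 is hypothetical, and the comments describe what an ARM EABI toolchain typically emits rather than a guaranteed result for every compiler version.

```c
#include <stdint.h>

/* Hypothetical helper isolating the conversion used in
 * drvSysTick_getUptime(): timer ticks -> microseconds. */
uint64_t ticks_to_us_64(uint32_t ticks, uint32_t core_clock_hz)
{
    /* 32-bit x 32-bit -> 64-bit product: typically one UMULL. */
    uint64_t scaled = (uint64_t)ticks * 1000000u;

    /* 64-bit dividend: under the ARM EABI this becomes a call to
     * the run-time helper __aeabi_uldivmod, not a single instruction. */
    return scaled / core_clock_hz;
}
```

Compiling only this function and looking at the listing should show the same pattern as described above: one multiply instruction followed by a branch to the division helper.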

andreas239955 · ‎2014-05-02

Posted on May 02, 2014 at 16:12

Just to close this case and confirm Laurent's suspicion: the 64-bit (long long) division is executed by a library function, even when hard floating-point support is enabled. The FPU accelerates floating-point operations, not integer operations.

Thanks for the hint.

chen · ‎2014-05-02

Posted on May 02, 2014 at 16:25

Hi

The FPU only works on Floats - it does not work on Long

Hence the call to the library routine.

I doubt there is a divide operation in the ARM.

Divide traditionally takes multiple clock cycles to complete.

The ARM is a RISC processor (completes instructions in 1 instruction cycle).

Divide is only found on CISC processors.

Apparently, the MULT instruction can be a SIMD on the ARM.

stm322399 · ‎2014-05-02

Posted on May 02, 2014 at 17:40

Right, integer division has nothing to do with float support, my bad.

I was misled by the fact that hardware division (like float operations) requires compiler support.

The F4 parts are Cortex-M4, and the Cortex-M4 implements ARMv7E-M, which has the UDIV and SDIV instructions (32-bit integer divide).

A GCC that is not too old generates hardware divide instructions for the M4 if you ask nicely by adding -march=armv7-m or -mcpu=cortex-m4 to the compiler switches.
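One caveat worth spelling out: UDIV and SDIV take 32-bit operands, so a division with a 64-bit dividend still goes through the library routine whatever the switches are. A sketch of a 32-bit-only conversion that the compiler can map to UDIV is shown below. The function name ticks_to_us_32 is hypothetical, and the approach assumes the core clock is a whole number of MHz (true for the 16 MHz HSI case in this thread); for other clocks the result loses precision.

```c
#include <stdint.h>

/* Hypothetical 32-bit variant of the tick -> microsecond conversion.
 * Assumption: core_clock_hz is an exact multiple of 1 MHz. */
uint32_t ticks_to_us_32(uint32_t ticks, uint32_t core_clock_hz)
{
    /* Timer counts per microsecond, e.g. 16 at 16 MHz. In a real
     * driver this could be computed once at init time. */
    uint32_t ticks_per_us = core_clock_hz / 1000000u;

    if (ticks_per_us == 0u) {
        return 0u;  /* clock below 1 MHz: this sketch does not handle it */
    }

    /* 32-bit / 32-bit: can compile to a single UDIV when the
     * target is given as -mcpu=cortex-m4. */
    return ticks / ticks_per_us;
}
```

Since the SysTick counter is only 24 bits wide, the tick count always fits in 32 bits, so nothing is lost by keeping this part of the calculation out of long long arithmetic.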