
Using NUCLEO F756ZG - Checking speed of calculations

JJAYAR
Associate II

Hello all,

Just started working with STM - Nucleo is my first trial at this

I am operating at 216MHz

I noticed that this while loop takes almost 80 µs to complete. It performs a simple trial floating-point calculation and outputs a square wave that I can see on the scope.

At 216 MHz, this is about 17400 clock cycles

I was wondering why the chip would take this much time to perform this simple calculation

Any comments/suggestions would be very welcome

Thanx for your help in advance

 

double c=0;//global
double a=0;//global
double pi=3.1415;//global
....
while (1)
{
    //HAL_GPIO_TogglePin(LD1_GPIO_Port, LD1_Pin);this is really slow!!!
    GPIOC->ODR = 0x00000100;//see this on scope
    c = (a/pi + sqrt(pi))*(a/pi + sqrt(pi));//simple calculation
    GPIOC->ODR = 0x01000000;//see this on scope
    c = (a/pi + sqrt(pi))*(a/pi + sqrt(pi));//simple calculation
    if(c==0)
        GPIOB->ODR = 0x00000001;
    /* USER CODE END WHILE */

    /* USER CODE BEGIN 3 */
}
......

Thanx 

Jay

5 REPLIES
gregstm
Senior III

Just some thoughts -

- glad you are writing directly to the register for your timing purposes. But I would suggest you use the BSRR register rather than the ODR register. That way you can set and clear individual bits independently and create a more elaborate trace with multiple pins if needed, e.g.

#define RED_LED_ON GPIOB->BSRR = GPIO_BSRR_BS_14

#define RED_LED_OFF GPIOB->BSRR = GPIO_BSRR_BR_14

- have a look at the code at the assembly level, even step through the code using the software simulator (a great, yet underrated tool). You may not understand everything that is going on, but you should be able to see if the software is making "calls" to libraries instead of quicker inline instructions. Then you can experiment with the code and compiler settings etc. to see if you can get the calculation working better/faster.
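To make the BSRR suggestion above a bit more concrete, here is a minimal host-side model of what a BSRR write does to the port outputs: bits 0-15 set the matching pins, bits 16-31 reset them, and set has priority if both are requested for the same pin. `gpio_model_t` and `bsrr_write` are hypothetical names used only for this illustration; on the real chip you just write to `GPIOx->BSRR` as in the macros above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of one GPIO port (not a real HAL type). */
typedef struct {
    uint16_t odr; /* models the 16 output bits of GPIOx->ODR */
} gpio_model_t;

/* Models a single write to BSRR.
 * Low half-word: pins to set. High half-word: pins to reset.
 * Pins not mentioned in either half keep their current state. */
static void bsrr_write(gpio_model_t *port, uint32_t value)
{
    assert(port != NULL);

    uint16_t set_mask   = (uint16_t)(value & 0xFFFFu);
    uint16_t reset_mask = (uint16_t)(value >> 16);

    /* Apply the reset first, then the set, so that set wins on a conflict. */
    port->odr = (uint16_t)((port->odr & (uint16_t)~reset_mask) | set_mask);
}
```

With pin 0 already high, `bsrr_write(&port, 1u << 8)` raises pin 8 and `bsrr_write(&port, 1u << 24)` lowers it again, and pin 0 is untouched both times. A plain assignment to ODR would overwrite every pin of the port.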

LCE
Principal

Use the cycle counter embedded into the ARM core (documentation on ARM website).

 

/* CPU cycle count activation for debugging */
#if DEBUG_CPU_TIMING
	CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
	DWT->LAR = 0xC5ACCE55;
	DWT->CYCCNT = 0;
	DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
	DWT->CTRL |= DWT_CTRL_PCSAMPLENA_Msk;
#endif

usage:
#if DEBUG_CPU_TIMING
	u32CycFuncStart = DWT->CYCCNT;
#endif

	FunctionUnderTest();

#if DEBUG_CPU_TIMING
	u32CycFuncStop = DWT->CYCCNT;
	u32CycFuncSum[1] = u32CycFuncStop - u32CycFuncStart;
	u64CycSumFunc[1] += (uint64_t)u32CycFuncSum[1];
	if( u32CycFuncSum[1] > u32CycFuncMax[1] ) u32CycFuncMax[1] = u32CycFuncSum[1];
	u32CycFuncCalls[1]++;
#endif
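One detail worth spelling out: CYCCNT is a free-running 32-bit counter, so at 216 MHz it wraps roughly every 19.9 seconds. The plain `stop - start` subtraction used above still gives the right answer across a single wrap, because unsigned arithmetic is modulo 2^32. The sketch below shows the same sum/max/calls bookkeeping as a small helper that can be tried on a PC; `cyc_stats_t` and `cyc_stats_add` are hypothetical names, not part of CMSIS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-function statistics, mirroring the sum/max/calls
 * variables used in the snippet above. */
typedef struct {
    uint64_t sum;   /* total cycles over all calls */
    uint32_t max;   /* worst case seen so far */
    uint32_t calls; /* number of measurements */
} cyc_stats_t;

/* Record one measurement from two raw counter readings.
 * Unsigned subtraction is modulo 2^32, so the difference is still
 * correct if the counter wrapped once between the two readings. */
static uint32_t cyc_stats_add(cyc_stats_t *s, uint32_t start, uint32_t stop)
{
    assert(s != NULL);

    uint32_t elapsed = stop - start;

    s->sum += elapsed;
    if (elapsed > s->max) {
        s->max = elapsed;
    }
    s->calls++;
    return elapsed;
}
```

On the target, the two readings would come from `DWT->CYCCNT` before and after the function under test. A measurement longer than one full wrap cannot be recovered this way.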

 

 

For even more accuracy, disable the ISRs with __disable_irq();

Have you enabled the hardware floating point unit (FPU) in the IDE? Somewhere in the project settings...

Edit: I would not call "sqrt(pi)" a simple calculation for real electronics HW. 😉

 

Do you have caches enabled?

You may need to enable and use the HW FP unit, maybe with the ARM DSP functions instead...?
Is the HW FPU enabled?

"At 216MHz , this is about 17400 clock cycles":
do you mean it takes approx. 80.5 micro-sec? (measured via the GPIO toggling, as you mentioned)

Assuming 4 MCU cycles on average for a single C-code instruction, and that SQRT needs at least 100 cycles (?), which you use twice, plus the divisions, and all as double: you might have approx. 400 instructions * 4 cycles = 1600 cycles, which is about 7.4 micro-sec at 216 MHz.

"hmmm", a remarkable difference: 80.5 micro-sec. vs. 7.4 micro-sec. (for a half-cycle)
(or do you mean 2x 7.4 micro-sec., for a full GPIO cycle?)
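The numbers in this comparison can be reproduced with two small conversion helpers. This is only a sketch of the arithmetic; `us_to_cycles` and `cycles_to_us` are hypothetical helper names, and the 400 instructions and 4 cycles per instruction are the rough guesses from above, not measurements.

```c
#include <assert.h>

/* Convert a duration in microseconds to core clock cycles. */
static double us_to_cycles(double us, double core_hz)
{
    return us * core_hz / 1e6;
}

/* Convert a number of core clock cycles to microseconds. */
static double cycles_to_us(double cycles, double core_hz)
{
    assert(core_hz > 0.0);
    return cycles * 1e6 / core_hz;
}
```

At 216 MHz, `us_to_cycles(80.5, 216e6)` gives 17388 cycles for the measured time, and `cycles_to_us(400.0 * 4.0, 216e6)` gives about 7.4 micro-sec for the estimate, so the measurement is roughly 11 times slower than the estimate.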

Do you build this code with full optimization, i.e. -O3 and debug set to none, rather than -g3 and -Og?

If you want to fight for the best speed, also consider whether you have DTCM and ITCM, and place variables and code there.
This matters especially if you run the code from flash, with its FLASH_LATENCY wait states. Check whether you have ITCM and DTCM and optimize the memory layout (move the code to ITCM, and the variables and stack to DTCM).

These results are certainly unexpectedly bad, but I think there is still room for performance improvement on this MCU/board.

 

Thanx - very helpful info

Thanx - I was running off flash & not the RAM

Once the controller was configured to run off RAM, most latencies vanished!