2024-02-13 11:26 PM
Hello all,
Just started working with STM - Nucleo is my first trial at this
I am operating at 216MHz
I noticed that this while loop takes almost 80uSec to complete with the simple trial floating point calculations.
It outputs a square wave that I can see on the scope.
At 216MHz , this is about 17400 clock cycles
I was wondering why the chip would take this much time to perform this simple calculation
Any comments/suggestions would be very welcome
Thanx for your help in advance
double c=0;//global
double a=0;//global
double pi=3.1415;//global
....
while (1)
{
    //HAL_GPIO_TogglePin(LD1_GPIO_Port, LD1_Pin); this is really slow!!!
    GPIOC->ODR = 0x00000100; //see this on scope
    c = (a/pi + sqrt(pi))*(a/pi + sqrt(pi)); //simple calculation
    GPIOC->ODR = 0x01000000; //see this on scope
    c = (a/pi + sqrt(pi))*(a/pi + sqrt(pi)); //simple calculation
    if (c == 0)
        GPIOB->ODR = 0x00000001;
    /* USER CODE END WHILE */
    /* USER CODE BEGIN 3 */
}
......
Thanx
Jay
2024-02-14 10:29 PM
Just some thoughts -
- glad you are writing direct to the register for your timing purposes. But I would suggest you use the BSRR register rather than the ODR register, that way you can toggle multiple bits independently and create a more elaborate trace with multiple pins if needed. eg.
#define RED_LED_ON GPIOB->BSRR = GPIO_BSRR_BS_14
#define RED_LED_OFF GPIOB->BSRR = GPIO_BSRR_BR_14
- have a look at the code at the assembly level, even step through the code using the software simulator (a great, yet underrated tool). You may not understand everything that is going on, but you should be able to see if the software is making "calls" to libraries instead of quicker inline instructions. Then you can experiment with the code and compiler settings etc. to see if you can get the calculation working better/faster.
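To illustrate the BSRR suggestion, here is a minimal host-side sketch of how a BSRR write affects ODR: bits 15..0 set the matching pin, bits 31..16 reset pin (n - 16), and the other pins are left alone. The names fake_gpio_t, bsrr_set, bsrr_reset and fake_bsrr_write are made up for this example; on the target you would write straight to GPIOx->BSRR from the CMSIS device header. By contrast, a plain write to ODR overwrites all 16 pins of the port at once.

```c
#include <stdint.h>

/* Host-side stand-in for a GPIO port (hypothetical type for illustration).
   On the target, use GPIO_TypeDef / GPIOx from the CMSIS device header. */
typedef struct {
    volatile uint32_t ODR;   /* output data register, bits 15..0 */
} fake_gpio_t;

/* BSRR value that sets pin n: bits 15..0 set the matching ODR bit. */
static inline uint32_t bsrr_set(unsigned pin)
{
    return (uint32_t)1 << pin;
}

/* BSRR value that resets pin n: bits 31..16 clear ODR bit (n - 16). */
static inline uint32_t bsrr_reset(unsigned pin)
{
    return (uint32_t)1 << (pin + 16U);
}

/* Models the effect of a BSRR write on ODR. Only the pins named in the
   written value change; if a pin is both set and reset, set wins. */
static void fake_bsrr_write(fake_gpio_t *g, uint32_t value)
{
    uint32_t odr = g->ODR;
    odr &= ~(value >> 16);      /* apply the reset bits first */
    odr |= (value & 0xFFFFU);   /* then the set bits, so set has priority */
    g->ODR = odr & 0xFFFFU;
}
```

Incidentally, bsrr_reset(8) evaluates to 0x01000000, the same constant used in the loop above, so that value looks like it was meant for BSRR rather than ODR.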
2024-02-14 10:36 PM - edited 2024-02-14 10:37 PM
Use the cycle counter embedded into the ARM core (documentation on ARM website).
/* CPU cycle count activation for debugging */
#if DEBUG_CPU_TIMING
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->LAR = 0xC5ACCE55;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
DWT->CTRL |= DWT_CTRL_PCSAMPLENA_Msk;
#endif
usage:
#if DEBUG_CPU_TIMING
u32CycFuncStart = DWT->CYCCNT;
#endif
FunctionUnderTest();
#if DEBUG_CPU_TIMING
u32CycFuncStop = DWT->CYCCNT;
u32CycFuncSum[1] = u32CycFuncStop - u32CycFuncStart;
u64CycSumFunc[1] += (uint64_t)u32CycFuncSum[1];
if( u32CycFuncSum[1] > u32CycFuncMax[1] ) u32CycFuncMax[1] = u32CycFuncSum[1];
u32CycFuncCalls[1]++;
#endif
For even more accuracy, disable the ISRs with __disable_irq();
Have you enabled the hardware floating point unit (FPU) in the IDE? Somewhere in the project settings...
Edit: I would not call "sqrt(pi))" a simple calculation for real electronics HW. ;)
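The bookkeeping in the usage snippet above can be pulled into a small helper. This is only a sketch: cyc_stats_t and cyc_stats_update are names invented for this example, and the start/stop values are assumed to be two reads of DWT->CYCCNT taken around the code under test. Doing the subtraction in uint32_t gives the right result even if the counter wraps once between the two reads (at 216 MHz the 32-bit counter wraps roughly every 20 seconds).

```c
#include <stdint.h>

/* Running statistics for one measured function (hypothetical type). */
typedef struct {
    uint32_t last;   /* cycles of the most recent measurement */
    uint32_t max;    /* worst case seen so far */
    uint64_t sum;    /* total cycles, 64-bit so it does not overflow quickly */
    uint32_t calls;  /* number of measurements */
} cyc_stats_t;

/* start/stop are two reads of the free-running 32-bit cycle counter.
   Unsigned subtraction is modulo 2^32, so a single wrap is handled. */
static void cyc_stats_update(cyc_stats_t *s, uint32_t start, uint32_t stop)
{
    uint32_t elapsed = stop - start;

    s->last = elapsed;
    s->sum += elapsed;
    if (elapsed > s->max) {
        s->max = elapsed;
    }
    s->calls++;
}
```

On the target, usage would be along the lines of: read DWT->CYCCNT into start, call FunctionUnderTest(), read DWT->CYCCNT into stop, then call cyc_stats_update(&stats, start, stop). The average is sum / calls.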
2024-02-14 11:49 PM
Do you have caches enabled?
You may need to enable and use the HW FP unit, maybe with the ARM DSP functions instead...?
Is the HW FPU enabled?
"At 216MHz , this is about 17400 clock cycles":
do you mean it takes approx. 80.5 micro-sec? (Yes, you mentioned you see this on the GPIO toggling.)
Assuming 4 MCU cycles on average for a single C-code instruction, and SQRT needing at least 100 cycles (?), which you use twice, plus the divisions, and all as double: you might have approx. 400 instructions * 4 cycles = 1600 cycles, which is about 7.4 micro-sec at 216MHz.
"Hmmm", a remarkable difference: 80.5 micro-sec measured vs. 7.4 micro-sec estimated (for a half-cycle)
(or do you mean 2x 7.4 micro-sec, for a full GPIO cycle?)
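The arithmetic behind these numbers, as a small sketch. CPU_HZ is taken from the 216MHz stated in the first post; the helper names cycles_to_us and us_to_cycles are made up for this example.

```c
/* Core clock from the original post (216 MHz). */
#define CPU_HZ 216000000.0

/* Convert a cycle count to microseconds at CPU_HZ. */
static double cycles_to_us(double cycles)
{
    return cycles * 1.0e6 / CPU_HZ;
}

/* Convert a duration in microseconds to core cycles at CPU_HZ. */
static double us_to_cycles(double us)
{
    return us * CPU_HZ / 1.0e6;
}
```

With these, us_to_cycles(80.5) gives 17388 cycles (the "about 17400" from the first post), and cycles_to_us(1600.0) gives about 7.41 micro-sec, so the measurement is roughly an order of magnitude above the rough estimate.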
Do you build this code with full optimization (-O3 and no debug info, rather than -Og with -g3)?
If you want to fight for the best speed: also consider if you have DTCM, ITCM, to place variables and code there for fastest speed:
Especially if you run the code from flash, with its FLASH_LATENCY wait states: check whether your part has ITCM and DTCM and optimize the "memory layout" (move the code to ITCM, and the variables and stack to DTCM).
These are somewhat unexpectedly bad results, but I think there is still room for performance improvements on this MCU/board.
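One common way to get a hot function out of flash with GCC is a section attribute. The sketch below rests on assumptions: it presumes the linker script provides a ".RamFunc" input section that the startup code copies to RAM (STM32CubeIDE-generated scripts typically do, but check yours), and RAM_FUNC and square_of_sum are names made up for this example. Placing code in ITCM specifically would need its own section and copy step in the linker script and startup code, which is not shown here.

```c
/* ASSUMPTION: the linker script has a ".RamFunc" section that is copied to
   RAM at startup. On non-ARM hosts the attribute is dropped so the sketch
   still builds and runs as an ordinary function. */
#if defined(__GNUC__) && defined(__arm__)
#define RAM_FUNC __attribute__((section(".RamFunc"), noinline))
#else
#define RAM_FUNC
#endif

/* Hypothetical helper: the test expression from the first post, with
   sqrt(pi) computed once by the caller and passed in as root_pi, so the
   square root and the division are not repeated. */
RAM_FUNC double square_of_sum(double a, double pi, double root_pi)
{
    double t = a / pi + root_pi;
    return t * t;
}
```

Whether this helps on a given board depends on the flash wait states and on whether the ART accelerator and caches are already enabled, so it is worth measuring before and after with the cycle counter.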
2024-03-03 08:00 PM
Thanx - very helpful info
2024-03-03 08:03 PM
Thanx - I was running off flash & not the RAM
Once the controller was configured to run off RAM, most latencies vanished!