How do I calculate how many clock cycles are needed to run block of code?

LMorr.3
Senior II

Is it possible to calculate how many clock ticks a function or select block of code will use up?


On most STM32 you can use the DWT's CYCCNT

On CM0(+), which has no CYCCNT, perhaps use one of the TIM (TIM2 or TIM5 are 32-bit on some STM32)

#include <stdint.h>

volatile unsigned int *DWT_CYCCNT   = (volatile unsigned int *)0xE0001004; // cycle counter register
volatile unsigned int *DWT_CONTROL  = (volatile unsigned int *)0xE0001000; // DWT control register
volatile unsigned int *DWT_LAR      = (volatile unsigned int *)0xE0001FB0; // lock access register
volatile unsigned int *SCB_DEMCR    = (volatile unsigned int *)0xE000EDFC; // debug exception and monitor control register
 
{
  uint32_t x, y;
  uint32_t Cycles;
 
  *SCB_DEMCR |= 0x01000000; // set TRCENA (bit 24) to enable the DWT
  *DWT_LAR = 0xC5ACCE55; // unlock access (needed on some cores, e.g. CM7)
  *DWT_CYCCNT = 0; // reset the counter
  *DWT_CONTROL |= 1 ; // set CYCCNTENA (bit 0) to enable the counter
 
  x = *DWT_CYCCNT;
...
  y = *DWT_CYCCNT;
 
  Cycles = (y - x); // unsigned subtraction stays correct across one counter wrap
}
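
To turn the two CYCCNT samples into something easier to read, a small helper along these lines might do. This is only a sketch: the function names cycles_elapsed and cycles_to_ns are made up for illustration, and the core clock is passed in as a parameter because the thread does not say what it is.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles between two CYCCNT samples. Unsigned 32-bit arithmetic is
 * modulo 2^32, so the result is still right if the counter wrapped
 * once between the two reads. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to nanoseconds for a given core clock in Hz.
 * core_hz is an assumption supplied by the caller. */
uint64_t cycles_to_ns(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz > 0);
    return ((uint64_t)cycles * 1000000000ull) / core_hz;
}
```

For example, with a hypothetical 168 MHz core clock, cycles_to_ns(Cycles, 168000000u) would give the elapsed time of the measured block in nanoseconds. Anything longer than one full wrap of the 32-bit counter cannot be recovered this way.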

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..
Danish1
Lead III

In general the answer is no. It is easy to write code where the number of times it loops depends on some complicated function of the input-value(s) hence execution-time is equally varied.

And even without loops and branches (e.g. if-statements) it can be hard to get an exact number of cycles. Many stm32* have things like caches that reduce the number of cycles it takes to read or write to memory, so the total number of cycles depends on whether or not the cache was able to help.

And the ARM core might not be the only thing accessing memory - there might be DMA fighting for access over the bus-matrix as well.

*But not the "simpler" ones, e.g. stm32l0, stm32f0

Having said all this, for a lot of code the number of cycles might only vary by less than (say) 10%. So an empirical approach of measuring it - as described above - is often good enough if you are only interested in the stm32 having enough time to complete its tasks, not using the execution-time as the basis for a delay.
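
One way to put a number on that variation is to run the block many times and keep the minimum, maximum and mean cycle counts. The following is a rough sketch only; cycle_stats_t and the stats_* functions are hypothetical names, and each sample is assumed to come from a CYCCNT difference as described above.

```c
#include <assert.h>
#include <stdint.h>

/* Running statistics over repeated cycle-count measurements. */
typedef struct {
    uint32_t min;
    uint32_t max;
    uint64_t sum;
    uint32_t count;
} cycle_stats_t;

void stats_init(cycle_stats_t *s)
{
    s->min = UINT32_MAX;
    s->max = 0;
    s->sum = 0;
    s->count = 0;
}

/* Record one measurement (e.g. y - x from two CYCCNT reads). */
void stats_add(cycle_stats_t *s, uint32_t cycles)
{
    if (cycles < s->min) s->min = cycles;
    if (cycles > s->max) s->max = cycles;
    s->sum += cycles;
    s->count++;
}

uint32_t stats_mean(const cycle_stats_t *s)
{
    assert(s->count > 0);
    return (uint32_t)(s->sum / s->count);
}

/* Spread between slowest and fastest run, as a percentage of the fastest. */
uint32_t stats_spread_percent(const cycle_stats_t *s)
{
    assert(s->count > 0 && s->min > 0);
    return (uint32_t)(((uint64_t)(s->max - s->min) * 100u) / s->min);
}
```

If the spread stays small over a realistic range of inputs, budgeting for the maximum plus some margin is probably good enough for checking that the tasks fit in the available time.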

(I remember seeing code for extremely simple microcontrollers without timer peripherals, where great effort went into ensuring each branch of possible program flow took exactly the same number of cycles. Things have improved since then.)

Hope this helps,

Danish

Microsoft added something similar to MASM back in the 6.X era, and I built several annotation tools for the MC68000 and 68020 I was using at one point, being one of those engineers who writes software.

TBH I can do static code reduction in my head, and find dynamic analysis to be more fun when optimizing algorithms or complex system interactions.


Did not know about this, thanks!

Great info, thank you. Is there a reference showing how many clock cycles are used for additions, compares, etc.? Maybe what I can do is assign 'much more time than is expected to be needed' for certain blocks of code to execute, and not worry so much about optimizing timing unless I run out of cycles. I'll post another question to FreeRTOS with specifics on how I need to time my 3 tasks: 1 for user interface input, 1 to send out a pulse at a precise interval, and 1 to calculate the CCR, ARR, RCR and prescaler register values for the 'next pulse'.

Since my app's timing is based on the accuracy/timing of 2 output pins, I have hooked up a 4 channel scope to see if periods/pulse widths are accurate and consistent over time. If I see jitter, I'll look at timing/clock cycles used by each task.

You don't mention a part, guessing an STM32F4 from some previous discussions.

ARM has technical reference manuals

The problem is that this is a pipelined processor. Throughput is generally 1 cycle per instruction and the MUL/DIV are done in hardware; the LDR/STR are what generally take the time, or force in-order completion. Then you've got FLASH, and the line caches for that. The CM7 is superscalar, so it can issue multiple instructions per cycle depending on whether units are busy; its pipeline is longer, and there are architected caches.

This isn't an MCU from the 1970s, so there isn't a neat decoder card showing 4 cycles for an instruction with an immediate, and 12 cycles for a more complex addressing mode variant.

Doesn't the TIM have shadow registers? Aren't those supposed to allow control of when the values are taken, rather than on-the-fly?

Transactions to the TIM on the APB are going to take on the order of 4 cycles.


I'm using the TIM preload registers, which works well. My app sends out a pulse at a specific user-selected interval (sequencer). In order to send the pulse, user input must be read and used to calculate preload register values just before the next pulse is sent, ensuring the latest user input is being used. My current solution is to trigger a FreeRTOS task to read user input values and do the calculations for the preload values 20ms before the pulse is sent. The fastest I will send pulses is every 60ms.
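
As an illustration of the kind of preload calculation involved, here is a minimal sketch that picks prescaler and auto-reload values for a requested period. It is not the poster's code: tim_preload_t and tim_compute_preload are hypothetical names, a 16-bit timer is assumed, the timer clock is passed in by the caller, and CCR and RCR are left out. It relies on the usual STM32 relationship period = (PSC + 1) * (ARR + 1) / timer clock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values intended for the PSC and ARR preload registers of a 16-bit timer. */
typedef struct {
    uint16_t psc;
    uint16_t arr;
} tim_preload_t;

/* Compute PSC/ARR for a period in microseconds.
 * Returns 0 on success, -1 if the period cannot be represented. */
int tim_compute_preload(uint32_t tim_clk_hz, uint32_t period_us, tim_preload_t *out)
{
    assert(out != NULL);

    /* Total timer clock ticks in the requested period. */
    uint64_t ticks = ((uint64_t)tim_clk_hz * period_us) / 1000000u;
    if (ticks == 0) {
        return -1; /* shorter than one timer tick */
    }

    /* Smallest divider (PSC + 1) that lets ARR + 1 fit in 16 bits. */
    uint64_t div = (ticks + 65535u) / 65536u;
    if (div > 65536u) {
        return -1; /* too long for a 16-bit prescaler */
    }

    uint64_t reload = ticks / div; /* ARR + 1, rounded down */
    out->psc = (uint16_t)(div - 1u);
    out->arr = (uint16_t)(reload - 1u);
    return 0;
}
```

With an assumed 84 MHz timer clock and the 60ms minimum interval, this yields PSC = 76 and ARR = 65453, which is a fraction of a microsecond short of the requested period because of integer rounding. Choosing the smallest prescaler keeps the timer resolution as fine as possible; a real implementation might instead search for a divider that hits the period exactly.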

gregstm
Senior III

When optimising time-critical assembly language, to get predictable timing, I have made sure the instructions are aligned to a 4-byte (word) boundary, and sometimes ensured all instructions are 32 bits long. It's the memory accesses that are more complicated, and if you are trying to save cycles, it is usually more efficient to load multiple registers with data with one instruction.

LMorr.3
Senior II

I'm also using the FreeRTOS idle task hook to gauge idle time. From the docs:

"Measuring the amount of spare processing capacity. (The idle task will run only when all higher priority application tasks have no work to perform; so measuring the amount of processing time allocated to the idle task provides a clear indication of how much processing time is spare.)"
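
A rough sketch of how that measurement could look is below. vApplicationIdleHook is the FreeRTOS idle hook (it needs configUSE_IDLE_HOOK set to 1); everything else, including idle_take_sample, spare_percent and the idea of comparing against a baseline count taken with no other tasks running, is an assumption for illustration rather than anything from the docs.

```c
#include <assert.h>
#include <stdint.h>

/* Incremented every time the idle task gets to run its hook. */
static volatile uint32_t idle_ticks;

/* FreeRTOS calls this from the idle task when configUSE_IDLE_HOOK is 1. */
void vApplicationIdleHook(void)
{
    idle_ticks++;
}

/* Read and clear the counter, e.g. once per fixed sampling window.
 * On a real target the read-and-clear should sit in a critical section. */
uint32_t idle_take_sample(void)
{
    uint32_t n = idle_ticks;
    idle_ticks = 0;
    return n;
}

/* Spare capacity in percent, relative to a baseline count measured over
 * the same window with no other tasks running (hypothetical calibration). */
uint32_t spare_percent(uint32_t sample, uint32_t baseline)
{
    assert(baseline > 0);
    if (sample > baseline) {
        sample = baseline; /* clamp measurement noise */
    }
    return (uint32_t)(((uint64_t)sample * 100u) / baseline);
}
```

Calling idle_take_sample() at a fixed interval and passing the result to spare_percent() gives a coarse idea of how much headroom is left. It says nothing about whether an individual deadline is met, so it complements the scope measurements instead of replacing them.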