LL_TIM_*() calls take too much time.

schperplata
Associate III

I am having a problem with a timer. I use CubeMX LL drivers, with basic time-base timer configuration.

STM32L100RCT6, TIM6 and TIM7. Core Clock = 32MHz, PCLK1 (TIM6 and TIM7) = 32MHz

Prescaler = 31, which should make the timer tick at 1 us. ARR is set to its maximum, 65535, and interrupts are disabled.

Although my timer does tick at 1 us, starting and stopping the timer adds another few microseconds, which is unacceptable for my application. I added GPIO writes (directly via registers) and checked the timings with a logic analyzer.

Here is the pseudo-code of my test:

// start timer
<GPIO HIGH>
LL_TIM_SetCounter(TIM6, 0);
<GPIO LOW>
LL_TIM_EnableCounter(TIM6);
<GPIO HIGH>
while(LL_TIM_IsEnabledCounter(TIM6) == 0) {};
<GPIO LOW>
 
// poll timer until value match
while(getUsDelayTimerValue() < 99) {};
<GPIO HIGH>
 
// stop timer
LL_TIM_DisableCounter(TIM6);
<GPIO LOW>

I assumed that any LL_TIM_* call would take a few nanoseconds, or at least far below the microsecond range, since they mostly resolve to single register writes. The problem is that in reality these calls take much more time than they should:

// start timer
LL_TIM_SetCounter(TIM6, 0); -> 250 ns
LL_TIM_EnableCounter(TIM6); -> 291.67 ns
while(LL_TIM_IsEnabledCounter(TIM6) == 0){}; -> 583.33 ns
 
// poll timer until value match
while(getUsDelayTimerValue() < 99){} -> 100.542 us
 
// stop timer
LL_TIM_DisableCounter(TIM6); -> 791.67 ns

... which altogether takes 102.458 us instead of the expected 100 us.

What am I doing wrong or is there a thing I missed?

This is the code CubeMX generates for timer initialization:

void MX_TIM6_Init(void)
{
  LL_TIM_InitTypeDef TIM_InitStruct = {0};

  /* Peripheral clock enable */
  LL_APB1_GRP1_EnableClock(LL_APB1_GRP1_PERIPH_TIM6);

  /* TIM6 interrupt Init */
  NVIC_SetPriority(TIM6_IRQn, NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 1, 0));
  NVIC_EnableIRQ(TIM6_IRQn);

  TIM_InitStruct.Prescaler = 0;
  TIM_InitStruct.CounterMode = LL_TIM_COUNTERMODE_UP;
  TIM_InitStruct.Autoreload = 0;
  LL_TIM_Init(TIM6, &TIM_InitStruct);
  LL_TIM_DisableARRPreload(TIM6);
  LL_TIM_SetTriggerOutput(TIM6, LL_TIM_TRGO_RESET);
  LL_TIM_DisableMasterSlaveMode(TIM6);
}

And this is how I then set the timer prescaler and auto-reload:

void initTimer(void)
{
  LL_RCC_ClocksTypeDef clocks;

  NVIC_ClearPendingIRQ(TIM6_IRQn);
  LL_RCC_GetSystemClocksFreq(&clocks);
  uint32_t prescaller = US_DELAY_TIMER_TICK * clocks.PCLK1_Frequency / 1e6 - 1;
  assert_param(prescaller <= 65535);
  LL_TIM_SetPrescaler(TIM6, prescaller);

  LL_TIM_SetAutoReload(TIM6, 65535);
  LL_TIM_ClearFlag_UPDATE(TIM6);
  LL_TIM_EnableIT_UPDATE(TIM6);
  LL_TIM_GenerateEvent_UPDATE(TIM6); // prescaler value is loaded at the next update event
  _waitUntilUpdate(TIM6);
  LL_TIM_ClearFlag_UPDATE(TIM6);

  NVIC_DisableIRQ(TIM6_IRQn);
}


1 ACCEPTED SOLUTION
berendi
Principal

These timings seem realistic to me, except the last one at 791 ns.

Consider a sequence that flips a bit in a control register, then sets the PA6 pin high:

TIM6->CR1 |= TIM_CR1_CEN;
GPIOA->BSRR = 1<<6;

Assuming that both peripherals were accessed recently, so their base addresses are already held in registers (e.g. r2 = TIM6, r3 = GPIOA), this translates to 5 machine instructions (with optimizations turned on):

ldr r0, [r2]
orr r0, #1
str r0, [r2]
mov r0, #64
str r0, [r3,#0x18]

Loads and stores can take 2 cycles each, mov and orr take 1 cycle each, and flash latency can stall the pipeline a bit too, so 9 cycles seems like a worst-case scenario but is still plausible. If you dropped the GPIO writes, you could expect almost half of the overhead to go away.

The last operation taking 791 ns is puzzling; perhaps you can take a look at the disassembly.

You might be able to shave a few cycles off here and there. First of all, use neither HAL nor LL, especially when there are tight timing requirements. Note that in the above example all other bits of CR1 are 0, so writing

TIM6->CR1 = TIM_CR1_CEN;

instead of what LL_TIM_EnableCounter() does makes the time-consuming load from a peripheral register unnecessary, saving 2 or maybe 3 cycles thanks to pipelining. Using the HAL_ or LL_ functions makes spotting these kinds of unnecessary operations harder. (Is there any real benefit to using LL?) Check every operation against the reference manual to see whether it is necessary at all. There is no need to check that the counter is enabled right after enabling it, so there goes another 500 ns of overhead. You don't even have to start and stop the timer every time: just leave it running, and reset the counter whenever the start of an interval has to be marked.

It should be possible to get the overhead down to about 400-600 ns, but I don't think it would go below that. If that is still unacceptable, consider instead what you need the delay for in the first place. Could a timer perhaps do the task on its own, or trigger a DMA transfer or two to do the job?



S.Ma
Principal

To me the implementation is wrong and will cause lots of debugging and side effects.

If the SW timing is nearly at its limit, this is a red alert.

Say you implement USB with interrupts at top priority; then your implementation will malfunction randomly.

Better to use the timers' HW features effectively. Use HW resources as much as possible and make sure the SW has enough margin for everything running on the core.

However, it is still possible to learn this the long and hard way. :D

And you'll start to discover that your code is compiler-optimisation dependent. Go figure...

schperplata
Associate III

You don't even have to start and stop the timer every time, just leave it running, and reset the counter whenever the start of an interval has to be marked.

Both of you are right: this is not the right approach, and what I reported here is actually only a test scenario, because some timings looked strange to me. I did as berendi suggested: I have a free-running timer and just check the current value. I just wanted to clarify this issue.

However, I checked the disassembly as you suggested and counted roughly 40 + n*4 instructions, which at 32 MHz adds about 1.4 us (without the while loop). I imagine some instructions are not single-cycle, and I didn't bother counting branches... so I guess it actually is as expected.

So, lesson learned: even a single GPIO register write can take around a hundred ns (register write: 3 instructions @ 32 MHz -> 93.75 ns).

Thank you.

BTW: doesn't the LL_ library do the bare minimum (unlike HAL_), without any redundant code? For example:

LL_TIM_SetCounter(TIM6, 0);

is actually:

__STATIC_INLINE void LL_TIM_SetCounter(TIM_TypeDef *TIMx, uint32_t Counter)
{
  WRITE_REG(TIMx->CNT, Counter);
}
 
// where WRITE_REG is:
#define WRITE_REG(REG, VAL)   ((REG) = (VAL))

I would think this is as optimized as it gets. Am I wrong?

berendi
Principal

There are some LL functions that do the bare minimum, some that add a bit of unnecessary processing (like LL_TIM_EnableCounter() above), which matters when things should happen fast, such as in an interrupt handler, and some that are horribly bloated; look at the timer channel (IC/OC/Encoder) configuration in LL.

My question is rather: what is the advantage of writing LL_TIM_EnableCounter() instead of TIMx->CR1 |= TIM_CR1_CEN? Why would someone choose a "library" of poorly documented one-liner functions over an extensively documented register interface, if they indeed do the same?

schperplata
Associate III

My question is rather: what is the advantage of writing LL_TIM_EnableCounter() instead of TIMx->CR1 |= TIM_CR1_CEN? Why would someone choose a "library" of poorly documented one-liner functions over an extensively documented register interface, if they indeed do the same?

Well, that is simple to answer: because LL_TIM_EnableCounter() can be understood by a human, while TIMx->CR1 |= TIM_CR1_CEN is forgotten after 2 minutes :). And there are a bunch of other pros, but I do agree: a bad library is worse than raw register writes.

Anyway, thanks for your help, really appreciate it!