DMA Transfer Complete Flag strange behavior

stryga · 2024-02-22

Hi,

I am using three channels of DMA1 on an STM32G4* to transfer data from the three ADCs to memory. At a certain point I want to make sure that all transfers are done, so I enter a while loop:

uint32_t timeoutCnt = 0;
while(!LL_DMA_IsActiveFlag_TC1(DMA1) || !LL_DMA_IsActiveFlag_TC2(DMA1) || !LL_DMA_IsActiveFlag_TC3(DMA1))
{
  if( ++timeoutCnt > 400 ) // 3 load-and-compare operations per cycle -> roughly guessed 70ns; wait for max 28µs
  {
    // error handling
  }
}
LL_DMA_ClearFlag_TC1(DMA1);
LL_DMA_ClearFlag_TC2(DMA1);
LL_DMA_ClearFlag_TC3(DMA1);
// go on with work...

Usually the DMAs should be done when execution gets here; in rare cases it may take a few hundred ns to complete.

This works most of the time. Sometimes the while loop gets stuck, that is, one of the TC flags never gets set. There is no straightforward way to trigger the failure. Some software builds seem more susceptible, and some hardware boards are more prone to show the error. Most builds and most boards are totally immune and run for days (>>10^9 passes) without a problem. This indicates that some subtle timing problem may be involved. (A different SW build moves the code around, and different HW means that the timing of external interrupts and the xtal frequency are slightly different.)

I can think of 2 reasons:

1. Some other event clears the TC bit before I do my test. As mentioned, in most cases the DMAs are long completed when I check them, so there would be ample time for "something else" to clear them. I just have no idea what "something else" could be.
2. The fast polling of the TC flag (and the bus traffic on AHB1 resulting from this polling) stalls the DMA transfer and actually keeps the TC from becoming set. Some sort of bus deadlock. But ADC is on AHB2 and the bus matrix uses a Round Robin arbitration, so I see no reason why the DMA should become stuck.

Any ideas or hints would be highly appreciated!

Update: I am aware of the global clear bit (CGIFx) in DMA_IFCR, and I am very sure that I never use this bit on DMAs in my code.

TDK · 2024-02-22

Probably a subtle code bug. Unlikely to be a silicon issue. Perhaps the transfer never gets started. Consider using a real timeout timer instead of a software-based one, although I think the way you've written it is okay. Consider increasing the timeout to see if it eventually passes. Consider logging the start of each ADC transfer.
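
A minimal sketch of what a "real timeout timer" could look like, assuming a free-running 32-bit tick counter is available (on a Cortex-M4 that might be DWT->CYCCNT, which is an assumption about the setup and not something stated in this thread). The names `ticks_elapsed`, `wait_until`, `read_counter_fn` and `condition_fn` are hypothetical; function pointers stand in for the hardware reads so the logic can be tried on a host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback types: on target these would wrap a hardware
 * counter read and the combined check of the three TC flags. */
typedef uint32_t (*read_counter_fn)(void);
typedef bool (*condition_fn)(void);

/* True once at least `limit` ticks have passed since `start`.
 * Unsigned subtraction stays correct across counter wraparound. */
bool ticks_elapsed(uint32_t start, uint32_t now, uint32_t limit)
{
    return (uint32_t)(now - start) >= limit;
}

/* Polls `done` until it returns true or `limit_ticks` have elapsed.
 * Returns true on success, false on timeout. */
bool wait_until(condition_fn done, read_counter_fn now, uint32_t limit_ticks)
{
    const uint32_t start = now();
    while (!done()) {
        if (ticks_elapsed(start, now(), limit_ticks)) {
            return false; /* caller decides what error handling means */
        }
    }
    return true;
}
```

If the core runs at 170 MHz, as assumed elsewhere in this thread, the 28 µs budget from the original loop would correspond to 4760 ticks. The benefit over a loop counter is that the limit no longer depends on how the compiler schedules the polling loop.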

"while (!flag);" loops are used all the time; this is probably not going to cause hardware issues. How much bandwidth are the ADCs using?

If you feel a post has answered your question, please click "Accept as Solution".

stryga · 2024-02-22

Thanks for your suggestions. The ADCs are triggered every 60 to 100 µs, the ADC clock is 42 MHz and the sample time is 47.5 cycles, translating to 1.12 µs, so the SAR should have ample time to finish.
Well, in the beginning we had no timeout counter and back then the devices just stalled. So, I am quite sure that the situation never heals on its own.

The ADCs are triggered from a timer through TRGO. I do not modify the timer config after init, so the TRGO should be reliable. The code testing the completion is triggered through the same timer but then -> DMA (other channel but also DMA1) -> SPI -> DMA1 (again other channel) -> TC-interrupt.

Would you be aware of any "interference" between the different channels of DMA1?

waclawek.jan · 2024-02-22

When the interrupt occurs, read out and check/post the content of the ADC and the relevant DMA/DMAMUX registers.

JW

LCE · 2024-02-22

What's the ADC buffer size?
Assuming your count to 400 takes 400 * 3 cycles, that's just about 7 µs at 170 MHz.
Maybe your check sometimes gets called directly after a new DMA transfer was started?

Is it always the same DMA channel that gets "stuck"?

Do you actually do anything where your comment "// error handling" is?

In case of failure I would check which DMA channel is still active, set a flag, break the loop, and so on...
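
A rough sketch of the "check which DMA channel is still active" step, assuming the usual 4-bits-per-channel layout of DMA_ISR (GIFx, TCIFx, HTIFx, TEIFx, in that order from the lowest bit); please verify that against the reference manual for your part. The helper names `tcif_mask` and `channels_missing_tc` are hypothetical. The function works on a single snapshot of the register, so all three channels are judged at the same instant.

```c
#include <stdint.h>

/* Mask of TCIFx within a DMA_ISR value for channel x (1-based),
 * assuming four flag bits per channel with TCIF at offset 1. */
uint32_t tcif_mask(unsigned channel)
{
    return 1u << (4u * (channel - 1u) + 1u);
}

/* Returns a bitmap (bit 0 = channel 1, bit 1 = channel 2, ...) of the
 * channels among the first `num_channels` whose TC flag is NOT set
 * in the given ISR snapshot. */
uint32_t channels_missing_tc(uint32_t isr_snapshot, unsigned num_channels)
{
    uint32_t missing = 0u;
    for (unsigned ch = 1u; ch <= num_channels; ++ch) {
        if ((isr_snapshot & tcif_mask(ch)) == 0u) {
            missing |= 1u << (ch - 1u);
        }
    }
    return missing;
}
```

On timeout, one could read DMA1->ISR once, store the result of `channels_missing_tc(isr, 3)` together with the remaining transfer counts (DMA_CNDTRx) in a log variable, and then break out of the loop.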

TDK · 2024-02-22

> The code testing the completion is triggered through the same timer but then -> DMA (other channel but also DMA1) -> SPI -> DMA1 (again other channel) -> TC-interrupt.

Makes me wonder if you're clearing flags incorrectly in the other channel. Are you doing an improper read-modify-write to clear flags on the SPI side?

Following @waclawek.jan's suggestion would likely show the issue.

stryga · 2024-02-26

Thank you for the input.

Waiting longer doesn't help. We had 2000 wait cycles for some time, no difference.
The logging says that usually all 3 channels are stuck, which is strange. I have to double-check how I log it.
Error handling means going to a safe state as long as we have no clear understanding of what happens here. Technically we could just go on, and the next ADC cycle has good chances to complete without error. Still, "cleverly ignoring" the error doesn't feel like a solution.

stryga · 2024-02-26

Flags are cleared through LL calls to LL_DMA_ClearFlag_TC... Since the DMAs have a dedicated clear register (IFCR), there is no read-modify-write. You just write a 1 at the flag you want to clear; 0s do nothing.

I'll collect more data and come back, thanks so far.
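
To illustrate the write-1-to-clear behaviour described here, a small host-side model follows. It is only a sketch under stated assumptions: four flag bits per channel (GIFx, TCIFx, HTIFx, TEIFx), up to 8 channels, and a write of 1 to CGIFx also wiping the TC, HT and TE flags of that channel (my reading of the reference manual, worth verifying). The function name `apply_ifcr_write` is hypothetical, and the model does not try to reproduce how the hardware derives GIFx from the other flags.

```c
#include <stdint.h>

/* Simplified model: returns the new DMA_ISR value after software
 * writes `ifcr_value` to DMA_IFCR.
 * - A 1 in a flag position clears that flag; a 0 does nothing.
 * - ASSUMPTION: a 1 in a CGIFx position clears all four flags of
 *   channel x (GIF, TC, HT, TE). */
uint32_t apply_ifcr_write(uint32_t isr, uint32_t ifcr_value)
{
    uint32_t clear = ifcr_value;
    for (unsigned ch = 0u; ch < 8u; ++ch) {
        if (ifcr_value & (1u << (4u * ch))) { /* CGIFx written as 1 */
            clear |= 0xFu << (4u * ch);       /* whole channel group */
        }
    }
    return isr & ~clear;
}
```

In this model, writing only the CTCIF1 bit to a snapshot with TC1, TC2 and TC3 set leaves TC2 and TC3 untouched, whereas writing CGIF2 anywhere in the code would remove TC2 along with the other channel 2 flags. That is the kind of "something else" a search for stray global clears would be looking for.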

LCE · 2024-02-26

> "cleverly ignoring" the error doesn't feel like a solution.

Haha, that's a good one, and I'm absolutely with you!