STM32F4: Worst-case latency timer to DMA?
I am using TIM8 to generate a PWM waveform (a /wr signal to an LCD ST7789 controller) and the timer triggers DMA2 to load the next byte onto the PD8-PD15 bus. The 1st byte has to be loaded “manually”.
It works, and the ARR, CCR3 and CCR4 values can be adjusted to get the desired overall period and the data setup and hold times.
With APB2 at 84MHz the TIM8 clock is 168MHz, so the timer resolution is 1/168MHz (about 6ns). I am using ARR=11 (12 ticks) to get a 71ns cycle time, so this thing is running pretty fast. The ST7789 can go down to 66ns.
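As a quick check of the arithmetic above, here is a minimal sketch of the tick-to-time conversion. It assumes PSC=0 and an up-counting timer, as in the init code further down; the function names `wr_cycle_ns` and `min_arr_for_cycle` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* One /WR cycle in nanoseconds for an up-counting timer with PSC = 0:
 * the counter runs 0..ARR, i.e. ARR + 1 ticks per period. */
double wr_cycle_ns(uint32_t arr, double f_tim_hz)
{
    assert(f_tim_hz > 0.0);
    return (double)(arr + 1u) * 1e9 / f_tim_hz;
}

/* Smallest ARR whose cycle time is still >= min_cycle_ns
 * (for example the LCD's minimum write cycle). */
uint32_t min_arr_for_cycle(double min_cycle_ns, double f_tim_hz)
{
    uint32_t arr = 0;
    while (wr_cycle_ns(arr, f_tim_hz) < min_cycle_ns) {
        arr++;
    }
    return arr;
}
```

With a 168MHz timer clock, `wr_cycle_ns(11, 168e6)` comes out at about 71.4ns, and `min_arr_for_cycle(66.0, 168e6)` returns 11, since ARR=10 would give about 65.5ns, just under the 66ns minimum.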
The problem is that roughly once in a million writes, bad data is loaded. It shows up as wrong colour pixels. With extensive debugging, AFAICT, the bad data is not present in the source buffer. What I do see on a scope are very rare faint edges on the data, around the /wr +ve edge, where the data setup/hold time is out of spec. Sometimes the data changes at the same time as the /wr +ve edge!
Using ARR=15 or higher, the problem goes away. But then I get too-slow transfers.
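To put a rough number on the speed cost, here is a sketch of the raw pixel-stream time for one frame. It assumes a full 240x320 redraw at 2 bytes per pixel and ignores the per-burst setup and the address writes that the real routine adds, so it is a lower bound only; `frame_stream_ms` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Time (ms) to clock out one frame at one byte per timer period.
 * Lower bound only: ignores per-burst setup and column/row address writes. */
double frame_stream_ms(uint32_t arr, double f_tim_hz,
                       uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    assert(f_tim_hz > 0.0);
    double cycle_ns = (double)(arr + 1u) * 1e9 / f_tim_hz;
    double bytes = (double)width * (double)height * (double)bytes_per_pixel;
    return bytes * cycle_ns / 1e6;
}
```

Under those assumptions ARR=11 gives about 11.0ms per frame and ARR=15 about 14.6ms, i.e. each frame takes a third longer to stream.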
It looks like there is a very rare huge latency in the DMA transfer, so the data on the PD8-PD15 bus is still the previous value. The correct number of writes takes place however, as only the one pixel (2 bytes, 16 bit RGB) is corrupted. In fact you need a magnifying glass to see the “snow”, on a 240x320 LCD.
To some extent the minimum ARR value needed to remove the “snow” depends on what else is running. DMA1 is also in use, for feeding SPI3, under another RTOS task. Then ETH has its own DMA controller. So there is opportunity for bus contention.
The really funny thing is that the “snow” pixels pop up 3-5x, at roughly 3Hz (on a 20fps display redraw) and then vanish for a few secs, then repeat. In the same place. So whatever is causing this is somehow synchronous to what DMA2 is doing. DMA2 is not used for anything else but DMA1 is.
What is the worst-case latency, and is there anything I can do to improve it? There is a priority setting, but it only arbitrates between streams within DMA2. It looks like very rarely the latency might be tens of ns.
The code is below: the transfer routine first, then the init below that.
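For reference, the latency that the current settings can tolerate can be sketched as below. This assumes the CC4 compare is the DMA request and the next /WR rising edge is at CNT=CCR3 of the following period, as in the init code (ARR=11, CCR3=8, CCR4=9). The setup time passed in is a placeholder to be taken from the ST7789 datasheet, and both function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Timer ticks from the CC4 compare (the DMA request) to the next /WR rising
 * edge, which is at CNT = CCR3 in the following period. Assumes an
 * up-counter with CCR3 < CCR4 <= ARR. */
uint32_t ticks_request_to_latch(uint32_t arr, uint32_t ccr3, uint32_t ccr4)
{
    assert(ccr3 < ccr4 && ccr4 <= arr);
    return (arr + 1u - ccr4) + ccr3;  /* rest of this period + next low phase */
}

/* Longest request-to-pin delay (ns) that still leaves setup_ns of data setup
 * before the latch. setup_ns is a caller-supplied placeholder value. */
double dma_latency_budget_ns(uint32_t arr, uint32_t ccr3, uint32_t ccr4,
                             double f_tim_hz, double setup_ns)
{
    assert(f_tim_hz > 0.0);
    double tick_ns = 1e9 / f_tim_hz;
    return (double)ticks_request_to_latch(arr, ccr3, ccr4) * tick_ns - setup_ns;
}
```

With ARR=11, CCR3=8, CCR4=9 at 168MHz this gives 11 ticks, about 65ns, from request to latch. With an assumed 10ns setup time, a byte is only at risk if the DMA takes more than roughly 55ns to reach the pins. If CCR3 and CCR4 were kept the same at ARR=15, the window would grow to 15 ticks, about 89ns.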
// TIM8 + DMA hardware /WR. CPU is idle during each burst.
//
// PC8 is a GPIO output (idle high) on entry. Flip it to AF3 (TIM8_CH3) for
// the pixel burst, then back to GPIO output afterwards so the next column's
// address setup can bit-bang /WR.
//
// Stream the buffer in <=256-byte bursts (RCR is 8-bit). For each burst:
// prime byte 0 onto PD8-15 (pulse 0 latches it), point DMA at bytes 1..n-1,
// then start TIM8. The CC4 compare event loads each subsequent byte while
// /WR is high so it settles ~54ns before the next latch.
//
// TWO RACE FIXES for the rare wrong/short filled-rectangle line:
// 1. __DSB() after EACH prime write (inside the loop) forces byte 0 onto
// the bus before the timer can latch it - covers byte 0 of every
// <=256 chunk, including the 2nd+ chunks of a >256 transfer.
// 2. After the timer self-stops (CEN clear), WAIT for the DMA transfer-
// complete flag (TCIF7) BEFORE disabling the stream. The timer stopping
// and the DMA draining its last byte are SEPARATE events; tearing the
// stream down on CEN-clear alone can catch the DMA still draining ->
// a whole line comes out short or wrong. This closes that race.
//
// HANG SAFETY: every spin-wait below is bounded by a ~100k iteration guard
// so a missed timer-stop or a DMA that errors (TEIF7/DMEIF7/FEIF7 set, but
// TCIF7 never set) degrades to a dropped/short line instead of hanging the
// LCD task for ever. 100k iterations >> the worst-case 256-byte burst
// (~20us), so the guards never trip in normal operation. The TCIF7 wait
// also exits on any DMA error flag, not just on completion.
LCD_DC_GPIO_PORT->BSRR = LCD_DC_PIN; // DAT: D/C=1
lcd_cs(0);
// PC8 -> AF3 (TIM8_CH3): the timer drives /WR for the DMA burst.
GPIOC->MODER = (GPIOC->MODER & ~(3u << (8*2))) | (2u << (8*2)); // PC8 = AF
uint32_t off = 0;
while (off < size)
{
    uint32_t n = size - off;
    if (n > 256u) n = 256u;
    const uint8_t *b = &outbuf[off];
    // CC4 DMA request OFF while we set up.
    TIM8->DIER &= ~TIM_DIER_CC4DE;
    // Prime byte 0, then a data barrier so the store lands before the
    // timer can latch it (covers byte 0 of THIS chunk).
    *((volatile uint8_t *)(&GPIOD->ODR) + 1) = b[0];
    __DSB();
    // Arm DMA for bytes 1..n-1 (none if n==1). CC4 in period k loads byte k+1.
    if (n > 1u)
    {
        uint32_t guard;
        DMA2_Stream7->CR &= ~DMA_SxCR_EN;
        guard = 100000u;
        while ((DMA2_Stream7->CR & DMA_SxCR_EN) && --guard) { }
        if (guard == 0u) g_lcd_par8_dma_timeout++;
        DMA2->HIFCR = (DMA_HIFCR_CTCIF7 | DMA_HIFCR_CHTIF7 |
                       DMA_HIFCR_CTEIF7 | DMA_HIFCR_CDMEIF7 | DMA_HIFCR_CFEIF7);
        DMA2_Stream7->M0AR = (uint32_t)&b[1];
        DMA2_Stream7->NDTR = n - 1u;
        DMA2_Stream7->CR |= DMA_SxCR_EN;
    }
    // Load RCR (n pulses) via UG with CC4DE OFF, then clear UIF/CC4IF.
    TIM8->RCR = n - 1u;
    TIM8->EGR = TIM_EGR_UG;
    TIM8->SR = 0;
    // Clear any stale CC4 flag FIRST, THEN enable CC4 DMA, then start.
    // (Enabling CC4DE while CC4IF is still set from the previous burst's
    // last period fires an immediate spurious DMA request -> the whole
    // burst shifts by one byte -> a wrong-colour run on that row.)
    TIM8->SR = 0;
    if (n > 1u) TIM8->DIER |= TIM_DIER_CC4DE;
    TIM8->CR1 |= TIM_CR1_CEN;
    // Wait for OPM to self-stop the timer (bounded). This is the guard that
    // also covers the n==1 case (single primed byte, no DMA armed): it
    // ensures the one OPM pulse has latched before teardown / PC8 flip.
    uint32_t guard = 100000u;
    while ((TIM8->CR1 & TIM_CR1_CEN) && --guard) { }
    if (guard == 0u) g_lcd_par8_dma_timeout++;
    // THEN wait for the DMA to finish draining its last byte before tearing
    // the stream down - otherwise the timer-stop vs DMA-drain race can
    // truncate the line. Exit on completion OR any DMA error (so an errored
    // transfer, which never sets TCIF7, cannot hang here). Bounded by guard.
    // (Skip if n==1: no DMA was armed.)
    if (n > 1u)
    {
        uint32_t guard = 100000u;
        while (!(DMA2->HISR & (DMA_HISR_TCIF7 | DMA_HISR_TEIF7 |
                               DMA_HISR_DMEIF7 | DMA_HISR_FEIF7)) && --guard) { }
        if (guard == 0u) g_lcd_par8_dma_timeout++;
        DMA2_Stream7->CR &= ~DMA_SxCR_EN;
    }
    TIM8->DIER &= ~TIM_DIER_CC4DE;
    off += n;
}
// Very rarely the above PWM gen leaves /wr=0 so setting it to 1 below creates a +ve edge.
// Burst(s) done. OPM stops TIM8 at CNT=0, which is in the LOW half of the
// PWM mode-2 cycle, so CH3 leaves /WR (PC8, still in AF) driven LOW. If we
// flipped PC8 to GPIO-output-high here, the AF->GPIO handoff would drive
// /WR low->high while /CS is STILL LOW - a spurious rising edge, i.e. one
// extra pixel latched at the end of every burst (the end-of-transaction
// edges seen on the scope). So deselect FIRST: raise /CS while /WR is still
// (harmlessly) low, THEN bring /WR back to idle-high. The rising edge from
// the mode switch now happens with /CS already high and cannot latch.
lcd_cs(1); // /CS = 1 FIRST (deselect)
GPIOC->BSRR = WR_C_HIGH; // PC8 ODR = 1
__DSB(); // /WR-high store committed before PC8 leaves AF
GPIOC->MODER = (GPIOC->MODER & ~(3u << (8*2))) | (1u << (8*2)); // PC8 -> GPIO output (now follows ODR = high)
__DSB(); // ensure PC8 is GPIO before any following bit-bang
/*
═══════════════════════════════════════════════════════════════════════════════
CC4-triggered DMA: load the next byte AFTER /WR has gone high, so it settles
during the high phase + next low phase and is rock-solid before the next latch.
═══════════════════════════════════════════════════════════════════════════════
PRINCIPLE
CH3 (CCR3) generates /WR via PWM mode 2: /WR low CNT 0..CCR3-1, high CNT CCR3..ARR.
Rising edge (the LATCH) is at CNT = CCR3.
CH4 (CCR4) is a DMA-TRIGGER ONLY channel (no output pin). Set CCR4 a few ticks
AFTER CCR3, so the CC4 event - and thus the DMA load of the next byte - happens
while /WR is already HIGH, a few cycles past the latch edge.
The byte loaded at CNT=CCR4 then has the rest of this period plus the next
period's low phase to settle before the NEXT rising edge at CNT=CCR3. No race.
CC-compare DMA requests are NOT gated by the repetition counter (unlike UPDATE),
so CC4 fires every period -> one byte per period, exactly what we want, while RCR
still bounds the burst length via OPM.
DMA MAPPING: TIM8_CH4 -> DMA2 Stream7, Channel 7.
(TIM8_UP was Stream1 Ch7; CC4 is a different stream. Stream7/Ch7 = TIM8_CH4/TRIG/COM.)
NOTE: PC8 is re-pointed to AF3 here via the raw AFR mux, but its MODER is left
as OUTPUT (set by HAL_GPIO_Init above). The DAT burst flips MODER to AF only
for the pixel stream and back, so the bit-bang init/address setup keep PC8 as
a GPIO output. The slew (OSPEEDR) set above carries over to the AF use.
A lot of time was spent on the ARR, CCR3 and CCR4 values. There is a DMA trigger delay from
the CC4 event to the DMA actually doing the write, so the LCD sometimes latches what is
probably the same byte again, which causes specks on the display. The probability is of the
order of 1 in 100k. One needs to test with USB VCP and ETH all running. There is still
a little bit of snow with these settings but it is not readily visible.
*/
__HAL_RCC_TIM8_CLK_ENABLE();
__HAL_RCC_DMA2_CLK_ENABLE();
// Pre-select AF3 for PC8 in the mux; leave MODER as OUTPUT (set above).
GPIOC->AFR[1] = (GPIOC->AFR[1] & ~(0xFu << ((WR_PIN_POS-8)*4)))
              | (3u << ((WR_PIN_POS-8)*4)); // AF3
// Timer base
TIM8->PSC = 0;
TIM8->ARR = 11; // ARR=11 is 12 ticks ≈ 71ns (min ST7789 cycle is 66ns)
TIM8->RCR = 0; // set per-burst
TIM8->CR1 |= TIM_CR1_OPM; // one-pulse: self-stop after burst
// CH3 = /WR output, PWM mode 2 (low-first). Latch (rising edge) at CNT=CCR3.
TIM8->CCR3 = 8;
TIM8->CCMR2 &= ~TIM_CCMR2_OC3M;
TIM8->CCMR2 |= (7u << TIM_CCMR2_OC3M_Pos); // PWM mode 2
TIM8->CCMR2 |= TIM_CCMR2_OC3PE; // preload
TIM8->CCER |= TIM_CCER_CC3E; // CH3 output enable (drives PC8 in AF)
TIM8->BDTR |= TIM_BDTR_MOE; // main output enable
// CH4 = DMA-trigger only (no output pin). Compare at CCR4 generates the CC4
// event -> CC4 DMA request, loading the next byte while /WR is high.
TIM8->CCR4 = 9; // a few ticks after CCR3 (the latch)
TIM8->CCMR2 &= ~TIM_CCMR2_OC4M; // OC4M=0 (frozen) - still sets CC4IF on match
// (no CC4E - we don't need an output pin, just the compare event/flag)
// DMA requests from CC4 (NOT from update). UDE off; CC4DE on per-burst.
TIM8->DIER &= ~TIM_DIER_UDE;
// CC4DE toggled per-burst in LCD_Transmit_buf_DAT (kept off here).
// DMA2 Stream7 Ch7 = TIM8_CH4. byte->byte, mem-increment, mem->periph, PAR=ODR+1.
DMA2_Stream7->CR = 0;
while (DMA2_Stream7->CR & DMA_SxCR_EN) { }
DMA2_Stream7->PAR = (uint32_t)(&GPIOD->ODR) + 1u; // high byte -> PD8-15
DMA2_Stream7->CR =
      (7u << DMA_SxCR_CHSEL_Pos)  // channel 7 = TIM8_CH4
    | (0u << DMA_SxCR_MSIZE_Pos)  // memory size = byte
    | (0u << DMA_SxCR_PSIZE_Pos)  // periph size = byte
    | DMA_SxCR_MINC               // increment memory
    | (1u << DMA_SxCR_DIR_Pos)    // memory -> peripheral
    | (2u << DMA_SxCR_PL_Pos);    // priority high
TIM8->EGR = TIM_EGR_UG; // load shadow regs (timer idle)
GPIOC->BSRR = WR_C_HIGH; // PC8 = 1 (idle high), stays GPIO output
Start the transfer:
