PHolt.1
Senior
June 22, 2026
Question

STM32F4: Worst-case latency timer to DMA?

  • June 22, 2026
  • 7 replies
  • 60 views

I am using TIM8 to generate a PWM waveform (a /wr signal to an LCD ST7789 controller) and the timer triggers DMA2 to load the next byte onto the PD8-PD15 bus. The 1st byte has to be loaded “manually”.

It works, and the ARR, CCR3 and CCR4 values can be adjusted to get the desired overall period and the data setup and hold times.

With APB2 at 84MHz the TIM8 clock is 168MHz, so the timer resolution is 1/168MHz (about 6ns). I am using ARR=11 (12 ticks) to get a 71ns cycle time, so this thing is running pretty fast. The ST7789 can go down to 66ns.
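As a rough check of these figures, the cycle time and byte rate follow directly from ARR and the timer clock. This is a minimal host-side sketch, assuming the 168MHz timer clock and one byte latched per PWM cycle as described above; the function names are made up for illustration.

```c
#include <stdint.h>

/* Period of one PWM cycle in picoseconds. The counter runs 0..ARR,
 * so one cycle is (ARR + 1) timer ticks. */
static uint64_t pwm_period_ps(uint32_t arr, uint32_t tim_clk_hz)
{
    return ((uint64_t)arr + 1u) * 1000000000000ull / tim_clk_hz;
}

/* One byte is latched per PWM cycle, so bytes/s equals the PWM frequency. */
static uint32_t pwm_bytes_per_sec(uint32_t arr, uint32_t tim_clk_hz)
{
    return tim_clk_hz / (arr + 1u);
}
```

With a 168MHz clock, ARR=11 gives 71428ps (about 71ns, 14.0MB/s) and ARR=15 gives 95238ps (about 95ns, 10.5MB/s).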

The problem is that roughly once in a million writes, bad data is loaded. It shows up as wrong colour pixels. With extensive debugging, AFAICT, the bad data is not present in the source buffer. What I do see on a scope is very rare faint edges on the data, around the /wr +ve edge, on which the data setup/hold time is out of spec. Sometimes the data changes at the same time as the /wr +ve edge!

Using ARR=15 or higher, the problem goes away. But then I get too-slow transfers.

It looks like there is a very rare huge latency in the DMA transfer, so the data on the PD8-PD15 bus is still the previous value. The correct number of writes takes place however, as only the one pixel (2 bytes, 16 bit RGB) is corrupted. In fact you need a magnifying glass to see the “snow”, on a 240x320 LCD.

To some extent the minimum ARR value needed to remove the “snow” depends on what else is running. DMA1 is also in use, for feeding SPI3, under another RTOS task. Then ETH has its own DMA controller. So there is opportunity for bus contention.

The really funny thing is that the “snow” pixels pop up 3-5x, at roughly 3Hz (on a 20fps display redraw) and then vanish for a few secs, then repeat. In the same place. So whatever is causing this is somehow synchronous to what DMA2 is doing. DMA2 is not used for anything else but DMA1 is.

What is the worst-case latency, and is there anything I can do to improve it? There is a priority setting but it applies only within DMA2 channels. It looks like very rarely the latency might be tens of ns.

The code is below: the transfer start comes first, and the init is below that.

	// TIM8 + DMA hardware /WR. CPU is idle during each burst.
//
// PC8 is a GPIO output (idle high) on entry. Flip it to AF3 (TIM8_CH3) for
// the pixel burst, then back to GPIO output afterwards so the next column's
// address setup can bit-bang /WR.
//
// Stream the buffer in <=256-byte bursts (RCR is 8-bit). For each burst:
// prime byte 0 onto PD8-15 (pulse 0 latches it), point DMA at bytes 1..n-1,
// then start TIM8. The CC4 compare event loads each subsequent byte while
// /WR is high so it settles ~54ns before the next latch.
//
// TWO RACE FIXES for the rare wrong/short filled-rectangle line:
// 1. __DSB() after EACH prime write (inside the loop) forces byte 0 onto
// the bus before the timer can latch it - covers byte 0 of every
// <=256 chunk, including the 2nd+ chunks of a >256 transfer.
// 2. After the timer self-stops (CEN clear), WAIT for the DMA transfer-
// complete flag (TCIF7) BEFORE disabling the stream. The timer stopping
// and the DMA draining its last byte are SEPARATE events; tearing the
// stream down on CEN-clear alone can catch the DMA still draining ->
// a whole line comes out short or wrong. This closes that race.
//
// HANG SAFETY: every spin-wait below is bounded by a ~100k iteration guard
// so a missed timer-stop or a DMA that errors (TEIF7/DMEIF7/FEIF7 set, but
// TCIF7 never set) degrades to a dropped/short line instead of hanging the
// LCD task for ever. 100k iterations >> the worst-case 256-byte burst
// (~20us), so the guards never trip in normal operation. The TCIF7 wait
// also exits on any DMA error flag, not just on completion.
LCD_DC_GPIO_PORT->BSRR = LCD_DC_PIN; // DAT: D/C=1
lcd_cs(0);

// PC8 -> AF3 (TIM8_CH3): the timer drives /WR for the DMA burst.
GPIOC->MODER = (GPIOC->MODER & ~(3u << (8*2))) | (2u << (8*2)); // PC8 = AF

uint32_t off = 0;
while (off < size)
{
uint32_t n = size - off;
if (n > 256u) n = 256u;
const uint8_t *b = &outbuf[off];

// CC4 DMA request OFF while we set up.
TIM8->DIER &= ~TIM_DIER_CC4DE;

// Prime byte 0, then a data barrier so the store lands before the
// timer can latch it (covers byte 0 of THIS chunk).
*((volatile uint8_t *)(&GPIOD->ODR) + 1) = b[0];
__DSB();

// Arm DMA for bytes 1..n-1 (none if n==1). CC4 in period k loads byte k+1.
if (n > 1u)
{
uint32_t guard;
DMA2_Stream7->CR &= ~DMA_SxCR_EN;
guard = 100000u;
while ((DMA2_Stream7->CR & DMA_SxCR_EN) && --guard) { }
if (guard == 0u) g_lcd_par8_dma_timeout++;
DMA2->HIFCR = (DMA_HIFCR_CTCIF7 | DMA_HIFCR_CHTIF7 |
DMA_HIFCR_CTEIF7 | DMA_HIFCR_CDMEIF7 | DMA_HIFCR_CFEIF7);
DMA2_Stream7->M0AR = (uint32_t)&b[1];
DMA2_Stream7->NDTR = n - 1u;
DMA2_Stream7->CR |= DMA_SxCR_EN;
}

// Load RCR (n pulses) via UG with CC4DE OFF, then clear UIF/CC4IF.
TIM8->RCR = n - 1u;
TIM8->EGR = TIM_EGR_UG;
TIM8->SR = 0;

// Clear any stale CC4 flag FIRST, THEN enable CC4 DMA, then start.
// (Enabling CC4DE while CC4IF is still set from the previous burst's
// last period fires an immediate spurious DMA request -> the whole
// burst shifts by one byte -> a wrong-colour run on that row.)
TIM8->SR = 0;
if (n > 1u) TIM8->DIER |= TIM_DIER_CC4DE;
TIM8->CR1 |= TIM_CR1_CEN;

// Wait for OPM to self-stop the timer (bounded). This is the guard that
// also covers the n==1 case (single primed byte, no DMA armed): it
// ensures the one OPM pulse has latched before teardown / PC8 flip.

uint32_t guard = 100000u;
while ((TIM8->CR1 & TIM_CR1_CEN) && --guard) { }
if (guard == 0u) g_lcd_par8_dma_timeout++;

// THEN wait for the DMA to finish draining its last byte before tearing
// the stream down - otherwise the timer-stop vs DMA-drain race can
// truncate the line. Exit on completion OR any DMA error (so an errored
// transfer, which never sets TCIF7, cannot hang here). Bounded by guard.
// (Skip if n==1: no DMA was armed.)
if (n > 1u)
{
uint32_t guard = 100000u;
while (!(DMA2->HISR & (DMA_HISR_TCIF7 | DMA_HISR_TEIF7 |
DMA_HISR_DMEIF7 | DMA_HISR_FEIF7)) && --guard) { }
if (guard == 0u) g_lcd_par8_dma_timeout++;
DMA2_Stream7->CR &= ~DMA_SxCR_EN;
}

TIM8->DIER &= ~TIM_DIER_CC4DE;

off += n;
}

// Very rarely the above PWM gen leaves /wr=0 so setting it to 1 below creates a +ve edge.
// Burst(s) done. OPM stops TIM8 at CNT=0, which is in the LOW half of the
// PWM mode-2 cycle, so CH3 leaves /WR (PC8, still in AF) driven LOW. If we
// flipped PC8 to GPIO-output-high here, the AF->GPIO handoff would drive
// /WR low->high while /CS is STILL LOW - a spurious rising edge, i.e. one
// extra pixel latched at the end of every burst (the end-of-transaction
// edges seen on the scope). So deselect FIRST: raise /CS while /WR is still
// (harmlessly) low, THEN bring /WR back to idle-high. The rising edge from
// the mode switch now happens with /CS already high and cannot latch.
lcd_cs(1); // /CS = 1 FIRST (deselect)

GPIOC->BSRR = WR_C_HIGH; // PC8 ODR = 1
__DSB(); // /WR-high store committed before PC8 leaves AF
GPIOC->MODER = (GPIOC->MODER & ~(3u << (8*2))) | (1u << (8*2)); // PC8 -> GPIO output (now follows ODR = high)
__DSB(); // ensure PC8 is GPIO before any following bit-bang
		/*
═══════════════════════════════════════════════════════════════════════════════
CC4-triggered DMA: load the next byte AFTER /WR has gone high, so it settles
during the high phase + next low phase and is rock-solid before the next latch.
═══════════════════════════════════════════════════════════════════════════════

PRINCIPLE
CH3 (CCR3) generates /WR via PWM mode 2: /WR low CNT 0..CCR3-1, high CNT CCR3..ARR.
Rising edge (the LATCH) is at CNT = CCR3.
CH4 (CCR4) is a DMA-TRIGGER ONLY channel (no output pin). Set CCR4 a few ticks
AFTER CCR3, so the CC4 event - and thus the DMA load of the next byte - happens
while /WR is already HIGH, a few cycles past the latch edge.
The byte loaded at CNT=CCR4 then has the rest of this period plus the next
period's low phase to settle before the NEXT rising edge at CNT=CCR3. No race.

CC-compare DMA requests are NOT gated by the repetition counter (unlike UPDATE),
so CC4 fires every period -> one byte per period, exactly what we want, while RCR
still bounds the burst length via OPM.

DMA MAPPING: TIM8_CH4 -> DMA2 Stream7, Channel 7.
(TIM8_UP was Stream1 Ch7; CC4 is a different stream. Stream7/Ch7 = TIM8_CH4/TRIG/COM.)

NOTE: PC8 is re-pointed to AF3 here via the raw AFR mux, but its MODER is left
as OUTPUT (set by HAL_GPIO_Init above). The DAT burst flips MODER to AF only
for the pixel stream and back, so the bit-bang init/address setup keep PC8 as
a GPIO output. The slew (OSPEEDR) set above carries over to the AF use.

A lot of time was spent on the ARR, CCR3 and CCR4 values. There is a DMA trigger delay from
CCR4 to the DMA actually doing it, causing it to sometimes read what is probably the
same byte again, causing specks on the display. The probability is of the order of
1 in 100k. One needs to test with USB VCP and ETH all running. There is still
a little bit of snow with these settings but not readily visible.
*/

__HAL_RCC_TIM8_CLK_ENABLE();
__HAL_RCC_DMA2_CLK_ENABLE();

// Pre-select AF3 for PC8 in the mux; leave MODER as OUTPUT (set above).
GPIOC->AFR[1] = (GPIOC->AFR[1] & ~(0xFu << ((WR_PIN_POS-8)*4)))
| (3u << ((WR_PIN_POS-8)*4)); // AF3

// Timer base
TIM8->PSC = 0;
TIM8->ARR = 11; // ARR=11 is 12 ticks ≈ 71ns (min ST7789 cycle is 66ns)
TIM8->RCR = 0; // set per-burst

TIM8->CR1 |= TIM_CR1_OPM; // one-pulse: self-stop after burst

// CH3 = /WR output, PWM mode 2 (low-first). Latch (rising edge) at CNT=CCR3.
TIM8->CCR3 = 8;
TIM8->CCMR2 &= ~TIM_CCMR2_OC3M;
TIM8->CCMR2 |= (7u << TIM_CCMR2_OC3M_Pos); // PWM mode 2
TIM8->CCMR2 |= TIM_CCMR2_OC3PE; // preload
TIM8->CCER |= TIM_CCER_CC3E; // CH3 output enable (drives PC8 in AF)
TIM8->BDTR |= TIM_BDTR_MOE; // main output enable

// CH4 = DMA-trigger only (no output pin). Compare at CCR4 generates the CC4
// event -> CC4 DMA request, loading the next byte while /WR is high.
TIM8->CCR4 = 9; // a few ticks after CCR3 (the latch)
TIM8->CCMR2 &= ~TIM_CCMR2_OC4M; // OC4M=0 (frozen) - still sets CC4IF on match
// (no CC4E - we don't need an output pin, just the compare event/flag)

// DMA requests from CC4 (NOT from update). UDE off; CC4DE on per-burst.
TIM8->DIER &= ~TIM_DIER_UDE;
// CC4DE toggled per-burst in LCD_Transmit_buf_DAT (kept off here).

// DMA2 Stream7 Ch7 = TIM8_CH4. byte->byte, mem-increment, mem->periph, PAR=ODR+1.
DMA2_Stream7->CR = 0;
while (DMA2_Stream7->CR & DMA_SxCR_EN) { }
DMA2_Stream7->PAR = (uint32_t)(&GPIOD->ODR) + 1u; // high byte -> PD8-15
DMA2_Stream7->CR =
(7u << DMA_SxCR_CHSEL_Pos) // channel 7 = TIM8_CH4
| (0u << DMA_SxCR_MSIZE_Pos) // memory size = byte
| (0u << DMA_SxCR_PSIZE_Pos) // periph size = byte
| DMA_SxCR_MINC // increment memory
| (1u << DMA_SxCR_DIR_Pos) // memory -> peripheral
| (2u << DMA_SxCR_PL_Pos); // priority high

TIM8->EGR = TIM_EGR_UG; // load shadow regs (timer idle)
GPIOC->BSRR = WR_C_HIGH; // PC8 = 1 (idle high), stays GPIO output


7 replies

waclawek.jan
Super User
June 22, 2026

> What is the worst-case latency

  1. AN4031 discusses the latency within the dual-port DMA, although the details are not crystal clear. Assuming memory-side loads are fast enough (see 3), and assuming no other DMA Stream is active within that DMA, I’d say, 3-5 cycles from TIM8_CH4’s “active” (CC) edge to the moment where DMA starts to output data onto the bus matrix.
  2. There’s probably one cycle of arbitration at the bus matrix, if the given target bus (where GPIO sits, i.e. AHB1) is free. But it may not be. Look what else is at AHB1:

Note that these are the “register interfaces”, not the “data interfaces”, in the case of ETH, DMA2D and OTG HS. If the processor, or any other master (e.g. DMA1, but that’s unlikely), accesses any of those modules’ registers, that may create a conflict and DMA2 may need to wait. Note that the DMA’s own registers are there too, i.e. the processor polling the DMA’s status register may create such a delay as well. Also, some ETH registers are in a different clock domain than AHB, so reading those may (or may not, depending on the particular implementation, and I am not sure how well this is documented) impose somewhat more than just one or two cycles of delay.

  3. You appear to use Direct mode of DMA (i.e. no FIFO), byte wide. That means that after the DMA outputs a byte to its Peripheral port, i.e. towards the GPIO, it starts to read in the next byte from the Memory port, presumably from SRAM. If there is serious bus contention there, it may have an impact. (For example, ETH may be set up to read/write data in bursts, both “real” data and the scatter-gather descriptors, and that means it occupies the given memory/bus for several cycles. DMA2D also uses 8-beat bus cycles, as we’ve learned just today.) This is not part of the timer-to-DMA-to-GPIO latency per se, but adds to the “total turnaround time” for a given byte.

The simplest thing to do is to enable DMA FIFO and observe if the problem goes away. After enabling DMA I’d wait a tad bit (perhaps setting up TIM8 registers meantime) before enabling TIM8, to allow the FIFO to fill up.
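A minimal sketch of what enabling the FIFO could look like at register level. It assumes the DMA_SxFCR layout from the STM32F4 reference manual (FTH in bits 1:0, DMDIS in bit 2); the macro and function names are local stand-ins so the snippet compiles on a host, not the CMSIS ones.

```c
#include <stdint.h>

/* Bit positions as in the STM32F4 DMA_SxFCR register; local names
 * (assumed for this sketch) so it builds without the device headers. */
#define FCR_FTH_MASK  0x3u        /* FTH[1:0]: FIFO threshold select */
#define FCR_DMDIS     (1u << 2)   /* direct mode disable = FIFO on   */

enum fifo_threshold {
    FTH_QUARTER = 0, FTH_HALF = 1, FTH_3QUARTER = 2, FTH_FULL = 3
};

/* Return the FCR value with FIFO mode on and the given threshold,
 * leaving all other bits of the current value untouched. */
static uint32_t fcr_with_fifo(uint32_t fcr, enum fifo_threshold fth)
{
    fcr &= ~FCR_FTH_MASK;
    fcr |= FCR_DMDIS | (uint32_t)fth;
    return fcr;
}
```

On the target this would presumably be used as `DMA2_Stream7->FCR = fcr_with_fifo(DMA2_Stream7->FCR, FTH_FULL);` while the stream is still disabled, before setting EN and then starting TIM8.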

If this won’t help, 2. is probably the problem, so then at least as an experiment I’d avoid accessing any registers from AHB1 for the duration of DMA transfer (which then also shouldn’t be polled but handled in its TC interrupt).

I still don’t understand, why don’t you simply use FSMC, as I’ve recommended elsewhere.

JW

PHolt.1 (Author)
Senior
June 22, 2026

 Thank you Jan :)

FIFO enabled now, excellent point, but no change.

DMA-end polling was indeed flat out so it has been spread out to once every 2us, no change.

I am using my 3034 scope with a qualified trigger: a +ve /wr clock edge followed within <10ns by a data edge (a data hold time failure, which should never happen). I am seeing lots of weird stuff depending on the ARR, CCR3 and CCR4 values. ARR is the timer cycle time, CCR3 is the +ve /wr edge, and CCR4 is the DMA trigger. Any ARR below 15 (in 168MHz periods) produces snow. I can’t trigger the scope on “snow”, but this qualified trigger shows one event where DMA was not triggered but then got triggered with a very short latency, and the rightmost case shows a DMA trigger latency some 20-30ns later than normally happens.

The ARR CCR3 CCR4 values for the above are 15 6 5. In theory CCR4 should be >= CCR3, but CCR3 produces the /WR +ve edge instantly whereas the DMA transfer takes much longer, so I am playing with different (shorter) values for CCR4.

Unless I am missing something, there are real problems with timers triggering DMA. You can basically forget anything below say 50ns. But the DMA latency can be ~zero, especially if it missed the preceding trigger.

But it works (11 5 5 produces no scope triggers) and for driving an LCD whose display is totally redrawn at 20fps+ the snow, visible only with a magnifier, can be tolerated.

Actually I am not even sure the snow is caused by these DMA latency artefacts because I see snow even where the scope is never triggered (by a hold time violation).

The fact that with ARR=15+ I see no snow proves this issue is not caused by some other task corrupting the data bus. I spent a lot of time on that :)

Re your question about FSMC, I don’t have the GPIO left over. I have only PD8-PD15, with /WR coming out of PC8, on a QFP100. If I went to QFP144 and reassigned Ports D and E, then D and E could come out on the FSMC. In fact PD8-PD15 have a load of LEDs on them but they don’t mind being driven at 10MHz+ :)

PHolt.1 (Author)
Senior
June 22, 2026

I typed a reply and it got blocked, for mod approval :)

If the message does not eventually appear I will have to retype it.

But in the meantime I have noted one thing: there are no scope artefacts (e.g. hold time violations) or LCD corruption on short DMA transfers. It needs something like 50 bytes minimum before it breaks. It is as if some error accumulated during the transfer. Incredible.

waclawek.jan
Super User
June 23, 2026

The “DMA not triggered” followed by “DMA triggered earlier” interpretation is not right. In fact, the “DMA triggered earlier” transfer is the one that was supposed to be triggered by the “DMA not triggered” edge, with a latency of almost a whole ARR cycle. If ARR is 11, that latency here may be 8-9 cycles, which is very much in line with the theory that something occupies some of the buses for some 8 cycles, and that may very well be an 8-beat DMA burst from some other master (ETH, DMA2D, OTG-HS) into the same RAM.

So, maybe putting the LCD buffer to a different RAM could help.
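To put numbers on how little slack such a latency leaves, here is a small host-side sketch of the timing budget. It only covers the layout from the init code in the first post (CCR4 at or after CCR3, counter running 0..ARR), and the function names are hypothetical.

```c
#include <stdint.h>

/* Ticks from the CC4 DMA trigger (CNT == CCR4) to the next /WR rising
 * edge (CNT == CCR3 in the following period).
 * Precondition for this sketch: ccr3 <= ccr4 <= arr. */
static uint32_t ticks_trigger_to_latch(uint32_t arr, uint32_t ccr3,
                                       uint32_t ccr4)
{
    uint32_t period = arr + 1u;          /* counter runs 0..ARR */
    return period - (ccr4 - ccr3);
}

/* Setup margin in ticks once the DMA write lands `latency` ticks after
 * the trigger. Zero or negative means the byte arrives at or after
 * the latch edge it was meant for. */
static int32_t setup_margin_ticks(uint32_t arr, uint32_t ccr3,
                                  uint32_t ccr4, uint32_t latency)
{
    return (int32_t)ticks_trigger_to_latch(arr, ccr3, ccr4)
         - (int32_t)latency;
}
```

With ARR=11, CCR3=8, CCR4=9 there are 11 ticks between trigger and latch; a 9-cycle latency leaves 2 ticks (about 12ns at 168MHz), while the same latency with ARR=15 leaves 6 ticks.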

---

> It needs something like 50 bytes minimum before it breaks.

And without FIFO?

With DMA FIFO on, you can also change the memory-side transfer width to word. The addresses and sizes have to be aligned (i.e. multiples of 4), too, but I believe that’s not an issue with the usual LCDs.
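A sketch of how the stream configuration from the first post might change for a word-wide memory side. The field positions are those of DMA_SxCR in the STM32F4 reference manual; the macro and function names are local stand-ins chosen for this example, and the FIFO still has to be enabled separately in the FCR register.

```c
#include <stdint.h>

/* Field positions as in STM32F4 DMA_SxCR; local names for a host build. */
#define CR_DIR_POS    6u
#define CR_MINC       (1u << 10)
#define CR_PSIZE_POS  11u
#define CR_MSIZE_POS  13u
#define CR_PL_POS     16u
#define CR_CHSEL_POS  25u

/* Build the Stream7 CR value used in the init code, with a selectable
 * memory width: msize 0 = byte, 1 = half-word, 2 = word.
 * The peripheral side stays byte wide (PD8-15). */
static uint32_t stream7_cr(uint32_t msize)
{
    return (7u << CR_CHSEL_POS)      /* channel 7 = TIM8_CH4  */
         | (msize << CR_MSIZE_POS)   /* memory width          */
         | (0u << CR_PSIZE_POS)      /* peripheral width byte */
         | CR_MINC                   /* increment memory      */
         | (1u << CR_DIR_POS)        /* memory -> peripheral  */
         | (2u << CR_PL_POS);        /* priority high         */
}
```

`stream7_cr(0)` reproduces the original byte-wide setting and `stream7_cr(2)` differs only in the MSIZE field.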

---

Also, you did mention you do use ETH. Have you tried to disable it, to see impact on display?

If it helps, check content of ETH_DMABMR, you may want to play with PBL/RDP/FB/USP/PFM bits/bitfields.

JW

PHolt.1 (Author)
Senior
June 24, 2026

With no ETH in the build, the snow is the same, regardless of USB activity.

So I doubt if the snow is the same problem as the above DMA latency illustration, because the snow appears to be very frequent - probably of the order of tens of funny pixels popping up somewhere, each second. But the above DMA latency is much more rare - of the order of once every few mins.

With no ETH in the build I cannot see any cases of the DMA latency, no matter what I do on USB. But as I say above, same snow.

With ETH in the build and a fair bit of ETH traffic I can see the latency problem but it can be once in some minutes. This is with the FIFO; the FIFO did not seem to change anything.

Having a lot of USB VCP traffic (I cannot easily build without USB in my current setup) does not seem to have any effect. This is USB FS, not HS, BTW.

So I think they are two different things, and I cannot see how I could trigger the scope on the “snow”.

I was wrong about the snow never appearing within the first say 50 bytes of a DMA transfer. It does, but much more rarely.

I can’t really feed the DMA from another bit of RAM because in my project the RAM is basically the “classic” layout, i.e. BSS, stack, and the heap between the two. The CCM (which on the 32F4 is not DMA-accessible anyway) is reserved for RTOS stacks; that decision could be argued both ways but I find the 64k CCM is well sized for all my FreeRTOS stacks, with some 20-50% free depending on the build.

It is possibly true that enabling the DMA FIFO did enable the snow to appear more randomly over the display. Previously it wasn’t doing that.

If the snow is unchanged with ETH entirely removed from the build, then superficially it appears that ETH-related bus contention is not the cause of the snow, and playing with ETH_DMABMR (PBL/RDP/FB/USP/PFM — those tune the ETH DMA's burst length and behaviour) will do nothing for the snow. BUT I can trigger the scope on the ETH-related probable long DMA latency, so I will work on that. The two may be related in some subtle way, or something else (other than ETH) may be grabbing the bus.

Changing
dmainit.TxDMABurstLength = ETH_TXDMABURSTLENGTH_32BEAT;
to
dmainit.TxDMABurstLength = ETH_TXDMABURSTLENGTH_1BEAT;
(for both TX and RX)
does not (as expected) change the snow but does almost remove the DMA latency event. It was the ETH_RX… that made the difference, not the ETH_TX… one.
Stepping back up (doing both TX and RX):
ETH_TXDMABURSTLENGTH_16BEAT - latency events
ETH_TXDMABURSTLENGTH_8BEAT - latency events
ETH_TXDMABURSTLENGTH_4BEAT - latency events
ETH_TXDMABURSTLENGTH_2BEAT - rare latency events
ETH_TXDMABURSTLENGTH_1BEAT - almost no latency events; this value chosen for now
I have no particular ETH performance requirements; the TX is effectively polled (LWIP sends a packet to ETH when it has one from the socket application) and the RX is polled in an RTOS task to ensure that a rogue device on the LAN cannot hang the product, limiting the speed to a few hundred kbytes/sec which is more than enough.
But the burst length is not a complete solution.
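To confirm which burst lengths actually ended up in the hardware, the ETH_DMABMR value can be decoded. The sketch below assumes the field layout given in the STM32F4 reference manual (PBL in bits 13:8, RDP in bits 22:17, USP in bit 23, 4xPBL mode in bit 24); the macro and function names are invented for this example and worth checking against the manual.

```c
#include <stdint.h>

#define BMR_PBL_POS  8u          /* PBL[5:0]: Tx (or shared) burst length */
#define BMR_RDP_POS  17u         /* RDP[5:0]: Rx burst length if USP = 1  */
#define BMR_USP      (1u << 23)  /* use separate PBL for Rx and Tx        */
#define BMR_FPM      (1u << 24)  /* 4xPBL mode: both lengths times 4      */

/* Effective Tx DMA burst length in beats. */
static uint32_t eth_tx_burst_beats(uint32_t bmr)
{
    uint32_t beats = (bmr >> BMR_PBL_POS) & 0x3Fu;
    return (bmr & BMR_FPM) ? beats * 4u : beats;
}

/* Effective Rx DMA burst length in beats: RDP when USP is set,
 * otherwise the shared PBL field. */
static uint32_t eth_rx_burst_beats(uint32_t bmr)
{
    uint32_t field = (bmr & BMR_USP) ? ((bmr >> BMR_RDP_POS) & 0x3Fu)
                                     : ((bmr >> BMR_PBL_POS) & 0x3Fu);
    return (bmr & BMR_FPM) ? field * 4u : field;
}
```

On the target, the argument would be the value read back from `heth->Instance->DMABMR` after initialisation.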

If you have any ideas on the snow, I am all ears :)

The code below is the ST code used to initialise ETH DMA

/**
* @brief Configures Ethernet MAC and DMA with default parameters.
* @param heth pointer to a ETH_HandleTypeDef structure that contains
* the configuration information for ETHERNET module
* @param err Ethernet Init error
* @retval HAL status
*/
static void ETH_MACDMAConfig(ETH_HandleTypeDef *heth, uint32_t err)
{
ETH_MACInitTypeDef macinit;
ETH_DMAInitTypeDef dmainit;
uint32_t tmpreg1 = 0U;

if (err != ETH_SUCCESS) /* Auto-negotiation failed */
{
/* Set Ethernet duplex mode to Full-duplex */
(heth->Init).DuplexMode = ETH_MODE_FULLDUPLEX;

/* Set Ethernet speed to 100M */
(heth->Init).Speed = ETH_SPEED_100M;
}

/* Ethernet MAC default initialization **************************************/
macinit.Watchdog = ETH_WATCHDOG_ENABLE;
macinit.Jabber = ETH_JABBER_ENABLE;
macinit.InterFrameGap = ETH_INTERFRAMEGAP_96BIT;
macinit.CarrierSense = ETH_CARRIERSENCE_ENABLE;
macinit.ReceiveOwn = ETH_RECEIVEOWN_ENABLE;
macinit.LoopbackMode = ETH_LOOPBACKMODE_DISABLE;
if(heth->Init.ChecksumMode == ETH_CHECKSUM_BY_HARDWARE)
{
macinit.ChecksumOffload = ETH_CHECKSUMOFFLAOD_ENABLE;
}
else
{
macinit.ChecksumOffload = ETH_CHECKSUMOFFLAOD_DISABLE;
}
macinit.RetryTransmission = ETH_RETRYTRANSMISSION_DISABLE;
macinit.AutomaticPadCRCStrip = ETH_AUTOMATICPADCRCSTRIP_DISABLE;
macinit.BackOffLimit = ETH_BACKOFFLIMIT_10;
macinit.DeferralCheck = ETH_DEFFERRALCHECK_DISABLE;
macinit.ReceiveAll = ETH_RECEIVEAll_DISABLE;
macinit.SourceAddrFilter = ETH_SOURCEADDRFILTER_DISABLE;
macinit.PassControlFrames = ETH_PASSCONTROLFRAMES_BLOCKALL;
macinit.BroadcastFramesReception = ETH_BROADCASTFRAMESRECEPTION_ENABLE;
macinit.DestinationAddrFilter = ETH_DESTINATIONADDRFILTER_NORMAL;
macinit.PromiscuousMode = ETH_PROMISCUOUS_MODE_DISABLE;
macinit.MulticastFramesFilter = ETH_MULTICASTFRAMESFILTER_PERFECT;
macinit.UnicastFramesFilter = ETH_UNICASTFRAMESFILTER_PERFECT;
macinit.HashTableHigh = 0x0U;
macinit.HashTableLow = 0x0U;
macinit.PauseTime = 0x0U;
macinit.ZeroQuantaPause = ETH_ZEROQUANTAPAUSE_DISABLE;
macinit.PauseLowThreshold = ETH_PAUSELOWTHRESHOLD_MINUS4;
macinit.UnicastPauseFrameDetect = ETH_UNICASTPAUSEFRAMEDETECT_DISABLE;
macinit.ReceiveFlowControl = ETH_RECEIVEFLOWCONTROL_DISABLE;
macinit.TransmitFlowControl = ETH_TRANSMITFLOWCONTROL_DISABLE;
macinit.VLANTagComparison = ETH_VLANTAGCOMPARISON_16BIT;
macinit.VLANTagIdentifier = 0x0U;

/*------------------------ ETHERNET MACCR Configuration --------------------*/
/* Get the ETHERNET MACCR value */
tmpreg1 = (heth->Instance)->MACCR;
/* Clear WD, PCE, PS, TE and RE bits */
tmpreg1 &= ETH_MACCR_CLEAR_MASK;
/* Set the WD bit according to ETH Watchdog value */
/* Set the JD: bit according to ETH Jabber value */
/* Set the IFG bit according to ETH InterFrameGap value */
/* Set the DCRS bit according to ETH CarrierSense value */
/* Set the FES bit according to ETH Speed value */
/* Set the DO bit according to ETH ReceiveOwn value */
/* Set the LM bit according to ETH LoopbackMode value */
/* Set the DM bit according to ETH Mode value */
/* Set the IPCO bit according to ETH ChecksumOffload value */
/* Set the DR bit according to ETH RetryTransmission value */
/* Set the ACS bit according to ETH AutomaticPadCRCStrip value */
/* Set the BL bit according to ETH BackOffLimit value */
/* Set the DC bit according to ETH DeferralCheck value */
tmpreg1 |= (uint32_t)(macinit.Watchdog |
macinit.Jabber |
macinit.InterFrameGap |
macinit.CarrierSense |
(heth->Init).Speed |
macinit.ReceiveOwn |
macinit.LoopbackMode |
(heth->Init).DuplexMode |
macinit.ChecksumOffload |
macinit.RetryTransmission |
macinit.AutomaticPadCRCStrip |
macinit.BackOffLimit |
macinit.DeferralCheck);

/* Write to ETHERNET MACCR */
(heth->Instance)->MACCR = (uint32_t)tmpreg1;

/* Wait until the write operation will be taken into account:
at least four TX_CLK/RX_CLK clock cycles */
tmpreg1 = (heth->Instance)->MACCR;
HAL_Delay(ETH_REG_WRITE_DELAY);
(heth->Instance)->MACCR = tmpreg1;

/*----------------------- ETHERNET MACFFR Configuration --------------------*/
/* Set the RA bit according to ETH ReceiveAll value */
/* Set the SAF and SAIF bits according to ETH SourceAddrFilter value */
/* Set the PCF bit according to ETH PassControlFrames value */
/* Set the DBF bit according to ETH BroadcastFramesReception value */
/* Set the DAIF bit according to ETH DestinationAddrFilter value */
/* Set the PR bit according to ETH PromiscuousMode value */
/* Set the PM, HMC and HPF bits according to ETH MulticastFramesFilter value */
/* Set the HUC and HPF bits according to ETH UnicastFramesFilter value */
/* Write to ETHERNET MACFFR */
(heth->Instance)->MACFFR = (uint32_t)(macinit.ReceiveAll |
macinit.SourceAddrFilter |
macinit.PassControlFrames |
macinit.BroadcastFramesReception |
macinit.DestinationAddrFilter |
macinit.PromiscuousMode |
macinit.MulticastFramesFilter |
macinit.UnicastFramesFilter);

/* Wait until the write operation will be taken into account:
at least four TX_CLK/RX_CLK clock cycles */
tmpreg1 = (heth->Instance)->MACFFR;
HAL_Delay(ETH_REG_WRITE_DELAY);
(heth->Instance)->MACFFR = tmpreg1;

/*--------------- ETHERNET MACHTHR and MACHTLR Configuration --------------*/
/* Write to ETHERNET MACHTHR */
(heth->Instance)->MACHTHR = (uint32_t)macinit.HashTableHigh;

/* Write to ETHERNET MACHTLR */
(heth->Instance)->MACHTLR = (uint32_t)macinit.HashTableLow;
/*----------------------- ETHERNET MACFCR Configuration -------------------*/

/* Get the ETHERNET MACFCR value */
tmpreg1 = (heth->Instance)->MACFCR;
/* Clear xx bits */
tmpreg1 &= ETH_MACFCR_CLEAR_MASK;

/* Set the PT bit according to ETH PauseTime value */
/* Set the DZPQ bit according to ETH ZeroQuantaPause value */
/* Set the PLT bit according to ETH PauseLowThreshold value */
/* Set the UP bit according to ETH UnicastPauseFrameDetect value */
/* Set the RFE bit according to ETH ReceiveFlowControl value */
/* Set the TFE bit according to ETH TransmitFlowControl value */
tmpreg1 |= (uint32_t)((macinit.PauseTime << 16U) |
macinit.ZeroQuantaPause |
macinit.PauseLowThreshold |
macinit.UnicastPauseFrameDetect |
macinit.ReceiveFlowControl |
macinit.TransmitFlowControl);

/* Write to ETHERNET MACFCR */
(heth->Instance)->MACFCR = (uint32_t)tmpreg1;

/* Wait until the write operation will be taken into account:
at least four TX_CLK/RX_CLK clock cycles */
tmpreg1 = (heth->Instance)->MACFCR;
HAL_Delay(ETH_REG_WRITE_DELAY);
(heth->Instance)->MACFCR = tmpreg1;

/*----------------------- ETHERNET MACVLANTR Configuration ----------------*/
/* Set the ETV bit according to ETH VLANTagComparison value */
/* Set the VL bit according to ETH VLANTagIdentifier value */
(heth->Instance)->MACVLANTR = (uint32_t)(macinit.VLANTagComparison |
macinit.VLANTagIdentifier);

/* Wait until the write operation will be taken into account:
at least four TX_CLK/RX_CLK clock cycles */
tmpreg1 = (heth->Instance)->MACVLANTR;
HAL_Delay(ETH_REG_WRITE_DELAY);
(heth->Instance)->MACVLANTR = tmpreg1;

/* Ethernet DMA default initialization ************************************/
dmainit.DropTCPIPChecksumErrorFrame = ETH_DROPTCPIPCHECKSUMERRORFRAME_ENABLE;
dmainit.ReceiveStoreForward = ETH_RECEIVESTOREFORWARD_ENABLE;
dmainit.FlushReceivedFrame = ETH_FLUSHRECEIVEDFRAME_ENABLE;
dmainit.TransmitStoreForward = ETH_TRANSMITSTOREFORWARD_ENABLE;
dmainit.TransmitThresholdControl = ETH_TRANSMITTHRESHOLDCONTROL_64BYTES;
dmainit.ForwardErrorFrames = ETH_FORWARDERRORFRAMES_DISABLE;
dmainit.ForwardUndersizedGoodFrames = ETH_FORWARDUNDERSIZEDGOODFRAMES_DISABLE;
dmainit.ReceiveThresholdControl = ETH_RECEIVEDTHRESHOLDCONTROL_64BYTES;
dmainit.SecondFrameOperate = ETH_SECONDFRAMEOPERARTE_ENABLE;
dmainit.AddressAlignedBeats = ETH_ADDRESSALIGNEDBEATS_ENABLE;
dmainit.FixedBurst = ETH_FIXEDBURST_ENABLE;
//dmainit.RxDMABurstLength = ETH_RXDMABURSTLENGTH_32BEAT;
//dmainit.TxDMABurstLength = ETH_TXDMABURSTLENGTH_32BEAT;
dmainit.RxDMABurstLength = ETH_RXDMABURSTLENGTH_1BEAT; // PH 24/6/26
dmainit.TxDMABurstLength = ETH_TXDMABURSTLENGTH_1BEAT; // PH 24/6/26
dmainit.EnhancedDescriptorFormat = ETH_DMAENHANCEDDESCRIPTOR_ENABLE;
dmainit.DescriptorSkipLength = 0x0U;
dmainit.DMAArbitration = ETH_DMAARBITRATION_ROUNDROBIN_RXTX_1_1;

/* Get the ETHERNET DMAOMR value */
tmpreg1 = (heth->Instance)->DMAOMR;
/* Clear xx bits */
tmpreg1 &= ETH_DMAOMR_CLEAR_MASK;

/* Set the DT bit according to ETH DropTCPIPChecksumErrorFrame value */
/* Set the RSF bit according to ETH ReceiveStoreForward value */
/* Set the DFF bit according to ETH FlushReceivedFrame value */
/* Set the TSF bit according to ETH TransmitStoreForward value */
/* Set the TTC bit according to ETH TransmitThresholdControl value */
/* Set the FEF bit according to ETH ForwardErrorFrames value */
/* Set the FUF bit according to ETH ForwardUndersizedGoodFrames value */
/* Set the RTC bit according to ETH ReceiveThresholdControl value */
/* Set the OSF bit according to ETH SecondFrameOperate value */
tmpreg1 |= (uint32_t)(dmainit.DropTCPIPChecksumErrorFrame |
dmainit.ReceiveStoreForward |
dmainit.FlushReceivedFrame |
dmainit.TransmitStoreForward |
dmainit.TransmitThresholdControl |
dmainit.ForwardErrorFrames |
dmainit.ForwardUndersizedGoodFrames |
dmainit.ReceiveThresholdControl |
dmainit.SecondFrameOperate);

/* Write to ETHERNET DMAOMR */
(heth->Instance)->DMAOMR = (uint32_t)tmpreg1;

/* Wait until the write operation will be taken into account:
at least four TX_CLK/RX_CLK clock cycles */
tmpreg1 = (heth->Instance)->DMAOMR;
HAL_Delay(ETH_REG_WRITE_DELAY);
(heth->Instance)->DMAOMR = tmpreg1;

/*----------------------- ETHERNET DMABMR Configuration ------------------*/
/* Set the AAL bit according to ETH AddressAlignedBeats value */
/* Set the FB bit according to ETH FixedBurst value */
/* Set the RPBL and 4*PBL bits according to ETH RxDMABurstLength value */
/* Set the PBL and 4*PBL bits according to ETH TxDMABurstLength value */
/* Set the Enhanced DMA descriptors bit according to ETH EnhancedDescriptorFormat value*/
/* Set the DSL bit according to ETH DesciptorSkipLength value */
/* Set the PR and DA bits according to ETH DMAArbitration value */
(heth->Instance)->DMABMR = (uint32_t)(dmainit.AddressAlignedBeats |
dmainit.FixedBurst |
dmainit.RxDMABurstLength | /* !! if 4xPBL is selected for Tx or Rx it is applied for the other */
dmainit.TxDMABurstLength |
dmainit.EnhancedDescriptorFormat |
(dmainit.DescriptorSkipLength << 2U) |
dmainit.DMAArbitration |
ETH_DMABMR_USP); /* Enable use of separate PBL for Rx and Tx */

/* Wait until the write operation will be taken into account:
at least four TX_CLK/RX_CLK clock cycles */
tmpreg1 = (heth->Instance)->DMABMR;
HAL_Delay(ETH_REG_WRITE_DELAY);
(heth->Instance)->DMABMR = tmpreg1;

if((heth->Init).RxMode == ETH_RXINTERRUPT_MODE)
{
/* Enable the Ethernet Rx Interrupt */
__HAL_ETH_DMA_ENABLE_IT((heth), ETH_DMA_IT_NIS | ETH_DMA_IT_R);
}

/* Initialize MAC address in ethernet MAC */
ETH_MACAddressConfig(heth, ETH_MAC_ADDRESS0, heth->Init.MACAddr);
}


 

 

PHolt.1 (Author)
Senior
June 24, 2026

To add: a quick and dirty test with 32 bit loading seemed to produce a lot less snow. So I will try to implement that if datasize is a multiple of 4.

The problem is that the 1st byte has to be manually set up, which makes this rather messy… The DMA will be fetching 4 bytes at a time, so the part of the buffer after the 1st byte needs to be 4-aligned.

This is worth doing because a) the vast majority of data on this project is a multiple of 4 and b) loading up 4 bytes at a time must improve any contention with other bus masters.
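The alignment condition can be checked per chunk, falling back to byte-wide loads when it does not hold. A minimal sketch, assuming byte 0 is still primed by hand and the DMA fetches the remaining n-1 bytes as 32-bit words; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the part of the chunk after the manually primed byte 0 can be
 * fetched by the DMA as 32-bit words: the tail must start on a 4-byte
 * boundary and its length must be a multiple of 4. */
static bool dma_tail_word_ok(const uint8_t *buf, size_t n)
{
    if (n < 5u)
        return false;                    /* byte 0 plus at least one word */
    uintptr_t tail = (uintptr_t)(buf + 1);
    return (tail % 4u == 0u) && ((n - 1u) % 4u == 0u);
}
```

Note that a buffer which is itself 4-aligned fails this test, because the tail then starts at offset 1; the check passes only when the chunk starts 3 bytes past a word boundary.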

PHolt.1 (Author)
Senior
June 25, 2026

After many more hours, I can report that enabling the DMA FIFO does nothing.

Switching the load width to 32 bits might solve some or all of it. I got some clean drawing but “wrong graphics”, because there is a complication: if you use the timer PWM mode to generate the /wr pulse and then trigger the DMA to load the next byte, you have to load the 1st byte “manually”. The DMA can then load subsequent 4-byte blocks, which means the source buffer (wherever it is) has to be 4-aligned after the 1st byte :) It gets just too messy.

So I went back to the 15 cycle config; 89ns cycle ie 11.2MB/sec and that is the fastest possible which produces no snow, no massively time-shifted data loads (due to dma latency) but is well below the ~15MB/sec possible with the ST7789 (and easily achieved by bit-banging PD8-PD15 albeit with 100% CPU loading).

The ETH DMA burst lengths (post above) do not really have much or any effect with the 15 cycle config, so something else is going on here.

I wonder if there is a way to avoid the manual load of the 1st byte. Basically a way to enable the DMA to move the data and then trigger the timer to produce the /wr, and repeat. Maybe with chained DMA channels?

waclawek.jan
Super User
June 25, 2026

With DMA set up and enabled, and TIM set up (especially TIMx_DIER) but not running (i.e. TIMx_CR1.CEN still = 0), you can trigger DMA simply by generating the respective CC event using TIMx_EGR.
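A rough sketch of that idea for the TIM8 CC4 setup used in this thread. The bit positions are those of the STM32F4 timer registers (CEN is CR1 bit 0, CC4DE is DIER bit 12, CC4G is EGR bit 4); `struct tim_regs` is a hypothetical stand-in for the TIM8 register block so the snippet runs on a host, and the function name is made up.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR1_CEN     (1u << 0)    /* counter enable             */
#define DIER_CC4DE  (1u << 12)   /* CC4 DMA request enable     */
#define EGR_CC4G    (1u << 4)    /* generate CC4 event by s/w  */

/* Hypothetical stand-in for the timer register block (host build only). */
struct tim_regs {
    volatile uint32_t CR1;
    volatile uint32_t DIER;
    volatile uint32_t EGR;
};

/* Ask the timer for one software CC4 event so the already-enabled DMA
 * stream moves the first byte. Refuses if the timer is running or the
 * CC4 DMA request is not enabled. */
static bool kick_first_dma_byte(struct tim_regs *tim)
{
    if (tim->CR1 & CR1_CEN)
        return false;               /* timer must still be stopped  */
    if (!(tim->DIER & DIER_CC4DE))
        return false;               /* CC4 must be routed to DMA    */
    tim->EGR = EGR_CC4G;            /* CC4 event -> one DMA request */
    return true;
}
```

With this approach the stream would presumably point at byte 0 with NDTR = n rather than at byte 1 with n-1, and the code would need to allow for the DMA write to land on PD8-15 before setting CEN.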

JW