DMA occasionally stops during burst transfer from external SDRAM to eMMC

devchang · 2024-08-16

There is a rare issue where the DMA operation occasionally stops during the burst transfer of data from external SDRAM to eMMC via DMA. I’m having difficulty identifying the exact cause, so I’m posting here to see if anyone has any good ideas.

The hardware details are as follows:

The MCU being used is the STM32F469NI.
SDRAM is accessed via the FMC bus. To be precise, the FPGA is directly connected to the FMC, and the SDRAM is accessed indirectly by passing the desired SDRAM address to the FPGA through the FMC.
The eMMC is connected to the MCU via the SDIO interface.

uint32_t CPL_MMC::write_from_ddr(u32 i_n_ddr_addr, uint32_t WriteAddr, uint32_t NumberOfBlocks)
{
    u32 n_ddr_addr = i_n_ddr_addr;
    u8* writebuff = NULL;

    uint32_t errorstatus = HAL_MMC_ERROR_NONE;

    CPL_FPGA::get_instance()->m_mutex_c.lock();

    CPL_FPGA::get_instance()->write_no_lock(eFPGA_REG_DDR_HIGHBIT_ADDR_CTL,n_ddr_addr>>FMC_ADDR_LINE_NUM);
    n_ddr_addr &= FMC_ADDR_MASK;
    n_ddr_addr |= FMC_DDR_SELECT;

    writebuff = (u8*)(FPGA_FSM_ADDRESS + (n_ddr_addr<<1));
    CPL_FPGA::get_instance()->fsmc_configuration(17);
    errorstatus = write(writebuff, WriteAddr, NumberOfBlocks );

    CPL_FPGA::get_instance()->m_mutex_c.unlock();

    return errorstatus;
}

void CPL_FPGA::write_no_lock(u32 addr, u16 i_w_value)
{
	addr &= ~FMC_DDR_SELECT;
	addr = FPGA_FSM_ADDRESS + (addr<<1);
    fsmc_configuration(10);

	HAL_SRAM_Write_16b(&hsram1, (uint32_t*)addr, &i_w_value, 1);
}

void CPL_FPGA::fsmc_configuration(s32 i_n_latency)
{
    FMC_NORSRAM_TimingTypeDef Timing = {0};
    /* Timing */
    Timing.AddressSetupTime = 0;
    Timing.AddressHoldTime = 0;
    Timing.DataSetupTime = 0;
    Timing.BusTurnAroundDuration = 15;
    Timing.CLKDivision = m_n_clk_divide;
    Timing.DataLatency = i_n_latency;
    Timing.AccessMode = FMC_ACCESS_MODE_A;

    HAL_SRAM_Init(&hsram1, &Timing, NULL);
}

uint32_t CPL_MMC::write(uint8_t *writebuff, uint32_t WriteAddr, uint32_t NumberOfBlocks)
{
	uint32_t errorstatus = HAL_MMC_ERROR_NONE;
	m_mutex_c.lock();

	uint32_t timeout = 0;

	while (HAL_DMA_GetState(&hdma_sdio) != HAL_DMA_STATE_RESET)
	{
	}

	hdma_sdio.Instance = DMA2_Stream3;
	hdma_sdio.Init.Channel = DMA_CHANNEL_4;
	hdma_sdio.Init.Direction = DMA_MEMORY_TO_PERIPH;
	hdma_sdio.Init.FIFOMode = DMA_FIFOMODE_ENABLE;
	hdma_sdio.Init.FIFOThreshold = DMA_FIFO_THRESHOLD_FULL;
	hdma_sdio.Init.MemBurst = DMA_MBURST_INC16;
	hdma_sdio.Init.MemDataAlignment = DMA_MDATAALIGN_BYTE;
	hdma_sdio.Init.MemInc = DMA_MINC_ENABLE;
	hdma_sdio.Init.Mode = DMA_PFCTRL;
	hdma_sdio.Init.PeriphBurst = DMA_PBURST_INC4;
	hdma_sdio.Init.PeriphDataAlignment = DMA_PDATAALIGN_WORD;
	hdma_sdio.Init.PeriphInc = DMA_PINC_DISABLE;
	hdma_sdio.Init.Priority = DMA_PRIORITY_VERY_HIGH;

	HAL_DMA_Init(&hdma_sdio);
	hmmc.hdmatx = &hdma_sdio;
	if(HAL_MMC_WriteBlocks_DMA(&hmmc, writebuff, WriteAddr, NumberOfBlocks) == HAL_OK)
	{
		timeout = HAL_GetTick();
		while((WriteStatus == 0) && ((HAL_GetTick() - timeout) < MMC_TIMEOUT))
		{
		}

		/* in case of a timeout, return an error */
		if (WriteStatus == 0)
		{
			printf("[error] mmc write fail %s %d, id: 0x%x\n", __FUNCTION__, __LINE__, hmmc.ErrorCode);
			errorstatus = RES_ERROR;
		}
		else
		{
			WriteStatus = 0;
			timeout = HAL_GetTick();

			while((HAL_GetTick() - timeout) < MMC_TIMEOUT)
			{
				if(HAL_MMC_GetCardState(&hmmc) == HAL_MMC_CARD_TRANSFER)
				{
					errorstatus = RES_OK;
					break;
				}
				HAL_Delay(6);
			}
		}
	}
	else
	{
		printf("[error] mmc write fail %s %d, id: 0x%x\n", __FUNCTION__, __LINE__, hmmc.ErrorCode);
		errorstatus = HAL_ERROR;
	}

	HAL_DMA_DeInit(&hdma_sdio);

    m_mutex_c.unlock();
    return errorstatus;
}

To transfer large amounts of data from SDRAM to eMMC, the CPL_MMC::write_from_ddr function is called repeatedly, moving one 64 KB chunk per call. The issue I am asking about occurs inside this function, during the call to CPL_MMC::write.

CPL_MMC::write uses HAL_MMC_WriteBlocks_DMA to transfer data to the eMMC from SDRAM, which is accessed via MMIO (Memory Mapped I/O). After HAL_MMC_WriteBlocks_DMA is called, the code waits up to 65,535 milliseconds (MMC_TIMEOUT) for the DMA completion interrupt. If the interrupt does not occur within this time, the write is treated as timed out.

When the issue occurs, the wait times out, and the ErrorCode reported by the HAL driver contains SDMMC_ERROR_TX_UNDERRUN (0x10). From that point on, the DMA interrupt that signals transfer completion never fires, and the eMMC is left in a bad state in which any attempt by other tasks to access it returns an error.
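For readers who want to interpret the `id: 0x%x` value printed by the error path, here is a minimal sketch of decoding it as a bitmask. The bit values are assumed to mirror the SDMMC_ERROR_* macros in the STM32F4 HAL header stm32f4xx_ll_sdmmc.h (only 0x10 for TX underrun is confirmed by this post), so check them against the HAL version in use. `describe_mmc_error` is a hypothetical helper, not a HAL function.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed bit values, mirroring SDMMC_ERROR_* in stm32f4xx_ll_sdmmc.h.
// Verify against the HAL version actually used in the project.
constexpr uint32_t kErrCmdCrcFail    = 0x00000001U;
constexpr uint32_t kErrDataCrcFail   = 0x00000002U;
constexpr uint32_t kErrCmdRspTimeout = 0x00000004U;
constexpr uint32_t kErrDataTimeout   = 0x00000008U;
constexpr uint32_t kErrTxUnderrun    = 0x00000010U;
constexpr uint32_t kErrRxOverrun     = 0x00000020U;

// Hypothetical helper: returns the names of the low error bits that are set,
// joined with '|'. Bits outside this table are ignored.
std::string describe_mmc_error(uint32_t code)
{
    struct Entry { uint32_t bit; const char* name; };
    static const Entry table[] = {
        { kErrCmdCrcFail,    "CMD_CRC_FAIL"    },
        { kErrDataCrcFail,   "DATA_CRC_FAIL"   },
        { kErrCmdRspTimeout, "CMD_RSP_TIMEOUT" },
        { kErrDataTimeout,   "DATA_TIMEOUT"    },
        { kErrTxUnderrun,    "TX_UNDERRUN"     },
        { kErrRxOverrun,     "RX_OVERRUN"      },
    };

    std::string out;
    for (const Entry& e : table) {
        if (code & e.bit) {
            if (!out.empty()) {
                out += '|';
            }
            out += e.name;
        }
    }
    return out.empty() ? std::string("NONE_OR_UNKNOWN") : out;
}
```

Logging the decoded names next to the raw value makes it easier to see whether the underrun is ever accompanied by a data timeout or CRC failure, which would point at a different part of the path.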

I captured the FMC bus waveform from the FPGA side at the moment the issue occurred. It shows that the FPGA prepares the SDRAM data for transfer correctly, but the MCU reads only the first 48 bytes of the burst read and then generates no further activity on the FMC bus.
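To relate the 48-byte figure to the DMA settings shown in CPL_MMC::write, the sketch below just does the arithmetic. With DMA_MDATAALIGN_BYTE and DMA_MBURST_INC16, each memory-side burst moves 16 bytes, so 48 bytes would correspond to three memory bursts. This is only a consistency check, not proof of where the transfer stalled. The 512-byte block size is an assumption (the usual eMMC sector size), and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Values taken from the DMA configuration in CPL_MMC::write.
constexpr uint32_t kMemBeatBytes     = 1;   // DMA_MDATAALIGN_BYTE
constexpr uint32_t kMemBurstBeats    = 16;  // DMA_MBURST_INC16
constexpr uint32_t kPeriphBeatBytes  = 4;   // DMA_PDATAALIGN_WORD
constexpr uint32_t kPeriphBurstBeats = 4;   // DMA_PBURST_INC4

constexpr uint32_t kChunkBytes = 64U * 1024U;  // one write_from_ddr call
constexpr uint32_t kBlockBytes = 512U;         // assumed eMMC block size

// Bytes moved by one memory-side burst (FMC reads).
constexpr uint32_t mem_burst_bytes()
{
    return kMemBeatBytes * kMemBurstBeats;
}

// Bytes moved by one peripheral-side burst (SDIO FIFO writes).
constexpr uint32_t periph_burst_bytes()
{
    return kPeriphBeatBytes * kPeriphBurstBeats;
}

// Number of whole memory bursts needed to read the given byte count.
constexpr uint32_t mem_bursts_for(uint32_t bytes)
{
    return bytes / mem_burst_bytes();
}

// Number of eMMC blocks in one chunk.
constexpr uint32_t blocks_per_chunk()
{
    return kChunkBytes / kBlockBytes;
}

// Both sides of the stream move the same number of bytes per burst.
static_assert(mem_burst_bytes() == periph_burst_bytes(),
              "memory and peripheral bursts differ in size");
```

Under these assumptions, `mem_bursts_for(48)` is 3 out of the 4096 bursts a full 64 KB chunk needs, so the stall happens almost immediately after the transfer starts rather than partway through the chunk.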

The issue does not occur when the operation described above runs on its own; it becomes more likely when other operations run at the same time. In the current hardware setup, the FPGA contains a MAC for Ethernet communication with a PC.

To check the packet information received by the MCU, an EXTI interrupt is triggered to the MCU whenever a packet is received, and the MCU immediately accesses the FPGA registers via the FMC bus to retrieve the packet data. At this time, the HAL_SRAM_Read_16b HAL API is used to access the FPGA through the FMC bus. Since this code uses the FMC bus, synchronization is applied to avoid conflicts with other FMC bus operations, such as those in CPL_MMC::write, which also uses DMA. Therefore, FMC usage should not overlap.

While transferring data from SDRAM to eMMC via DMA, the issue does not always occur, even if Ethernet packet reception is repeated. Typically, it takes over 10 hours of repeated operation testing for the issue to manifest. However, if Ethernet packet reception is completely disabled, the problem does not arise even after several days of repeated operation testing.

This leads me to believe that the problem occurs when Ethernet packet reception happens at a very specific and critical timing during the DMA data transfer. I suspect that the EXTI interrupt and FMC access using the HAL_SRAM_Read_16b API during Ethernet reception might increase the likelihood of the issue. However, I cannot be certain that these factors interfere with DMA operation.
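One way to test this hypothesis would be to instrument the code so that any CPU-driven FMC access (or FMC timing change) that lands while the SDIO DMA is in flight gets counted. The following is a minimal sketch of such a probe. `FmcOverlapProbe` and its methods are hypothetical names that do not exist in the HAL or in the code above, and where to call them (for example around HAL_MMC_WriteBlocks_DMA, in the transfer-complete callback, and at the top of the EXTI-driven read path) is an assumption about the project structure.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical debug probe: records whether the CPU touched the FMC
// while a DMA transfer that also uses the FMC was still running.
struct FmcOverlapProbe
{
    std::atomic<bool>     dma_in_flight{false};
    std::atomic<uint32_t> overlap_count{0};

    // Call just before starting the DMA transfer.
    void dma_started()
    {
        dma_in_flight.store(true, std::memory_order_release);
    }

    // Call from the transfer-complete or error callback.
    void dma_finished()
    {
        dma_in_flight.store(false, std::memory_order_release);
    }

    // Call at the top of every CPU-driven FMC access or FMC
    // reconfiguration. Returns true if it overlapped a DMA transfer.
    bool note_cpu_fmc_access()
    {
        if (dma_in_flight.load(std::memory_order_acquire)) {
            overlap_count.fetch_add(1, std::memory_order_relaxed);
            return true;
        }
        return false;
    }
};
```

If `overlap_count` stays at zero across a failing run, the synchronization is doing its job and the cause is more likely elsewhere (for example bus arbitration or interrupt latency). If it is non-zero, the overlapping call sites become the first thing to examine.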

I anticipate that my explanation may seem quite complicated, so I want to thank everyone who takes the time to read through it. While I’m not necessarily looking for an exact solution, I would greatly appreciate any ideas or suggestions on what else I could investigate to debug this issue.

Tesla DeLorean · ‎2024-08-16

Is the SDMMC flagging some under/overrun issue? Or the eMMC itself? A race condition in the DMA completion interrupt/flagging?

Buffers aligned? Any fault on the DMA side?
