SPI DMA master - slave communication issues, F722 and F765 SPI_FLAG_BSY and HAL_DMA_ERROR_FE errors (HAL)

Crystal · ‎2021-08-05

Hi there!

I have a project with one Master (F765) and three Slaves (F722), connected by separate SPI channels for data exchange via DMA. The slaves also use additional Busy and DataReady signals for flow control and synchronization.

The IRQ handlers work only with state flags. TransmitReceive_DMA operations are called only from the main thread. FreeRTOS was left out for testing purposes. The data and instruction caches were disabled. Even so, I've bumped into several issues.

F722 SPI transactions in Slave mode (at any speed and in any mode of operation: blocking, IT, DMA) occasionally failed with HAL_SPI_ERROR_FLAG after the wait for SPI_FLAG_BSY timed out. The data flow lost packets and had too many gaps, as shown below:

I tried swapping the roles of the controllers: three Masters (F722) and one "triple" Slave (F765) with three SPIs in Slave mode. Each SPI channel worked perfectly on its own, but all links crashed as soon as even two channels were active simultaneously. So that was not a good idea.

I then created an additional empty project with very simple test code for a single SPI port, to rule out any other cause of the error.

Master main loop:

while (1) {
	if  (HAL_GPIO_ReadPin(BusyPin.Port, BusyPin.Pin) == GPIO_PIN_RESET)  {
		startTick = HAL_GetTick();
		HAL_SPI_Transmit(MasterSPI, payload, sizeof payload, 1);
		TxCount++;
		// Wait until the slave reports busy after receiving the packet, to avoid an occasional overrun
		while (HAL_GPIO_ReadPin(BusyPin.Port, BusyPin.Pin) != GPIO_PIN_SET && HAL_GetTick() - startTick < SPI_TIMEOUT) { ; }
	}
}

Slave main loop:

while (1) {
	HAL_GPIO_WritePin(MCD_BUSY_GPIO_Port, MCD_BUSY_Pin, GPIO_PIN_RESET); // We are ready now
	spiRes = HAL_SPI_Receive(&hspi1, payload, sizeof payload, 1);
	HAL_GPIO_WritePin(MCD_BUSY_GPIO_Port, MCD_BUSY_Pin, GPIO_PIN_SET); // We are busy now
	SpiTotalPackets++;
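	if (spiRes != HAL_OK) SpiFailCount++;
	else SpiRxCount++;
	// Fraction of failed packets (multiply by 100 for percent)
	if (SpiTotalPackets) SpiPercentOfErr = (float) SpiFailCount / (float) SpiTotalPackets;
	// Some delay, about 10 us; volatile keeps the empty loop from being optimized away
	for (volatile uint32_t i = 0; i < 256; i++) { ; }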
}

The error rate was 0.21%, 0.32% and 0.45% for the three slaves, independent of speed (tested at 156 kHz, 2.5 MHz, 10 MHz and 20 MHz).

There was no other way to get the project running than to change the HAL source code. First, I made the SPI_WaitFlagStateUntilTimeout function skip the SPI_FLAG_BSY check. This completely removed the errors on the slave side:

// Slave library, file : stm32f7xx_hal_spi.c
 
static HAL_StatusTypeDef SPI_WaitFlagStateUntilTimeout(SPI_HandleTypeDef *hspi, uint32_t Flag, FlagStatus State,
                                                       uint32_t Timeout, uint32_t Tickstart)
{
	if (Flag == SPI_FLAG_BSY) return HAL_OK; // Avoid SPI freezing
  __IO uint32_t count;
  uint32_t tmp_timeout;
  uint32_t tmp_tickstart;
// ... Code

Next, I used the full version of the link protocol (DMA and three parallel SPI channels), but now rare errors appeared on the master side during transmission. After a few minutes of active packet exchange on all three channels, the Master (F765) caught HAL_DMA_ERROR_FE. So I was forced to suppress this error as well:

// Master library, file : stm32f7xx_hal_dma.c
 
HAL_StatusTypeDef HAL_DMA_PollForTransfer(DMA_HandleTypeDef *hdma, HAL_DMA_LevelCompleteTypeDef CompleteLevel, uint32_t Timeout) {
// ... Code
    if((tmpisr & (DMA_FLAG_FEIF0_4 << hdma->StreamIndex)) != RESET)
    {
      /* Update error code */
      if(hdma->Init.FIFOMode != DMA_FIFOMODE_DISABLE) // Avoid disabled FIFO DMA errors
        hdma->ErrorCode |= HAL_DMA_ERROR_FE;
      /* Clear the FIFO error flag */
      regs->IFCR = DMA_FLAG_FEIF0_4 << hdma->StreamIndex;
    }
// ... Code
 
void HAL_DMA_IRQHandler(DMA_HandleTypeDef *hdma) {
// ... Code
  /* FIFO Error Interrupt management ******************************************/
  if ((tmpisr & (DMA_FLAG_FEIF0_4 << hdma->StreamIndex)) != RESET) {
    if(__HAL_DMA_GET_IT_SOURCE(hdma, DMA_IT_FE) != RESET) {
      /* Clear the FIFO error flag */
      regs->IFCR = DMA_FLAG_FEIF0_4 << hdma->StreamIndex;
      /* Update error code */
      if(hdma->Init.FIFOMode != DMA_FIFOMODE_DISABLE) // Avoid disabled FIFO DMA errors
        hdma->ErrorCode |= HAL_DMA_ERROR_FE;
    }
  }

Only after these changes did transmission work fully on all channels.

At last, full-duplex data exchange between the Master and the three Slaves became healthy. For example, here is the result of a continuous ping-pong test with 64-byte packets (full working program and protocol, DMA, 40 MHz SPI SCK frequency):

Even though it works, I don't think this is a nice solution. Hence the question: is there a better way to resolve this problem without modifying the sources generated automatically by CubeMX?

P.S. One more observation from this project: SPI slave mode on the F722 is not very debugger-friendly, especially when watches are updated automatically. Lowering the IRQ priority for the debugger doesn't help in this case.

TDK · ‎2021-08-05

Try using HAL_SPI_TransmitReceive exclusively instead of HAL_SPI_Transmit and HAL_SPI_Receive.

Use a much bigger timeout than 1 ms on the slave. Presumably you want it to wait forever for the master to clock in data.

When you don't get HAL_OK, what is the return value instead?

I wouldn't assume that just ignoring the error flag will fix the issue. Presumably it's getting set for a reason; figure out what that is. Toggle a GPIO pin high when it happens so you can look at the scope and see what's going on with the signals at that point. Possibly master and slave are out of sync.
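
To make "what is the error" visible in a log, the error bitmask can be turned into readable names. The following is only a host-side sketch: the DEMO_SPI_ERROR_* values are stand-ins that are meant to mirror the HAL_SPI_ERROR_* defines in stm32f7xx_hal_spi.h and should be checked against your HAL version, and demo_spi_error_to_string is a hypothetical helper, not a HAL function. On the target, the mask would typically come from HAL_SPI_GetError().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in bit values (assumption: they mirror HAL_SPI_ERROR_* in
 * stm32f7xx_hal_spi.h; verify against your HAL version). */
#define DEMO_SPI_ERROR_MODF  0x01u
#define DEMO_SPI_ERROR_CRC   0x02u
#define DEMO_SPI_ERROR_OVR   0x04u
#define DEMO_SPI_ERROR_FRE   0x08u
#define DEMO_SPI_ERROR_DMA   0x10u
#define DEMO_SPI_ERROR_FLAG  0x20u
#define DEMO_SPI_ERROR_ABORT 0x40u

typedef struct { uint32_t bit; const char *name; } demo_err_name_t;

static const demo_err_name_t demo_spi_err_names[] = {
    { DEMO_SPI_ERROR_MODF,  "MODF"  }, { DEMO_SPI_ERROR_CRC,  "CRC"  },
    { DEMO_SPI_ERROR_OVR,   "OVR"   }, { DEMO_SPI_ERROR_FRE,  "FRE"  },
    { DEMO_SPI_ERROR_DMA,   "DMA"   }, { DEMO_SPI_ERROR_FLAG, "FLAG" },
    { DEMO_SPI_ERROR_ABORT, "ABORT" },
};

/* Hypothetical helper: writes the names of all set error bits into buf,
 * joined by '|', and returns the number of characters written.
 * Names that no longer fit into the buffer are dropped. */
size_t demo_spi_error_to_string(uint32_t err, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0u);
    size_t used = 0u;
    buf[0] = '\0';
    if (err == 0u) {
        snprintf(buf, buflen, "NONE");
        return strlen(buf);
    }
    for (size_t i = 0u; i < sizeof demo_spi_err_names / sizeof demo_spi_err_names[0]; i++) {
        if ((err & demo_spi_err_names[i].bit) == 0u) {
            continue;
        }
        int n = snprintf(buf + used, buflen - used, "%s%s",
                         used ? "|" : "", demo_spi_err_names[i].name);
        if (n < 0 || (size_t)n >= buflen - used) {
            buf[used] = '\0';   /* drop the partially written name */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

Logging this string together with the GPIO toggle makes it easier to match a scope capture to a specific error, for example an overrun versus a BSY timeout.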


Crystal · ‎2021-08-06

TDK, thanks a lot for your response!

As far as I can see, the SPI_BSY flag is a known issue on various ST MCUs. For example, it is described in the STM32F446xC/xE errata sheet; the workaround there is to poll TXE, but that approach is not really applicable at high speed.

The only difference between HAL_SPI_TransmitReceive_DMA and separate Transmit/Receive calls is that it starts two DMA streams, one for RX and one for TX. The timeout is defined in the HAL library as SPI_DEFAULT_TIMEOUT = 100 ms and is checked in the SPI_EndRxTxTransaction() function. But even a 1 ms timeout is too long. Moreover, in case of an error I need to resend the lost packet.
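
The resend requirement can be sketched as a bounded retry around one packet exchange. This is a host-side illustration only: demo_xfer_fn, demo_exchange_with_retry and demo_flaky_xfer are hypothetical names, and the transfer callback stands in for whatever actually moves the packet (for example a DMA exchange plus the Busy/DataReady handshake), returning 0 on success.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical transfer callback: returns 0 on success, non-zero on error. */
typedef int (*demo_xfer_fn)(const uint8_t *tx, uint8_t *rx, size_t len, void *ctx);

/* Tries the exchange up to max_attempts times and stops at the first success.
 * Returns 0 on success, -1 if every attempt failed. If attempts_used is not
 * NULL, it receives the number of attempts that were made. */
int demo_exchange_with_retry(demo_xfer_fn xfer, void *ctx,
                             const uint8_t *tx, uint8_t *rx, size_t len,
                             unsigned max_attempts, unsigned *attempts_used)
{
    assert(xfer != NULL && max_attempts > 0u);
    for (unsigned attempt = 1u; attempt <= max_attempts; attempt++) {
        if (attempts_used != NULL) {
            *attempts_used = attempt;
        }
        if (xfer(tx, rx, len, ctx) == 0) {
            return 0;
        }
    }
    return -1;
}

/* Test double: fails as many times as the counter behind ctx says,
 * then behaves like a loopback link (rx receives a copy of tx). */
int demo_flaky_xfer(const uint8_t *tx, uint8_t *rx, size_t len, void *ctx)
{
    unsigned *failures_left = (unsigned *)ctx;
    if (*failures_left > 0u) {
        (*failures_left)--;
        return -1;
    }
    memcpy(rx, tx, len);
    return 0;
}
```

A bounded attempt count keeps one bad link from blocking the other two channels indefinitely; the right limit depends on the protocol and is not something this sketch can decide.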

DMA_FLAG_FEIF0_4 and HAL_DMA_ERROR_FE on a DMA channel with the FIFO disabled have also been a recurring question for years, as it turns out. I found a more reliable solution: checking the DMA handle field Init.FIFOMode == DMA_FIFOMODE_DISABLE. I hope it will work fine, but it still requires "patching" the HAL sources (((

if((tmpisr & (DMA_FLAG_FEIF0_4 << hdma->StreamIndex)) != RESET) {
	/* Update error code */
	if(hdma->Init.FIFOMode != DMA_FIFOMODE_DISABLE) // Avoid disabled FIFO errors
		hdma->ErrorCode |= HAL_DMA_ERROR_FE;
	/* Clear the FIFO error flag */
	regs->IFCR = DMA_FLAG_FEIF0_4 << hdma->StreamIndex;
}
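
The same rule (a FIFO error only counts when the FIFO is actually in use) could in principle be applied in application code, for example in an error callback, instead of inside the HAL sources. The sketch below shows only the filtering rule on a host machine: demo_dma_state_t and demo_dma_effective_error are hypothetical, and the DEMO_DMA_ERROR_* values are stand-ins assumed to mirror HAL_DMA_ERROR_TE and HAL_DMA_ERROR_FE. Whether filtering at this level is enough depends on how your HAL version reacts to the flag before the callback runs, which is not verified here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in bit values (assumption: they mirror HAL_DMA_ERROR_TE and
 * HAL_DMA_ERROR_FE in stm32f7xx_hal_dma.h; verify against your HAL version). */
#define DEMO_DMA_ERROR_TE 0x01u
#define DEMO_DMA_ERROR_FE 0x02u

/* Hypothetical snapshot of the two DMA handle fields the rule needs. */
typedef struct {
    bool     fifo_enabled;  /* false corresponds to DMA_FIFOMODE_DISABLE */
    uint32_t error_code;    /* accumulated DMA error bits */
} demo_dma_state_t;

/* Returns the error bits the application should act on: the FIFO error bit
 * is masked out when the FIFO is disabled, all other bits pass through. */
uint32_t demo_dma_effective_error(const demo_dma_state_t *dma)
{
    assert(dma != NULL);
    uint32_t err = dma->error_code;
    if (!dma->fifo_enabled) {
        err &= ~DEMO_DMA_ERROR_FE;
    }
    return err;
}
```

A transfer error reported together with the FIFO error still comes through, so only the FIFO-disabled false positive is suppressed.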

TDK · ‎2021-08-06

The first code in the OP doesn't use DMA. Any DMA issue can't be the explanation of why that isn't working.


Crystal · ‎2021-08-06

My apologies for not specifying.

The simple code was created in addition to the main project; it allowed me to catch the cause of the error and to gather some statistics.

After eliminating that cause, I used the full-fledged transport protocol and more complex code, with DMA, interrupts and three independent high-speed SPIs.

Thanks, I've edited the topic for clarity.

Petr Sladecek · ‎2021-08-10

Hello,

let me comment on your thread. Above all, monitoring the BSY bit at the end of a session is necessary when an SPI slave in transmit-only mode needs to be reconfigured or disabled after the last session has fully completed (e.g. before switching the slave into a low-power mode). The point is to prevent such a premature action from corrupting the ongoing session. Note that the driver has to be written in a conservative and universal way to cover all such cases. That is why the BSY test is included there: to close the session safely. If the slave always stays active and listening on the bus, there is no need to check the BSY status once the expected amount of data has been handled by DMA, especially if the slave is exclusively a receiver and the DMA reception session has completed.

Concerning the coexistence of the DMA channels, note that they compete with each other, as well as with the other AHB matrix masters, for the common AHB bus matrix under round-robin arbitration (there are 12 masters on the F7, including the Cortex core). A significant occasional latency in servicing a pending DMA channel can appear, for example, while the Cortex core performs a non-interruptible operation such as saving or restoring the CPU register context on entry to or return from an interrupt service routine, or executing an instruction like LDMIA, especially when another DMA channel is pending at the same time. Of course, DMA channel priority plays a role, too. I suggest monitoring data overrun at the slave receivers, at least to check whether the DMA channels are getting sufficient access. On the master side, the integrity of data received from a slave should always be checked to detect whether the slave has suffered a data underrun. For more details, see AN5543.
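
As an illustration of the integrity check on the master side, one option is to reserve the last two bytes of each packet for a checksum. The sketch below is an assumption, not part of the original protocol: it uses the 64-byte packet size mentioned in the thread, a software CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF), and hypothetical helper names. The SPI peripheral's hardware CRC would be an alternative to computing it in software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PACKET_SIZE  64u                      /* packet size from the ping-pong test */
#define DEMO_PAYLOAD_SIZE (DEMO_PACKET_SIZE - 2u)  /* last two bytes hold the CRC */

/* Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF,
 * no bit reflection, no final XOR. */
uint16_t demo_crc16_ccitt(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0u);
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0u; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if ((crc & 0x8000u) != 0u) {
                crc = (uint16_t)((uint16_t)(crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Sender side: compute the CRC over the payload and store it big-endian
 * in the last two bytes of the packet. */
void demo_packet_seal(uint8_t packet[DEMO_PACKET_SIZE])
{
    uint16_t crc = demo_crc16_ccitt(packet, DEMO_PAYLOAD_SIZE);
    packet[DEMO_PAYLOAD_SIZE]      = (uint8_t)(crc >> 8);
    packet[DEMO_PAYLOAD_SIZE + 1u] = (uint8_t)(crc & 0xFFu);
}

/* Receiver side: recompute the CRC and compare it with the stored one. */
bool demo_packet_is_intact(const uint8_t packet[DEMO_PACKET_SIZE])
{
    uint16_t stored = (uint16_t)(((uint16_t)packet[DEMO_PAYLOAD_SIZE] << 8) |
                                 packet[DEMO_PAYLOAD_SIZE + 1u]);
    return demo_crc16_ccitt(packet, DEMO_PAYLOAD_SIZE) == stored;
}
```

A packet that fails the check on the master can then be treated like any other transfer error, for example by requesting it again.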

Best regards,

Petr, the IP owner, STMicroelectronics