2021-05-10 02:14 PM
I have an application running on an STM32F446 using code generated by STM32CubeMX. I use SPI1 in slave mode to receive commands, and have FATFS operating with USB HOST to connect to a USB FLASH drive. I can send commands via the SPI interface to open files, read files and write files, and everything is working perfectly most of the time. But on rare occasions I get a HardFault exception and the software locks up.
The HardFault always occurs while receiving a 2K block of data to be written to a file. I use HAL_SPI_Receive_DMA(&hspi1, spi_rcvbuffer, length) to initiate reception of the data (length has been verified to always be 2048 as expected). My HAL_SPI_RxCpltCallback() function will set a flag spi_rfb to zero when the SPI DMA transfer is complete. Meanwhile the main program waits in a loop for completion as follows:
while (spi_rfb) { // wait for received data
    HAL_UART_Transmit(&huart1, (uint8_t *)"w", 1, 1000); // temporary debug marker
}
Note that outputting the "w" characters to UART1 is just a temporary thing for debugging so I can see that I am waiting for data. It is ALWAYS in this wait loop that the HardFault happens (which is also when the SPI receive DMA transfer is in progress).
I can often receive many thousands of 2K blocks of data to be written to a file, and can often write files in excess of 10 MBytes without an error. But every once in a while it HardFaults and I don't know why. The HardFault always occurs while I am in the wait loop above.
I need some guidance on how I can determine the cause of this HardFault. Any help would be appreciated.
2021-05-10 02:19 PM
Have a Hard Fault Handler that outputs actionable data about the processor registers and failure address, so you can inspect exactly what code fails, and perhaps add sanity checks to that to look for pointer errors, stack corruption, etc.
Joseph Yiu's books on the Cortex parts might offer some insight into the processor's operation if the TRM is too dry.
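A handler of the kind described above might look roughly like the following. This is only a sketch, assuming a GCC-style toolchain and a Cortex-M4 target. The names fault_frame_t, fault_frame_format and fault_report are hypothetical, and in a CubeMX project the existing HardFault_Handler in stm32f4xx_it.c would have to be replaced by the one shown here. The formatting function is plain C, so it can be tried on a PC; the handler itself is only compiled for ARMv7-M targets.

```c
#include <stdint.h>
#include <stdio.h>

/* Registers the Cortex-M core pushes on exception entry, in stacking order. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, psr;
} fault_frame_t;

/* Format the stacked frame as text. Returns what snprintf returns.
 * (Hypothetical helper, not part of the HAL.) */
int fault_frame_format(const fault_frame_t *f, char *buf, size_t len)
{
    return snprintf(buf, len,
        "R0=%08lX R1=%08lX R2=%08lX R3=%08lX "
        "R12=%08lX LR=%08lX PC=%08lX PSR=%08lX",
        (unsigned long)f->r0, (unsigned long)f->r1,
        (unsigned long)f->r2, (unsigned long)f->r3,
        (unsigned long)f->r12, (unsigned long)f->lr,
        (unsigned long)f->pc, (unsigned long)f->psr);
}

#if defined(__ARM_ARCH_7EM__) || defined(__ARM_ARCH_7M__)
/* Called from the handler below with a pointer to the stacked frame. */
void fault_report(const fault_frame_t *frame)
{
    char line[128];
    fault_frame_format(frame, line, sizeof line);
    /* Assumption: output 'line' by whatever still works in fault context,
     * e.g. a polled UART write, or simply inspect it in the debugger. */
    for (;;) { }
}

/* Pick MSP or PSP depending on bit 2 of EXC_RETURN (held in LR),
 * then pass that stack pointer to fault_report() in R0. */
__attribute__((naked)) void HardFault_Handler(void)
{
    __asm volatile(
        "tst   lr, #4      \n"
        "ite   eq          \n"
        "mrseq r0, msp     \n"
        "mrsne r0, psp     \n"
        "b     fault_report\n");
}
#endif
```

The stacked PC is the address that faulted and the stacked LR is where the faulting code would have returned to, so looking both up in the map file or disassembly usually narrows things down quickly.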
2021-05-10 02:23 PM
Examine the SCB registers to shed light on the reason for the hard fault.
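As a rough illustration of what examining the SCB registers involves, the sketch below decodes raw values of SCB->CFSR and SCB->HFSR and converts an exception number into an IRQ number. The bit positions follow the ARMv7-M fault status register layout; the function names are hypothetical helpers, not part of CMSIS or the HAL. The exception number can be taken from the VECTACTIVE field of SCB->ICSR, or from bits [8:0] of the stacked xPSR when you want to know what was running before the fault.

```c
#include <stdint.h>
#include <stdio.h>

/* Exception numbers 16 and up are external interrupts: IRQn = number - 16. */
int irq_from_exception(uint32_t exception_number)
{
    return (int)(exception_number & 0x1FFu) - 16;
}

/* SCB->HFSR bit 30 (FORCED): a configurable fault escalated to HardFault,
 * so the real cause is recorded in SCB->CFSR. */
int hfsr_forced(uint32_t hfsr)
{
    return (int)((hfsr >> 30) & 1u);
}

typedef struct { uint32_t mask; const char *name; } cfsr_bit_t;

/* Cause flags in SCB->CFSR (MemManage, BusFault, UsageFault sub-registers). */
static const cfsr_bit_t cfsr_bits[] = {
    { 1u << 0,  "IACCVIOL" },   { 1u << 1,  "DACCVIOL" },
    { 1u << 3,  "MUNSTKERR" },  { 1u << 4,  "MSTKERR" },
    { 1u << 8,  "IBUSERR" },    { 1u << 9,  "PRECISERR" },
    { 1u << 10, "IMPRECISERR" },{ 1u << 11, "UNSTKERR" },
    { 1u << 12, "STKERR" },     { 1u << 16, "UNDEFINSTR" },
    { 1u << 17, "INVSTATE" },   { 1u << 18, "INVPC" },
    { 1u << 19, "NOCP" },       { 1u << 24, "UNALIGNED" },
    { 1u << 25, "DIVBYZERO" },
};

/* Write the names of the set cause flags, space separated, into out.
 * Returns how many flags were written; stops early if out is too small. */
int cfsr_decode(uint32_t cfsr, char *out, size_t len)
{
    int count = 0;
    size_t used = 0;
    if (len > 0) out[0] = '\0';
    for (size_t i = 0; i < sizeof cfsr_bits / sizeof cfsr_bits[0]; i++) {
        if (cfsr & cfsr_bits[i].mask) {
            int n = snprintf(out + used, len - used, "%s%s",
                             count ? " " : "", cfsr_bits[i].name);
            if (n < 0 || (size_t)n >= len - used) break;
            used += (size_t)n;
            count++;
        }
    }
    return count;
}
```

For example, exception number 51 corresponds to IRQ35. A call through a NULL function pointer typically shows up as INVSTATE with FORCED set in HFSR, because address 0 has the Thumb bit clear.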
2021-05-12 10:40 AM
This was a very difficult problem to solve. The SCB registers indicated I was executing IRQ35 (SPI1_IRQHandler) when the HardFault happened. Further debugging pinned it down to an RXNEIE interrupt. Eventually I found a severe bug in the HAL_SPI driver supplied by ST Micro. Here is my analysis of what was happening, and my ultimate solution.
I use SPI1 in slave mode and normally have a pending receive request which I start by calling HAL_SPI_Receive_IT(). However, I sometimes need to switch from an interrupt driven receive mode to a DMA driven receive mode. To do this I call HAL_SPI_Abort() followed by HAL_SPI_Receive_DMA(). This is where the wheel falls off the cart, so to speak.
HAL_SPI_Abort() works by changing hspi->RxISR to SPI_AbortRx_ISR (assuming RXNEIE was enabled, which it is), then waiting for a receive data interrupt or a timeout. Here is the problem: in master mode a receive data interrupt will happen very soon, but in slave mode it may not happen for a very long time. It may not happen at all! SPI_AbortRx_ISR is what disables the RXNEIE interrupt, so if no receive data interrupt happens, HAL_SPI_Abort() times out, returns with an error, and leaves RXNEIE enabled. In fact, it is nearly impossible to abort a receive data request started by HAL_SPI_Receive_IT() by using HAL_SPI_Abort() when operating in slave mode.
But I didn't realize this, and didn't bother to check the return status of HAL_SPI_Abort(). I simply called HAL_SPI_Receive_DMA(). One of the things that HAL_SPI_Receive_DMA() does is set the hspi->RxISR function pointer to NULL. This leaves things in a very odd state: the RXNEIE bit is still set, the hspi->RxISR pointer is NULL, and there is a pending SPI receive transfer in DMA mode. It seems that in this condition the DMA normally takes the received data out of the data register before a receive data interrupt can happen, and the DMA transfer completes normally. But every once in a while a receive data interrupt does happen, and the interrupt service routine blindly calls the hspi->RxISR function with the function pointer set to NULL. BANG, instant HardFault.
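To make the sequence above easier to follow, here is a small PC-side model of the state involved. It is not the HAL code: spi_model_t, spi_model_irq and rx_byte are hypothetical names, and only the two bits that matter are modeled (RXNE is bit 0 of SR and RXNEIE is bit 6 of CR2 on the STM32F4 SPI). The NULL check exists only so the model can report the problem; per the analysis above, the real interrupt handler calls through the pointer without checking.

```c
#include <stddef.h>
#include <stdint.h>

#define SR_RXNE     (1u << 0)   /* receive buffer not empty */
#define CR2_RXNEIE  (1u << 6)   /* RXNE interrupt enable */

typedef struct spi_model {
    uint32_t sr;
    uint32_t cr2;
    void (*rx_isr)(struct spi_model *);  /* stands in for hspi->RxISR */
} spi_model_t;

typedef enum { IRQ_IGNORED, IRQ_HANDLED, IRQ_WOULD_FAULT } irq_result_t;

/* Stand-in receive handler: reading the data register clears RXNE. */
void rx_byte(spi_model_t *s)
{
    s->sr &= ~SR_RXNE;
}

/* Model of the receive branch of the SPI interrupt handler. */
irq_result_t spi_model_irq(spi_model_t *s)
{
    if (!(s->sr & SR_RXNE) || !(s->cr2 & CR2_RXNEIE)) {
        return IRQ_IGNORED;          /* no enabled receive interrupt */
    }
    if (s->rx_isr == NULL) {
        return IRQ_WOULD_FAULT;      /* real code would jump to address 0 */
    }
    s->rx_isr(s);
    return IRQ_HANDLED;
}
```

The bad state is cr2 with RXNEIE still set and rx_isr equal to NULL. As long as the DMA empties the data register first, RXNE is never seen set by the handler and nothing happens; the one time the interrupt wins the race, the model returns IRQ_WOULD_FAULT.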
My solution was to clear the TXEIE, RXNEIE and ERRIE bits in the hspi1.Instance->CR2 register BEFORE calling HAL_SPI_Abort(). This avoids the timeout loops and ensures that the SPI1_IRQHandler cannot call a NULL function.
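In code, the workaround amounts to something like the sketch below. The helper spi_mask_interrupts and the *_BIT macro names are hypothetical; the bit positions (ERRIE bit 5, RXNEIE bit 6, TXEIE bit 7) are taken from the STM32F4 SPI_CR2 register layout. The target-side usage is shown only as a comment because it depends on the project's hspi1, spi_rcvbuffer and length.

```c
#include <stdint.h>

#define SPI_CR2_ERRIE_BIT   (1u << 5)   /* error interrupt enable */
#define SPI_CR2_RXNEIE_BIT  (1u << 6)   /* RX buffer not empty interrupt enable */
#define SPI_CR2_TXEIE_BIT   (1u << 7)   /* TX buffer empty interrupt enable */

/* Clear the three SPI interrupt enable bits, leaving the rest of CR2 alone. */
static inline void spi_mask_interrupts(volatile uint32_t *cr2)
{
    *cr2 &= ~(SPI_CR2_TXEIE_BIT | SPI_CR2_RXNEIE_BIT | SPI_CR2_ERRIE_BIT);
}

/* Intended use on the target (not compiled here):
 *
 *   spi_mask_interrupts(&hspi1.Instance->CR2);
 *   if (HAL_SPI_Abort(&hspi1) != HAL_OK) {
 *       // do not ignore this: report or recover
 *   }
 *   if (HAL_SPI_Receive_DMA(&hspi1, spi_rcvbuffer, length) != HAL_OK) {
 *       // report or recover
 *   }
 */
```

Checking both return values costs nothing and would have exposed the failed abort much earlier.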
The best solution would be for ST Micro to fix their broken HAL_SPI_Abort() function so it can properly abort a pending transfer in slave mode.