SPI slave mode with DMA throughput

HZhu.12 · ‎2019-06-18

The STM32MP157 has an M4 core, so I am asking here in the MCU section even though the chip belongs to the MPU family. I am trying to use the M4's SPI4 to receive data at a throughput of 20.6 Mbit/s: each SPI transaction carries a 10.5 KB data block, and there are 240 transactions per second. Currently I am using two STM32MP157-DK2 boards running a modified version of the M4 core SPI DMA demo. The readme.txt file from the demo is attached.

What I found is that when the SPI master clock is set to 32 MHz (hspi4.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_2), the slave-side code crashes, but it runs OK at 16 MHz (hspi4.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_4). With a logic analyzer, I can see that the slave side works for a while and then MISO becomes all 0, and from the LED flashing pattern I can see that the slave-side SPI transaction callback is no longer triggered.

Unfortunately, I haven't yet figured out the proper ST-Link setup for this board, so right now I cannot step through the code to see what is going on. I have attached my main.c, and any suggestion is welcome. I would also greatly appreciate it if anyone could confirm that SPI in slave mode is capable of achieving this throughput.
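As a rough sanity check on the numbers in this post, the required bit rate can be worked out in a few lines of C. This is only a sketch under stated assumptions: "10.5KB" is taken as 10.5 * 1024 bytes, one data bit is moved per SCK cycle, and gaps between transactions are ignored. The function and macro names are illustrative and do not come from the demo code.

```c
#include <stdint.h>

/* Assumption: "10.5KB" means 10.5 * 1024 = 10752 bytes per transaction. */
#define BLOCK_BYTES        10752u
#define BLOCKS_PER_SECOND  240u

/* Payload bits that must cross the bus each second. */
static uint32_t required_bits_per_second(void)
{
    return BLOCK_BYTES * 8u * BLOCKS_PER_SECOND;
}

/* Bus utilisation in percent (rounded down) for a given SCK frequency,
 * assuming one data bit per SCK cycle and no idle time between
 * transactions. */
static uint32_t bus_utilisation_percent(uint32_t sck_hz)
{
    return (uint32_t)(((uint64_t)required_bits_per_second() * 100u) / sck_hz);
}
```

Under these assumptions the load is 20,643,840 bit/s. bus_utilisation_percent(32000000u) gives 64, while bus_utilisation_percent(16000000u) gives 129, so a 16 MHz SCK could not carry the full target rate even if the link were stable.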

S.Ma · ‎2019-06-18

On the STM32 MCU side, I would first implement the SPI slave with DMA in cyclic mode, for both RX and TX.

Then I would configure the NSS pin as an EXTI interrupt to detect the falling edge (to reinit the DMA) and the rising edge (to process the data).

NSS acts like the I2C START and STOP conditions: it resyncs whatever is going on.

On most STM32, an SPI master can run at up to core clock / 2, while slaves are limited to core clock / 4.

21 MHz may be doable between boards with flying wires, although it might be quite close to the limit.

On the master side, after asserting NSS, add a small delay before starting the data flow, because the slave is still running its EXTI interrupt.

Similarly, when NSS goes high, keep it high for a few usec to let the processing ISR complete before the next frame.

And to be sure, when NSS goes high, reset the SPI cell entirely if you can't flush the SPI FIFOs (this depends on the SPI IP generation).
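The NSS-driven resync scheme described above can be sketched as a small, hardware-independent state machine. This is a minimal sketch, not a HAL implementation: nss_on_edge, nss_hooks_t and nss_state_t are hypothetical names, and the two hooks stand in for whatever board-specific code re-arms the DMA (and flushes or resets the SPI) and processes a received frame.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hooks, not HAL functions. On real hardware rearm_dma would
 * restart the RX/TX DMA (and flush or reset the SPI where possible), and
 * process_frame would hand the received buffer to the application. */
typedef struct {
    void (*rearm_dma)(void);
    void (*process_frame)(void);
} nss_hooks_t;

typedef struct {
    bool     frame_open;   /* true between a falling and a rising NSS edge */
    unsigned frames_done;  /* frames closed by a rising edge */
    unsigned resyncs;      /* falling edges seen while a frame was open */
} nss_state_t;

/* Intended to be called from the EXTI handler with the NSS level read
 * just after the edge. */
static void nss_on_edge(nss_state_t *s, const nss_hooks_t *h, bool nss_high)
{
    if (!nss_high) {
        /* Falling edge: start of frame. Always re-arm, so a missed rising
         * edge or a glitch cannot leave the slave out of sync. */
        if (s->frame_open) {
            s->resyncs++;
        }
        s->frame_open = true;
        if (h->rearm_dma != NULL) {
            h->rearm_dma();
        }
    } else if (s->frame_open) {
        /* Rising edge: end of frame, process what was received. */
        s->frame_open = false;
        s->frames_done++;
        if (h->process_frame != NULL) {
            h->process_frame();
        }
    }
    /* A rising edge with no open frame is ignored. */
}
```

The point of the design is that every falling edge forces a known starting state, so the slave recovers on the next frame instead of depending on both boards having started in step.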

HZhu.12 · ‎2019-06-19

Hi, thank you very much for your quick reply and for confirming that SPI performance differs between master and slave. Could you please confirm that the SPI slave is limited to core clock / 4, not bus clock / 4? On the chip I am working with, the STM32MP157, the M4-side clocks are set up by the A7 core (Linux side). By changing the prescaler parameter (hspi4.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_4), I have already verified that the SPI clock is scaled from a 64 MHz source. If you confirm that the limit is core clock / 4, then I need to check how the core speed is set up; I hope it is running at full speed, 209 MHz, rather than 64 MHz. But if the limit is bus clock / 4, then it makes sense that the slave side does not work at 32 MHz.
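To make the two readings of the rule concrete, here is a minimal sketch that applies the "reference clock / 4" rule of thumb quoted in this thread to both candidate clocks. The divide-by-4 factor is the thread's rule of thumb for most STM32 parts, not a verified STM32MP157 datasheet value, and the function names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rule of thumb quoted in this thread (not a datasheet value):
 * slave SCK must not exceed reference clock / 4. */
static uint32_t max_slave_sck_hz(uint32_t ref_clock_hz)
{
    return ref_clock_hz / 4u;
}

/* True when the requested SCK fits within that rule of thumb. */
static bool sck_within_rule(uint32_t sck_hz, uint32_t ref_clock_hz)
{
    return sck_hz <= max_slave_sck_hz(ref_clock_hz);
}
```

With a 64 MHz reference the limit is 16 MHz, so 32 MHz falls outside it, which would match the observed behavior. With a 209 MHz reference the limit is 52.25 MHz and 32 MHz would fit.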

Another thing I noticed is that the code crashes within 12 hours. My test sends a 10 KB data block, framed by NSS, in each SPI transaction and loops forever. Even with the SPI clock set to 16 MHz, the slave side always crashes after a while; so far it has not made it through an overnight test. On the master side, sending works fine at 16 MHz, but at 32 MHz the code crashes after a while. I don't know the exact time yet, but it failed my overnight test: in the morning the LED had stopped flashing, which indicates that the end-of-transaction callback was no longer triggered, and the logic analyzer showed no activity on the SPI bus. Have you seen similar behavior when the data throughput approaches the SPI limit?

If the SPI bus cannot deliver the throughput my design requires, whether because of the DMA or the SPI hardware, then I need to consider other options. The current plan is to try the following:

1, Set SPI to 16-bit instead of 8-bit frames and set up the DMA to copy 2 bytes at a time instead of a single byte, in the hope of improving throughput.

2, Set the slave side to receive only, in the hope of saving half the DMA bandwidth. But I still don't understand why the code would crash only after a while if the cause is a memory bandwidth issue.

3, Get EVAL boards and rerun the test. With these Discovery boards the ST-Link is somehow not working, so I cannot debug the M4 core, cannot see where the code crashes, and cannot check internal register values. But this step really depends on whether the first two steps help with the throughput approaching the SPI + DMA limit.

Any suggestions are welcome.

S.Ma · ‎2019-06-19

You have to read the reference manual to find out what the maximum slave frequency is; I am not an MPU specialist.

I have run 13 STM32L4R5 devices (1 SPI master, 12 SPI slaves) with 2 kbyte transfers per msec for a few hours, with core SYSCLK = 48 MHz and SPI at 12 MHz, steady and glitch free.

To me, 8-bit versus 16-bit SPI mode wouldn't change anything because of the FIFOs in the SPI. It just saves a few bus cycles, for those who like to optimize clock cycles.

It is important that the SPI slaves run the DMA in cyclic mode on a buffer.

Consider all interrupts that could fire (including SysTick, USB, etc.) and their priorities to find the worst-case delay before the EXTI ISR kicks in.

Some debug advice, try this:

Master:
 
	  HAL_GPIO_WritePin(GPIOE, GPIO_PIN_11, GPIO_PIN_RESET);
	  /* ==> Add a 100 usec delay here for test */
	  if(HAL_SPI_TransmitReceive_DMA(&hspi4, (uint8_t*)aTxBuffer, (uint8_t *)aRxBuffer, TRANSER_BLOCK_SIZE) != HAL_OK)
 
...
	    case TRANSFER_COMPLETE :
	      /* ==> Add a 100 usec delay here for test */
	      HAL_GPIO_WritePin(GPIOE, GPIO_PIN_11, GPIO_PIN_SET);

I'm not familiar with

hspi4.Init.MasterInterDataIdleness = SPI_MASTER_INTERDATA_IDLENESS_00CYCLE;

But if you only have a point-to-point link, you don't care that the pins are always selected...

For the slave, I wouldn't implement this directly in the main loop. Here you rely too much on both boards starting perfectly at the same time and on every packet being perfect. If there is a glitch on the SCK line, you are dead in the water, with no recovery until both sides are reset. Put a breakpoint on the slave side and boom, you can't continue because it won't resync.

Do use interrupts: use EXTI on both edges of NSS (and yes, the HAL probably won't help here, you will have to hack) and reinit the DMA in cyclic mode.

If it is possible in this SPI to flush the FIFOs, do it when NSS toggles.

When you are a slave and don't control the data flow, you generally need an interrupt scheme.

HZhu.12 · ‎2019-06-20

I totally agree with your point about calling HAL_SPI_TransmitReceive_DMA() in the main loop on the slave side. SPI transactions should be synchronized by the NSS pin only, and after initialization the hardware should trigger the slave side automatically.

The demo example from ST requires resetting the slave board and starting the slave-side code before the master side starts sending data. My original plan was to rewrite the code after confirming the SPI throughput capability; unfortunately, the demo doesn't work as expected.

You mentioned that, because of the FIFOs, changing from 8-bit to 16-bit would not do much. How much would receive-only mode on the slave side help? Do you have any experience with it?

S.Ma · ‎2019-06-20

You need to focus on functionality before optimising. True, most of the examples in Cube demonstrate that the hardware peripheral functions per the spec, which is very different from interacting with other devices overnight, glitch free. This is the main difference between the spec and the real application world.

The SPI IP went through 3 generations, and the MPU and H7 probably use the 3rd one. To me, optimising the STM32 bus bandwidth taken by 8- or 16-bit transfers should be done AFTER the application works reliably. You know that the SPI slave runs 4 times slower than SYSCLK, so at least 32 core clocks elapse before each 8-bit DMA transfer. Packing transfers as 16-bit changes this to 1 clock taken by the DMA every 64, which is not an order-of-magnitude difference... If you are that tight, it means the application is ready to break, unless you are, for example, optimising power by lowering SYSCLK...
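The 32-versus-64 clock figure above can be reproduced with a one-line calculation. This is a sketch assuming one DMA access per SPI data frame in one direction and SCK equal to the core clock divided by a fixed ratio; the function name is illustrative.

```c
#include <stdint.h>

/* Core clock cycles between consecutive DMA accesses for one SPI direction,
 * assuming one DMA access per data frame and SCK = core clock / ratio. */
static uint32_t core_clocks_per_dma_access(uint32_t frame_bits,
                                           uint32_t core_to_sck_ratio)
{
    return frame_bits * core_to_sck_ratio;
}
```

With a ratio of 4, 8-bit frames give one DMA access every 32 core clocks and 16-bit frames give one every 64, so packing halves the DMA traffic rather than changing it by an order of magnitude.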