2020-01-30 02:53 PM
We are migrating from the STM32F103 to the STM32L4. The code architecture is the same, as the product is basically the same; we are just upgrading the micro.
We have an SPI configured in slave mode and an external device that supplies the clock to it. DMA is used to feed the SPI with data. What we do is: first init the SPI, then init and load the DMA with data, then enable the DMA request, and a few milliseconds later enable the clock source. Everything is very basic and works absolutely fine on the STM32F103.
We found that the STM32L4, with practically the same code (save for specifics of the peripheral configuration), doubles the very first byte loaded into the DMA. Say, if you load a string of 5 bytes 0x01, 0x02, 0x03, 0x04 and 0x05, the SPI will output 0x01, 0x01, 0x02, 0x03, 0x04 and 0x05. Whatever the first byte is, it will be doubled (we suspect by the SPI), because the DMA count is correct: the DMA stops after 5 bytes, while the SPI outputs 6 bytes.
Has anyone seen anything similar? We are positive that it is not our bug; it would be almost impossible to produce even if we wanted to, and the same code works fine on the STM32F103.
2020-01-30 04:21 PM
Read out and post the content of the DMA and SPI registers just before starting the transfer.
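Something like this is enough to capture them (a rough sketch, assuming a device header and a working printf; pass in whichever SPI instance and DMA channel you actually use):

#include <stdio.h>
#include "stm32l4xx.h"   /* CMSIS device header, assumed to be in the project already */

/* Print the SPI and DMA channel registers right before the transfer is started. */
void dumpSpiDmaRegs(SPI_TypeDef *spi, DMA_Channel_TypeDef *ch)
{
    printf("SPI CR1=%08lX CR2=%08lX SR=%08lX\r\n",
           (unsigned long)spi->CR1, (unsigned long)spi->CR2, (unsigned long)spi->SR);
    printf("DMA CCR=%08lX CNDTR=%08lX CPAR=%08lX CMAR=%08lX\r\n",
           (unsigned long)ch->CCR, (unsigned long)ch->CNDTR,
           (unsigned long)ch->CPAR, (unsigned long)ch->CMAR);
}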
JW
2020-01-30 07:52 PM
I experienced this early on with my SPI work, when I transitioned from testing with polling to full DMA. It turned out that I'd stupidly kept the first manual SPI write in the code, and when the SPI raised TX Empty after that byte, the DMA served the first byte from the buffer again (of course). I'd expect that to fail on the F103 too, though - unless CPU speed is a factor?
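Roughly, the leftover looked like this (a reconstructed sketch, not my actual code - the instance, channel and buffer names are made up):

uint8_t txBuffer[16];                              /* example TX buffer */

void startWithLeftoverWrite(void)                  /* the buggy version, for illustration */
{
    LL_SPI_TransmitData8(SPI1, txBuffer[0]);       /* leftover manual write of byte 0      */
    LL_SPI_EnableDMAReq_TX(SPI1);                  /* TXE fires again after that byte...   */
    LL_DMA_EnableChannel(DMA1, LL_DMA_CHANNEL_3);  /* ...and DMA re-sends txBuffer[0], so  */
                                                   /* byte 0 goes out twice; the fix was   */
                                                   /* simply to delete the manual write.   */
}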
You don't mention which flavour of STM32L4 you're using. You also don't mention if this happens on your very first transaction, or just some transaction(s) sometimes.
The fact that your example is for an odd byte count raises the question: what is the PSIZE setting of the SPI TX DMA channel? Is 16-bit packing an issue (see "Packing with DMA" on page 1318 of RM0394 Rev 4, the STM32L4xxxx Reference Manual)? Although, I would expect that to repeat the last byte, not the first...
Incidentally, in the errata for the L432/L442, there's a section "2.15.1 BSY bit may stay high at the end of data transfer in slave mode". I'm not saying that's your issue, but it's something to consider.
2020-01-30 11:29 PM
Most STM32L4 parts use a newer version of the SPI IP than the F103 does.
Check the reference manual for the SPI FIFO.
On the F103, when you wrote 16 bits to the DR in 8-bit SPI mode, the MSB was discarded. Now it gets queued in the 32-bit FIFO, so check whether your DMA is set to 8-bit or 16-bit accesses.
This also gives the RXNE and TXE flags two different, selectable behaviours.
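For an 8-bit transfer the settings to verify on the L4 look roughly like this (LL calls; SPI3 and DMA2 channel 2 are just example instances):

/* Keep the SPI data width and both DMA data sizes at 8 bits, and set the RX FIFO
 * threshold to 1/4 so RXNE fires per byte instead of per half-word (relevant if
 * the RX side is used as well). */
LL_SPI_SetDataWidth(SPI3, LL_SPI_DATAWIDTH_8BIT);
LL_SPI_SetRxFIFOThreshold(SPI3, LL_SPI_RX_FIFO_TH_QUARTER);
LL_DMA_SetPeriphSize(DMA2, LL_DMA_CHANNEL_2, LL_DMA_PDATAALIGN_BYTE);
LL_DMA_SetMemorySize(DMA2, LL_DMA_CHANNEL_2, LL_DMA_MDATAALIGN_BYTE);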
Also, you should not need to freeze the clock to write to the peripheral or DMA registers; the clock must be on, and with the proper configuration-write sequence things will be glitch/error free.
One warning though: an SPI slave with DMA in circular mode will probably run into the issue that the TX FIFO is always kept filled by the DMA. When NSS goes high, that FIFO should be reset to prepare the next transmission. As there is no way to flush this FIFO (not even with the SPE bit), you may have to perform a reset through SYS/RCC and reconfigure the SPI. I couldn't find a milder way on the STM32L496 SPI.
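For what it's worth, the reset I mean is along these lines (a sketch using the LL bus calls, taking SPI3 on APB1 as the example instance; the full SPI init then has to be run again):

#include "stm32l4xx_ll_bus.h"

/* Flush the stuck TX FIFO the hard way: pulse the SPI3 reset line in RCC,
 * then reconfigure the SPI from scratch before the next transaction. */
static void resetSpi3(void)
{
    LL_APB1_GRP1_ForceReset(LL_APB1_GRP1_PERIPH_SPI3);
    LL_APB1_GRP1_ReleaseReset(LL_APB1_GRP1_PERIPH_SPI3);
    /* ...re-run LL_SPI_Init() / LL_SPI_Enable() here... */
}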
2020-01-31 12:37 AM
Note: I believe the OP was referring to the external Master SPI clock into the STM32L4 SPI Slave, not the internal STM32L4 clocks to the SPI peripheral...
As for the Slave DMA issue, I agree--although the easiest remedy is to be diligent about the number of bytes in every transfer. SPI "suffers" from the fact that while the slave is receiving its command, it's essentially required to transmit something to the master.
It is incumbent on the slave handling code to ensure that the TX DMA is either enabled after the command is received (along with managing the resultant underrun error), or "primed" with the required gibberish before the transfer starts.
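As a sketch of the "primed" variant (the instance, channel and buffer are placeholders; the point is only that the TX channel is already loaded before the master starts clocking):

/* Dummy reply kept armed in the TX DMA channel before NSS drops, so the slave
 * never underruns while it is still receiving the command bytes. */
static uint8_t txDummy[4] = { 0xFF, 0xFF, 0xFF, 0xFF };

static void primeSlaveTx(void)
{
    LL_DMA_SetMemoryAddress(DMA2, LL_DMA_CHANNEL_2, (uint32_t)txDummy);
    LL_DMA_SetDataLength(DMA2, LL_DMA_CHANNEL_2, sizeof(txDummy));
    LL_DMA_EnableChannel(DMA2, LL_DMA_CHANNEL_2);
    LL_SPI_EnableDMAReq_TX(SPI3);
}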
2020-01-31 02:44 AM
> On the F103, when you wrote 16 bits to the DR in 8-bit SPI mode, the MSB was discarded. Now it gets queued in the 32-bit FIFO, so check whether your DMA is set to 8-bit or 16-bit accesses.
Yes, this is what came to my mind too - and that is one of the things to check in the DMA registers I asked for; but if the DMA were set to 16-bit on the peripheral side, wouldn't *all* bytes be doubled?
JW
2020-01-31 02:48 AM
@JW Not really. In the middle of an extended transfer, if half-words are fed in but bytes are read out, the hardware handles the conversion by doing only one two-byte read for every two bytes out. It's the edge cases (odd-byte starts or ends) that are the "interesting" ones.
2020-01-31 10:45 AM
Hi everyone,
To give more information in response to your questions/suggestions:
The micro is STM32L496.
I can read and print all the registers (DMA and SPI) before the transfer starts, but I am not quite sure which field/bit to look at.
Some background: this is a radio. The SPI in slave mode gets its clock from an RF part (a transceiver acting as the SPI master). The micro prepares data for transmission by loading it into a memory array dmaBuffer, then configures the DMA to take data from that array and feed it to SPI3, configures SPI3 to work in slave mode, and finally instructs the transceiver to start the RF operation. The transceiver takes some time to do channel tuning etc., and once ready it starts outputting the clock on the SPI, in return getting the modulation data previously loaded into dmaBuffer and supplied to the SPI by the DMA.
Below is almost the complete application code:
LL_SPI_InitTypeDef SPI_Init_ToFromRadio;
LL_DMA_InitTypeDef DMA_InitSPItoRadio;
u8 dmaBuffer[4096];
void startTx(void)
{
LL_SPI_StructInit(&SPI_Init_ToFromRadio);
// change non-defaults to what we need (LL_SPI_StructInit defaults: slave mode, 8-bit data, full duplex)
SPI_Init_ToFromRadio.NSS = LL_SPI_NSS_SOFT;
SPI_Init_ToFromRadio.BitOrder = LL_SPI_LSB_FIRST;
LL_SPI_DeInit(SPI3);
LL_SPI_Init(SPI3, &SPI_Init_ToFromRadio);
LL_SPI_Enable(SPI3);
// DMA2 channel2 configuration SPI to Radio
LL_DMA_DeInit(DMA2, LL_DMA_CHANNEL_2);
DMA_InitSPItoRadio.PeriphOrM2MSrcAddress = LL_SPI_DMA_GetRegAddr(SPI3);
DMA_InitSPItoRadio.MemoryOrM2MDstAddress = (u32)&dmaBuffer[0];
DMA_InitSPItoRadio.Direction = LL_DMA_DIRECTION_MEMORY_TO_PERIPH;
DMA_InitSPItoRadio.NbData = 16; //16 bytes as an example
DMA_InitSPItoRadio.PeriphOrM2MSrcIncMode = LL_DMA_PERIPH_NOINCREMENT;
DMA_InitSPItoRadio.MemoryOrM2MDstIncMode = LL_DMA_MEMORY_INCREMENT;
DMA_InitSPItoRadio.PeriphOrM2MSrcDataSize = LL_DMA_PDATAALIGN_BYTE;
DMA_InitSPItoRadio.MemoryOrM2MDstDataSize = LL_DMA_MDATAALIGN_BYTE;
DMA_InitSPItoRadio.Mode = LL_DMA_MODE_NORMAL;
DMA_InitSPItoRadio.Priority = LL_DMA_PRIORITY_VERYHIGH;
DMA_InitSPItoRadio.PeriphRequest = LL_DMA_REQUEST_3; // SPI3_TX request
LL_DMA_Init(DMA2, LL_DMA_CHANNEL_2, &DMA_InitSPItoRadio);
LL_DMA_EnableChannel(DMA2, LL_DMA_CHANNEL_2);
LL_SPI_EnableDMAReq_TX(SPI3); // Enable SPI3 Tx request
}
Please note that in the big picture everything works. We transmit data over the air to the receiver OK. The example above uses 16 bytes of data, but we transmit kilobytes when needed.
But we noticed that the data packet arrives at the receiver delayed by the time that corresponds to the duration of one byte (at a given link rate, i.e. the frequency of the SPI input clock). This delay screws up our timing, and that is how we finally traced it to the first byte being sent twice by the SPI.
What it (supposedly) means is that we don't have any glaring errors in configuration. The DMA and SPI work as expected, save for the very first byte.
Following your input, we investigated more, and the story is now even stranger. We looked at the radio operation at different SPI clock rates (different link rates) and found that at some rates the DMA/SPI works normally, without the extra byte. The strange thing is that the micro has no knowledge of which clock frequency it will be getting from the RF transceiver: the DMA and SPI configuration is exactly the same, and this part of the code is not aware of the clock frequency that will be supplied to the SPI after the TX operation starts.
Also, there is no obvious correlation between the SPI clock frequency and the "double first byte" phenomenon. Say, at SPI clocks of 115 kHz and 230 kHz there is no problem; at 50 kHz, 170 kHz and 270 kHz there is. The CPU is running at 66 MHz and the SPI prescaler is 2 (the default), so the SPI and DMA should not have any issues working at such relatively low SPI speeds.
Playing with the DMA priority, the dmaBuffer[] alignment, or the number of bytes loaded into the DMA does not change the situation. At some (but not all) SPI clock frequencies, the SPI outputs one extra byte, and it is always a repeat of the very first byte.
It is as if there is some race condition inside the SPI IP block when it works with DMA in slave mode and the clock starts. To be fair, we don't know if the DMA has anything to do with it; maybe the issue can be reproduced without DMA. We haven't tried that yet, because in the final product we must use DMA, so I need to fix it with DMA.
If you have any idea what is happening, I would appreciate any suggestion. If not, I will likely have to isolate and replicate the issue on the NUCLEO dev board for this micro, using only the board's resources, and send the project to someone at ST in the hope they can reproduce it.
Cheers!
2020-01-31 11:59 AM
How about the very first transaction vs the following ones? Is the first one ok and the next ones irregular? Is the issue repeatable?
Could it be an issue that the NSS high time is not long enough for the DMA interrupt to complete before the next NSS falling edge?
In my case, I usually use an EXTI interrupt on the falling and/or rising edge of NSS in slave mode to re-init the DMA properly for the next transaction.
Comparing the F103 and the L4, how about the interrupt sources and priorities? The same? Same compiler? (Or is the compiler optimisation different, affecting the interrupt duration?)
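To illustrate the NSS EXTI re-arm idea (a rough sketch; EXTI line 4 and the transfer length are placeholders, and the GPIO/EXTI/NVIC setup is assumed to be done elsewhere):

#define TX_LEN 16u  /* placeholder: length of the next transfer */

/* EXTI handler on the NSS pin: at the end of a transaction, re-arm the TX DMA
 * channel so the next transfer starts cleanly from the top of dmaBuffer. */
void EXTI4_IRQHandler(void)
{
    if (LL_EXTI_IsActiveFlag_0_31(LL_EXTI_LINE_4))
    {
        LL_EXTI_ClearFlag_0_31(LL_EXTI_LINE_4);
        LL_DMA_DisableChannel(DMA2, LL_DMA_CHANNEL_2);
        LL_DMA_SetMemoryAddress(DMA2, LL_DMA_CHANNEL_2, (uint32_t)dmaBuffer);
        LL_DMA_SetDataLength(DMA2, LL_DMA_CHANNEL_2, TX_LEN);
        LL_DMA_EnableChannel(DMA2, LL_DMA_CHANNEL_2);
    }
}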
2020-01-31 12:33 PM
Every transaction shows this behavior, so the issue is absolutely repeatable. We transmit, say, every 30 ms for a duration of about 4 ms, and I can freely capture each transmission with a logic analyzer on the SPI lines; every single one has the issue.
No, NSS is not used - we set LL_SPI_NSS_SOFT in the configuration. As such, there is no potential for interrupt latency/priority issues related to SPI/DMA and NSS. In fact, interrupts are not used for either SPI or DMA; there is no need for them, given the DMA and the system architecture.
The compiler is basically the same, and in any case I don't see how it could affect the SPI in this way.
At the end of a transaction we disable the SPI and disable the DMA channel until next time, so the system keeps no state from previous transactions. I think it is safe to assume that the problem is related to the initialization done for each and every transfer.
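In LL terms, that end-of-transaction teardown is roughly this (simplified sketch, mirroring startTx() above):

void stopTx(void)
{
    LL_SPI_DisableDMAReq_TX(SPI3);                 // stop new SPI3 TX DMA requests
    LL_DMA_DisableChannel(DMA2, LL_DMA_CHANNEL_2); // halt the DMA channel
    LL_SPI_Disable(SPI3);                          // park the SPI until the next startTx()
}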