2023-04-17 12:02 PM - edited 2023-11-20 07:54 AM
I have to interface an STM32U575 microcontroller with a very specialized SPI device for which the market offers no alternatives. According to the spec sheet, the chip select must be pulsed between every 2 bytes, for at least 154 ns. I have to read 34 bytes from this chip every 50 us. The chip allows a maximum SPI clock rate of 24 MHz.
Although I can talk to the device, the SPI controller does not seem to be fast enough.
SPI Configuration:
In my attempts to speed these transactions up, I have already ditched the HAL for reads/writes to the SPI peripheral during steady-state program flow.
SPI functions:
void SPIM3_INIT(int len)
{
    SPI3->CFG2 &= ~SPI_CFG2_COMM; // COMM = 00: full-duplex mode
    SPI3->CR2 = len >> 1;         // TSIZE: number of 16-bit frames in this transfer
    SPI3->CR1 |= SPI_CR1_SPE;     // enable the peripheral
}
int8_t SPIM3_XFER(uint8_t *txb, uint8_t *rxb, int len)
{
    volatile uint16_t *ptxdr_16bits = (volatile uint16_t *)&SPI3->TXDR;
    volatile uint16_t *prxdr_16bits = (volatile uint16_t *)&SPI3->RXDR;
    uint16_t tx_cnt = len >> 1; // divide by 2 for 16-bit transfers
    uint16_t rx_cnt = len >> 1;

    SPI3->CR1 |= SPI_CR1_CSTART; // Master transfer start
    while ((tx_cnt > 0UL) || (rx_cnt > 0UL))
    {
        if ((SPI3->SR & SPI_SR_TXP) && (tx_cnt > 0UL))
        {
            *ptxdr_16bits = *(const uint16_t *)txb;
            txb += 2;
            tx_cnt--;
        }
        if ((SPI3->SR & SPI_SR_RXP) && (rx_cnt > 0UL))
        {
            *((uint16_t *)rxb) = *prxdr_16bits;
            rxb += 2;
            rx_cnt--;
        }
    }
    while (!(SPI3->SR & SPI_SR_EOT))
        ;
    SPI3->CR1 &= ~SPI_CR1_CSTART;
    SPI3->IFCR |= SPI_IFCR_EOTC;  // clear end of transfer flag
    SPI3->IFCR |= SPI_IFCR_TXTFC; // clear transmission transfer filled flag
    return len;
}
void SPIM3_UNINIT(void)
{
    SPI3->IFCR |= SPI_IFCR_EOTC;  // clear end of transfer flag
    SPI3->IFCR |= SPI_IFCR_TXTFC; // clear transmission transfer filled flag
    SPI3->CR1 &= ~SPI_CR1_SPE;    // disable peripheral
    SPI3->IER = 0;                // disable interrupts
    SPI3->CFG1 &= ~(SPI_CFG1_TXDMAEN | SPI_CFG1_RXDMAEN); // disable TX/RX DMA requests
}
main:
uint8_t tbuf[2] = {0};
uint8_t rbuf[2];

for (int i = 0; i < 2; i++)
    tbuf[i] = i;

SPIM3_INIT(2);
while (1)
    SPIM3_XFER(tbuf, rbuf, 2);
SPIM3_UNINIT();
The problem is that I am seeing a large period of inactivity between transfers (about 900 ns), as shown in this logic analyzer trace:
In ideal conditions, I should be able to get the desired throughput with 20 MHz SPI: 2 bytes at 20 MHz take 800 ns to transfer, plus the 154 ns pulse --> 954 ns per transfer. Multiplying by 34 transfers means I should be able to get this done in about 32 us, meeting my 50 us deadline. In fact, I should be able to get away with up to a 670 ns delay between 2-byte transfers (not ideal), but 900 ns will certainly not work for me.
Where is this delay coming from? Is it a function of the SPI peripheral architecture? Can my code be optimized further in some way? Can I double buffer the data? Where in the spec sheet can I find reference to these limitations? I am willing to make memory and power trade-offs to increase the bandwidth. Is there some workaround I can use to achieve higher SPI throughput given the limitations of the device I am interfacing with?
2023-04-17 02:44 PM
Disclaimer: I haven't used the U5 and don't know how/if its SPI differs from the F4/L4/G4/F7 families.
Do you have any SPI interrupts enabled? Use another GPIO to measure the time spent in SPIM3_XFER(): set the pin before the call, clear it after the call. Then try bracketing just before setting CR1_CSTART and just after clearing it.
2023-04-17 04:04 PM
I haven't dealt with that kind of timing, but structure-wise and software-wise I'd consider a slight rewrite, with a different structure.
Firstly, I'm not sure I like the way the HAL-level drivers handle the CS when it is under hardware control. Granted, you're not using them, but it makes it impossible to keep CS low across multiple calls to the driver.
What I'd suggest, at least for the two-byte solution (and I don't see anything about CS in the code), would be to write a routine that locks the processor into a tight transmit/receive loop.
You might want to consider the LL drivers, or if you're happy with the drivers you have, just go with them.
I'm assuming you don't have interrupts enabled for this. I can't quite see how DMA would work here, given the timing. I'm also assuming you are not running an operating system, which would otherwise have to be considered.
You don't necessarily have much time left over to do much, though.
2023-04-17 04:16 PM - edited 2023-11-20 07:54 AM
@bob S Thank you for the suggestion! With your advice, I optimized my code to this (where SPI_INIT function is equivalent to SPIM3_INIT as shown above):
uint8_t buf_tx[2] = {0xAA, 0x55};
uint8_t buf_rx[2] = {0};

SPI_INIT();
while (1)
{
    TP0_GPIO_Port->BSRR = TP0_Pin;
    SPI3->CR1 |= SPI_CR1_CSTART; // Master transfer start
    SPI3->TXDR = *(uint16_t *)&buf_tx[0];
    *((uint16_t *)&buf_rx[0]) = SPI3->RXDR;
    while (!(SPI3->SR & SPI_SR_EOT))
        ;
    SPI3->CR1 &= ~SPI_CR1_CSTART;
    SPI3->IFCR |= SPI_IFCR_EOTC;  // clear end of transfer flag
    SPI3->IFCR |= SPI_IFCR_TXTFC; // clear tx xfer filled flag
    TP0_GPIO_Port->BRR = TP0_Pin;
}
SPI_UNINIT();
This produces the following waveform (channel 7 is the test pin):
With those changes, it looks like a SPI transfer clocks in at ~1.36 us. Times 34 transfers --> 46.24 us, which is just under my 50 us deadline. A big improvement, but still not really good enough.
I haven't been using interrupts. I thought that context switching between interrupt and thread mode, especially when using such a small packet size, would give even worse performance. That is what I saw when using interrupts with the HAL, and I didn't bother to optimize out the callback functions, etc. Do you think I would see improved performance with interrupts? I have also toyed with the idea of using DMA somehow, but again, with the small packet size, I'm not sure if it will be worth it. The TX data that is sent will be constant so I may be able to find some optimization there.
2023-04-17 05:08 PM
Another thing I have noticed is that changing the MIDI bits in SPI3->CFG2 doesn't seem to affect this delay. I have been setting them to zero, where I would expect NO delay. When I set them to 0xF, I see the same thing. I've verified they have been set/cleared by reading back the contents of the register.
2023-04-17 05:25 PM
I think you're wasting time on the init/uninit cycle. Do you need that?
I'd init once, then send the data, and UNINIT, if you need it. If the interface is dedicated, then why bother? Just don't send.
My idea was to lock the processor into a receive/transmit loop.
Give it a byte count and a pointer.
While the bytes remaining > 1: send two bytes, one at a time, with the mandatory delay between bytes if needed at all.
After each 2-byte burst, do the mandatory delay and decrement the count.
If the count is 0, you're done; if the count is 1, send one byte outside of the loop; if the count >= 2, stay in the loop. Remember that the loop sends two bytes and then delays, preserving the required minimum delay.
You're transmitting the whole thing as one block.
For DMA, I don't think (and I could be very wrong) that DMA can deliver the timing you need. Neither could an IRQ, unless you're going to do a 2-byte block transfer in one IRQ, and that's going to be awkward to implement. The two-byte timing hints strongly at programmed I/O.
For the TX data, since you're only reading, you can transmit a pre-programmed constant picked from a "don't care" list for the chip, but that depends on what the chip wants to see. Presumably, unless the chip is fixed in a read-only mode, you have to send a command/address and then start reading.
2023-04-18 11:03 AM
This doesn't make sense to me. The code in your post generated the signals in that image? It looks like it takes around 550 ns between your BRR at the end of the while() loop and the BSRR at the top of it. There should only be a handful of CPU instructions (at most) between those two lines, and that shouldn't take anywhere NEAR that long.
Hmmmm - what is your SYSCLK frequency? What compiler optimization level are you using? Look at the assembly listing and see what code is generated.
Try un-rolling your loop - put 2 (or 3 or 4) copies of the code in a row inside the while() loop and see if you get shorter delays between them. For example:
while (1)
{
    TP0_GPIO_Port->BSRR = TP0_Pin;

    SPI3->CR1 |= SPI_CR1_CSTART; // Master transfer start
    SPI3->TXDR = *(uint16_t *)&buf_tx[0];
    *((uint16_t *)&buf_rx[0]) = SPI3->RXDR;
    while (!(SPI3->SR & SPI_SR_EOT))
        ;
    SPI3->CR1 &= ~SPI_CR1_CSTART;
    SPI3->IFCR |= SPI_IFCR_EOTC;  // clear end of transfer flag
    SPI3->IFCR |= SPI_IFCR_TXTFC; // clear tx xfer filled flag

    SPI3->CR1 |= SPI_CR1_CSTART; // Master transfer start (second copy)
    SPI3->TXDR = *(uint16_t *)&buf_tx[0];
    *((uint16_t *)&buf_rx[0]) = SPI3->RXDR;
    while (!(SPI3->SR & SPI_SR_EOT))
        ;
    SPI3->CR1 &= ~SPI_CR1_CSTART;
    SPI3->IFCR |= SPI_IFCR_EOTC;  // clear end of transfer flag
    SPI3->IFCR |= SPI_IFCR_TXTFC; // clear tx xfer filled flag

    TP0_GPIO_Port->BRR = TP0_Pin;
}
2024-06-28 03:47 AM - edited 2024-06-28 03:47 AM
> Another thing I have noticed is that changing the MIDI bits in SPI3->CFG2 doesn't seem to affect this delay.
Did you (or HAL) set SPIx_CFG2->SSOM ?
2024-07-19 05:22 AM
Hi,
I'm afraid we are facing a similar problem. In our case, the time from calling the HAL RX API to SCK starting to toggle is more than 1 us!
It seems the U5 lowered not only the power consumption but also the performance!
2024-07-19 09:29 AM
@diverger, this thread is over a year old and the OP has disappeared without responding to three different suggestions. I suggest you read those suggestions carefully, try them, and report back.