SPI timing

bsakas · ‎2023-06-23

Hello,

I am familiarizing myself with the SPI controller on a NUCLEO-F303K8 development board using the CMSIS device headers. I have written a program to send and receive a single byte using polling or interrupts. While it appears functionally correct, I am a bit confused by the timing diagrams generated by my logic analyzer.

Overall configuration: System clock is HSI at 8MHz. SPI clock has default prescaler of 2, so 4MHz.

The above is the result for a polling transaction.

Interval 0: SPI enable to CS low, 6 cycles - makes sense

Interval 1: CS low to shift register filled with tx data, 3 cycles (1 CPU cycle to store, 1 SPI cycle to put it on the line?)

Interval 2: Shift register filled with tx data to transaction start, 3 cycles - not sure what happens here

Interval 3: The transaction, 16 CPU/8 SPI cycles - makes sense

Interval 5 (manually added): Reading the shift register and setting CS high - should be 4 cycles (marked it wrong in diagram)
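
For reference, the arithmetic behind the clock figures and Interval 3 can be written out as a small host-side C sketch. It assumes the SPI is clocked from an undivided 8 MHz APB clock, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* SCK frequency for a given peripheral clock and baud-rate prescaler.
 * Assumption: the SPI is clocked at pclk_hz (APB clock not divided). */
uint32_t spi_sck_hz(uint32_t pclk_hz, uint32_t prescaler)
{
    assert(prescaler >= 2u);  /* the SPI divides its clock by at least 2 */
    return pclk_hz / prescaler;
}

/* Length of one frame on the wire, expressed in CPU (pclk) cycles. */
uint32_t spi_frame_cpu_cycles(uint32_t bits, uint32_t prescaler)
{
    return bits * prescaler;
}

/* Length of one frame on the wire in nanoseconds (rounded down). */
uint32_t spi_frame_ns(uint32_t bits, uint32_t sck_hz)
{
    assert(sck_hz > 0u);
    return (uint32_t)(((uint64_t)bits * 1000000000u) / sck_hz);
}
```

With the values from this post, an 8-bit frame at prescaler 2 takes 16 CPU cycles, or 2000 ns at a 4 MHz SCK, which matches Interval 3.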

As the SPI is configured with CPOL = CPHA = 0, I would expect RXNE to go high as soon as the last rx bit is latched at Marker 4, though I suppose there could be more overhead from the FIFO. The left end of Interval 6 is where I would expect the CPU to find RXNE high (within a few cycles of the polling loop), read the shift register, and set CS high, but this does not happen until Interval 5, around 12 cycles later. Interval 6 is the one I'm interested in. I found similar results using interrupts to respond to RXNE.

Above is the interrupt transaction. Similar intervals are highlighted, although here Interval 5 and Interval 4 correspond to the expected Cortex-M4 interrupt latency (12 cycles) and to reading the shift register and setting CS high, respectively. The interval of interest is from Marker 3 to the left end of Interval 5.

Finally, here is the code.

#include <stdint.h>
#include "stm32f3xx.h"
#include "core_cm4.h"

/*
 * SCK = PA5, MOSI = PB5, MISO = PB4, CS = PB6
 */

#define GPIO_MODE_OUTPUT 0b01
#define GPIO_MODE_ALTERNATE 0b10
#define GPIO_SPEED_HIGH 0b10
#define GPIO_PULLUP 0b01
#define GPIO_SPI_ALTERNATE 5

/*
 * SCK: PA5, push-pull, high speed, no pull, alternate function mode
 * MOSI: PB5, push-pull, high speed, no pull, alternate function mode
 * MISO: PB4, push-pull, high speed, no pull, alternate function mode
 * CS: PB6, push-pull, high speed, pullup, output mode
 */
void gpio_init(void)
{
  GPIOA->MODER |= (GPIO_MODE_ALTERNATE << GPIO_MODER_MODER5_Pos);
  GPIOA->OTYPER = 0;
  GPIOA->OSPEEDR = (GPIO_SPEED_HIGH << GPIO_OSPEEDER_OSPEEDR5_Pos);
  GPIOA->PUPDR = 0;
  GPIOA->AFR[0] = (GPIO_SPI_ALTERNATE << GPIO_AFRL_AFRL5_Pos);

  GPIOB->MODER = (GPIO_MODE_ALTERNATE << GPIO_MODER_MODER4_Pos)
               | (GPIO_MODE_ALTERNATE << GPIO_MODER_MODER5_Pos)
               | (GPIO_MODE_OUTPUT << GPIO_MODER_MODER6_Pos);
  GPIOB->OTYPER = 0;
  GPIOB->OSPEEDR = (GPIO_SPEED_HIGH << GPIO_OSPEEDER_OSPEEDR4_Pos)
                 | (GPIO_SPEED_HIGH << GPIO_OSPEEDER_OSPEEDR5_Pos)
                 | (GPIO_SPEED_HIGH << GPIO_OSPEEDER_OSPEEDR6_Pos);
  GPIOB->PUPDR = (GPIO_PULLUP << GPIO_PUPDR_PUPDR6_Pos);
  GPIOB->AFR[0] = (GPIO_SPI_ALTERNATE << GPIO_AFRL_AFRL4_Pos)
                | (GPIO_SPI_ALTERNATE << GPIO_AFRL_AFRL5_Pos);
  GPIOB->BSRR = GPIO_BSRR_BS_6;
}

/*
 * Master mode, software NSS management, 8-bit receive FIFO threshold
 */
void spi_init(void)
{
  SPI1->CR1 = SPI_CR1_MSTR | SPI_CR1_SSM | SPI_CR1_SSI;
  SPI1->CR2 = SPI_CR2_FRXTH;
  SPI1->CR1 |= SPI_CR1_SPE;
}

/*
 * Polled SPI transaction
 */
void spi_tx_rx_poll(void)
{
  GPIOB->BRR = GPIO_BRR_BR_6;
  *((volatile uint8_t*)&SPI1->DR) = 0xAA;
  while (!(SPI1->SR & SPI_SR_RXNE)) {}
  uint8_t rx = *((volatile uint8_t*)&SPI1->DR);
  GPIOB->BSRR = GPIO_BSRR_BS_6;
}

/*
 * Interrupt SPI transaction
 */
void spi_tx_rx_int(void)
{
  SPI1->CR2 |= SPI_CR2_RXNEIE;
  NVIC_EnableIRQ(SPI1_IRQn);

  GPIOB->BRR = GPIO_BRR_BR_6;
  *((volatile uint8_t*)&SPI1->DR) = 0xAA;
}

void SPI1_IRQHandler(void)
{
  uint8_t rx = *((volatile uint8_t*)&SPI1->DR);
  GPIOB->BSRR = GPIO_BSRR_BS_6;
}

int main(void)
{
  RCC->AHBENR = RCC_AHBENR_GPIOAEN | RCC_AHBENR_GPIOBEN;
  RCC->APB2ENR = RCC_APB2ENR_SPI1EN;

  gpio_init();
  spi_init();
  spi_tx_rx_poll();
  spi_tx_rx_int();

  while (1) {}
}

Thanks in advance for any clarification on these timings, and let me know if there's any other information I can provide or experiments I could run.

waclawek.jan · ‎2023-06-23

This is not your friendly microcontroller anymore, it's an SoC, with all the consequences.

First, you should be looking at disasm, not the C source.

Then, instruction execution is not single-cycle. You may have cheated the FLASH latency by running at a low clock, but jumps take more than a single cycle. Plus, whatever data an instruction reads, the instruction must wait until the system provides that data (writes too, but they are generally faster due to buffers at various points of the system, so you won't see a slowdown until you write twice in a row).

SPI is on an APB bus, beyond the S-port of the processor, the bus matrix, and an AHB-to-APB bridge. Add one cycle for each of these, maybe two for the AHB-APB bridge for reading (?). These things are not very well publicly documented (read: not at all).
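
As a rough back-of-the-envelope sketch of that stacking, the per-stage costs can simply be summed. The cycle counts used here are the guesses from this post (one per stage, possibly two for the bridge on reads), not documented figures, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed extra cycles added by each stage between the core and an APB
 * peripheral. These are estimates, not values from any datasheet. */
struct bus_path {
    uint32_t s_port;      /* processor system (S) port */
    uint32_t bus_matrix;  /* AHB bus matrix */
    uint32_t apb_bridge;  /* AHB-to-APB bridge */
};

/* Total extra cycles for one access along the path. */
uint32_t bus_path_extra_cycles(const struct bus_path *p)
{
    assert(p != NULL);
    return p->s_port + p->bus_matrix + p->apb_bridge;
}

/* Extra cycles accumulated by a polling loop that reads the
 * peripheral register `reads` times. */
uint32_t polling_extra_cycles(const struct bus_path *p, uint32_t reads)
{
    return reads * bus_path_extra_cycles(p);
}
```

The point is only that each register read in a polling loop pays the whole path again, so a few iterations add up quickly; the real numbers have to come from measurement.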

Welcome to the glorious world of 32-bitters.

JW

RhSilicon · ‎2023-06-23

This MCU can run at 72MHz, does it also show those times for 72MHz?

bsakas · ‎2023-06-23

Thanks for the detailed response; I figured there was a lot of complexity that I made too many assumptions about. I was primarily looking at the disasm when I made the post (I just posted the C source because I don't know if people want to see the asm), but as you have said, not all instructions complete in a cycle, and I did not take the peripheral bridges into account. I was indeed cheating the flash latency by running at 8MHz, so I didn't consider any wait states.

The article you linked was very helpful; I definitely underestimated the complexity of these systems and will have to rethink the way I work on them. That being said, is there actually any documentation on these sorts of low-level things? I know, for example, these buses are described in the AMBA specifications and ports in the architecture manuals, but manufacturers have their own implementations, right? If there is documentation, will reading it allow me to know the exact latencies? Does knowing about these things allow me to write better code to take advantage of it, or is it simply too obscure/insignificant to worry about? Premature optimization is evil and all; I might just be in over my head.

bsakas · ‎2023-06-23

I have only used the HSI as system clock in my projects, so I would need to look into how to use the PLL. However, according to JW's response, if I ran the CPU at 72MHz and SPI at 36MHz, I would see delays proportional to the delays above (just scaled down) with added flash fetch delays. If I ran the CPU at 72MHz and kept SPI at say, 4MHz, then the delays should be much smaller. It looks like it is just intrinsic to the architecture.
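
The scaling argument can be made concrete with a little integer arithmetic. This is only a sketch: it assumes a fixed overhead of 12 CPU cycles (the figure discussed above), ignores flash wait states, assumes SCK is derived from an undivided APB clock, and uses made-up helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of a fixed CPU-cycle overhead in nanoseconds (rounded down). */
uint32_t overhead_ns(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz > 0u);
    return (uint32_t)(((uint64_t)cycles * 1000000000u) / cpu_hz);
}

/* The same overhead in hundredths of an SCK period, assuming
 * SCK = CPU clock / prescaler (no APB division). */
uint32_t overhead_sck_centi(uint32_t cycles, uint32_t prescaler)
{
    assert(prescaler > 0u);
    return (cycles * 100u) / prescaler;
}
```

At 8 MHz with prescaler 2, 12 cycles last 1500 ns, or 6 SCK periods. At 72 MHz with prescaler 16 (4.5 MHz SCK, the closest power-of-two setting to 4 MHz), the same 12 cycles last about 166 ns, or 0.75 of an SCK period.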

STOne-32 · ‎2023-06-23

Dear @bsakas ,

First, thank you for the initial post, which is very instructive on how to debug SPI timings and match direct C register accesses with CPU cycles (either in main or in the IRQ). Big thanks also to @waclawek.jan. Indeed, the right correspondence should use the generated assembly code, so you can map: C / Assembly / SPI signals.

Things become different if you use DMA instead of the CPU, and here we have an application note that gives some details on the cycles spent going through the bridges and the overall architecture: Using the STM32F0/F1/F3/Cx/Gx/Lx Series DMA controller - Application note. Hope it helps you.

Ciao

STOne-32

bsakas · ‎2023-06-23

Thank you, this document was very insightful and exactly what I wanted to know more about.

I forgot to consider a DMA transfer as well. I set up a routine to use the RX DMA channel and transfer complete interrupt to transmit and receive a byte, which should be functionally equivalent to using RXNE and the SPI interrupt. I found similar results to the normal interrupt method, but DMA will obviously be the go-to for transfers with more than a few bytes.

waclawek.jan · ‎2023-06-23

> I don't know if people want to see the asm,

Yes they do if it's relevant.

> is there actually any documentation on these sorts of low-level things?

No, just snippets here and there. And then you can also experiment, as you did, although results are hard to interpret, as there are always several things in play at once, so be very careful before you jump to conclusions.
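
One common way to run such experiments on a Cortex-M4 is the DWT cycle counter. The sketch below uses the architecturally defined ARMv7-M register addresses directly rather than CMSIS names. The helper names are made up, the register-touching functions only make sense on the target, and the measurement itself adds a few cycles, so treat the results as approximate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REG32(addr)     (*(volatile uint32_t *)(uintptr_t)(addr))
#define DEMCR_REG       REG32(0xE000EDFCu)  /* Debug Exception and Monitor Control */
#define DWT_CTRL_REG    REG32(0xE0001000u)  /* DWT control register */
#define DWT_CYCCNT_REG  REG32(0xE0001004u)  /* DWT cycle counter */
#define DEMCR_TRCENA    (1u << 24)          /* enables the DWT unit */
#define DWT_CYCCNTENA   (1u << 0)           /* enables the cycle counter */

/* Enable and zero the cycle counter (target only). */
void cyccnt_start(void)
{
    DEMCR_REG |= DEMCR_TRCENA;
    DWT_CYCCNT_REG = 0u;
    DWT_CTRL_REG |= DWT_CYCCNTENA;
}

/* Read the current cycle count (target only). */
uint32_t cyccnt_now(void)
{
    return DWT_CYCCNT_REG;
}

/* Cycles between two readings; unsigned subtraction handles a single
 * 32-bit wraparound of the counter. */
uint32_t cyccnt_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Measure one call of fn, including call and read overhead (target only). */
uint32_t cyccnt_measure(void (*fn)(void))
{
    assert(fn != NULL);
    uint32_t start = cyccnt_now();
    fn();
    return cyccnt_elapsed(start, cyccnt_now());
}
```

On the target, something like cyccnt_start() followed by cyccnt_measure(spi_tx_rx_poll) would give a cycle count to compare against the logic analyzer trace.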

> I know, for example, these buses are described in the AMBA specifications and ports in the architecture manuals, but manufacturers have their own implementations, right?

Yes, and those are generally not publicly documented. One of those snippets is in the DMA appnotes (for 'F3 the single-port one STOne linked above, but the dual-port DMA's AN contains similar snippets of wisdom too). Another such snippet is the mention of a one-cycle delay on reading through the S-port of the processor in some of ARM's documents; another is the fact that you can switch off the write buffer at that port and the consequences are observable.

> Does knowing about these things allow me to write better code to take advantage of it, or is it simply too obscure/insignificant to worry about?

Better code probably not, but being aware of them prevents you from being shocked when you come across their consequences. As mentioned above, you'll start to consider more hardware solutions, such as using DMA to transfer data or a timer to toggle pins at precise intervals, but you'll also understand that some of those have the same limitations (namely DMA, as it relies heavily on exactly the same bus system with the same latencies, wait mechanisms, collisions with other masters, etc.).

> higher clocks

Besides FLASH latency (and consequently the effects of any mechanism used to mitigate it: prefetch, single-line cache, jumpcache/datacache a.k.a. ART, full-fledged L1 cache in 'F7), on some STM32 you are also forced to divide the clock for the APB buses. You'll then start to see various effects stemming from both the AHB/APB bridge synchronization and delays/resynchronization between various elements within the design. The best known of them is probably the interrupt-source-clear-arrives-late-to-NVIC-causing-interrupt-reentry problem.

Generally, the "better" and faster the mcu gets, the more pronounced these effects are. The one we most often see causing surprise here is GPIO toggling on the 'H7, which is surprisingly slow.

JW