32F417 SPI running at half the speed it should

PHolt.1 · ‎2022-01-11

Hello,

I am using a chopped-down function from the Cube HAL code to talk to a serial FLASH. The original function has a mass of options which were checked at runtime - the usual thing for HAL code...

It's all working fine. SPI2, 21MHz (the fastest I can go with PCLK1=PCLK2=42MHz).

But testing the speed, I am seeing 600us time taken to read 512 bytes. The SPI speed limit should be about 200us. The actual time, obtained by waggling a pin, is 600us. Can anyone see anything obviously wrong with this code? As I say, it works perfectly.

// Used only for SPI2. Mode is always SPI_MODE_MASTER and 2LINE.
 
//__attribute__((optimize("O0")))
HAL_StatusTypeDef B_HAL_SPI_TransmitReceive(SPI_HandleTypeDef *hspi, uint8_t *pTxData, uint8_t *pRxData, uint16_t Size)
{
  uint16_t             initial_TxXferCount;
  uint32_t             tmp_mode;
  HAL_SPI_StateTypeDef tmp_state;
 
  /* Variable used to alternate Rx and Tx during transfer */
  uint32_t             txallowed = 1U;
  HAL_StatusTypeDef    errorcode = HAL_OK;
 
  /* Init temporary variables */
  tmp_state           = hspi->State;
  tmp_mode            = hspi->Init.Mode;
  initial_TxXferCount = Size;
 
  if (!((tmp_state == HAL_SPI_STATE_READY) || \
        ((tmp_mode == SPI_MODE_MASTER) && (hspi->Init.Direction == SPI_DIRECTION_2LINES) && (tmp_state == HAL_SPI_STATE_BUSY_RX))))
  {
    errorcode = HAL_BUSY;
    goto error;
  }
 
  if ((pTxData == NULL) || (pRxData == NULL) || (Size == 0U))
  {
    errorcode = HAL_ERROR;
    goto error;
  }
 
  /* Don't overwrite in case of HAL_SPI_STATE_BUSY_RX */
  if (hspi->State != HAL_SPI_STATE_BUSY_RX)
  {
    hspi->State = HAL_SPI_STATE_BUSY_TX_RX;
  }
 
  /* Set the transaction information */
  hspi->ErrorCode   = HAL_SPI_ERROR_NONE;
  hspi->pRxBuffPtr  = (uint8_t *)pRxData;
  hspi->RxXferCount = Size;
  hspi->RxXferSize  = Size;
  hspi->pTxBuffPtr  = (uint8_t *)pTxData;
  hspi->TxXferCount = Size;
  hspi->TxXferSize  = Size;
 
  /*Init field not used in handle to zero */
  hspi->RxISR       = NULL;
  hspi->TxISR       = NULL;
 
  /* Check if the SPI is already enabled */
  if ((hspi->Instance->CR1 & SPI_CR1_SPE) != SPI_CR1_SPE)
  {
    __HAL_SPI_ENABLE(hspi);
  }
 
  	/* Transmit and Receive data in 8 Bit mode */
 
    // The need for this initial byte is unknown
    if (initial_TxXferCount == 0x01U)
    {
      *((__IO uint8_t *)&hspi->Instance->DR) = (*hspi->pTxBuffPtr);
      hspi->pTxBuffPtr++;
      hspi->TxXferCount--;
    }
 
    while ((hspi->TxXferCount > 0U) || (hspi->RxXferCount > 0U))
    {
      /* Check TXE flag */
      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE)) && (hspi->TxXferCount > 0U) && (txallowed == 1U))
      {
        *(__IO uint8_t *)&hspi->Instance->DR = (*hspi->pTxBuffPtr);
        hspi->pTxBuffPtr++;
        hspi->TxXferCount--;
        /* Next Data is a reception (Rx). Tx not allowed */
        txallowed = 0U;
      }
 
      /* Wait until RXNE flag is reset */
      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE)) && (hspi->RxXferCount > 0U))
      {
        (*(uint8_t *)hspi->pRxBuffPtr) = hspi->Instance->DR;
        hspi->pRxBuffPtr++;
        hspi->RxXferCount--;
        /* Next Data is a Transmission (Tx). Tx is allowed */
        txallowed = 1U;
      }
 
    }
 
    // Clear overrun flag in 2 Lines communication mode because received is not read
    // For 45DBxx the Init.Direction is always SPI_DIRECTION_2LINES and is set up in b_main.c
    //if (hspi->Init.Direction == SPI_DIRECTION_2LINES)
    //{
    	__HAL_SPI_CLEAR_OVRFLAG(hspi);
    //}
 
error :
  hspi->State = HAL_SPI_STATE_READY;
  return errorcode;
}

PHolt.1 · ‎2022-01-12

It always does happen concurrently. You cannot receive a byte without sending one because it is only sending one (I am using only Master mode) that generates the clocks.

DMA is ****** complicated (I've done it for the waveform generator project and it worked only after somebody spent a fair bit of time on it) and it will certainly not work if a software loop like this cannot be made to work right.

BTW is there a config on this forum which avoids the "read more" at the bottom?

It would amaze me if ints were enabled because the NULL for the ISRs would crash the whole thing. Anyway, I did a tight loop at the start of the startup code, after SPI2 is initialised, and get the same thing.

Tesla DeLorean · ‎2022-01-12

But you can push data in the transmit buffer, when it's empty, and it will generate the clocks to pull out the data on the backside.

DMA isn't going to enforce your TX-ALLOWED, it will provide TWO channels, one that will feed when TXE signals, and one that will consume when RXNE signals.

>>BTW is there a config on this forum which avoids the "read more" at the bottom?

No, they have this worthless software for SalesForce that was written by a 3 year old, a dumber one,..

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

PHolt.1 · ‎2022-01-12

The thing is, I've been writing code for this board for a year now, and been timing various sections of it (by measuring inter-GPIO waggles) and that 800ns can't be disappearing into thin air. That is 134 CPU clocks, and I can't see anything in that code taking more than roughly 1/10 of that, and anyway most of it is pipelined behind the UART shifting the bytes in/out. Looking at the disassembly shows roughly 4-5 assembler instructions for each C line which is nowhere near enough to use up that many CPU clocks.

I don't see the point of tx-allowed. It's just too subtle for my brain : - ) Maybe the need for that flag is the key to this issue.

And it would not surprise me if nobody had discovered this before, because the result works and if stuff works, nobody will be checking it with a scope. SPI is sync comms; there is no "baud rate".

If you have any ideas I am all ears, but I will be away for a few days now.

waclawek.jan · ‎2022-01-12

> [why txallowed]

Consider this scenario: First write to DR fills the Tx buffer. SPI starts to transmit, the Tx buffer is thus empty, if txallowed is not checked, second write goes to the buffer. Then an interrupt cuts in, lasting longer than it takes to transmit both bytes. The result is, that there is one correctly received byte, and the second - as it finishes receiving before the first one has been read out - overflows and gets lost.

It's an afterthought and it prevents the Tx buffer to be used effectively. Because of the underlying problem (ie. that receiver can't hold 2 bytes at once), the possibility of utilizing the Tx buffer in polled implementation is limited, nevertheless better solutions are possible but would require rethinking how the routine works. Rewriting it in asm would probably help, too. Or, interrupts would need to be disabled, but then there goes universality (and still things like collisions with other busmasters would need to be considered, too).

> That is 134 CPU clocks

> I can't see anything in that code taking more than roughly 1/10 of that

That you can't see it, does not mean it does not exist. Maybe you should look harder.

> and anyway most of it is pipelined behind the UART shifting the bytes in/out.

I don't know what does this mean.

But you should stop using "UART" for the shift register (even USART is not approriate in this case, IMO).

> Looking at the disassembly shows roughly 4-5 assembler instructions for each C line which is nowhere near enough to use up that many CPU clocks.

Show.

For example, this

if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE)) && (hspi->TxXferCount > 0U) && (txallowed == 1U))

is likely to amount to 3 C lines, so maybe a dozen of instructions. And they don't execute in one CPU clock each. For example one of the first of those instructions reads from the SPI status register. That means, that the read request has to traverse the AHB-to-APB bridge and the data traverse the same bridge in the opposite direction. This is likely to last one to three APB cycles plus some AHB cycles, and given this is the "slow" APB at 1/4 of AHB, it means maybe a dozen of system cycles. And still the given flag has to be separated (maybe masked) and a conditional jump has to be performed.

I took your function and got it running on an 'F429 Disco at 180MHz system clock. And inserted the pin toggling (used the green LED on the Disco):

        while ((hspi->TxXferCount > 0U) || (hspi->RxXferCount > 0U))
        {
PIN_SET(GREEN_LED);
          /* Check TXE flag */
          if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE)) && (hspi->TxXferCount > 0U) && (txallowed == 1U))
          {
            *(__IO uint8_t *)&hspi->Instance->DR = (*hspi->pTxBuffPtr);
            hspi->pTxBuffPtr++;
            hspi->TxXferCount--;
            /* Next Data is a reception (Rx). Tx not allowed */
            txallowed = 0U;
          }
 
PIN_CLR(GREEN_LED);
          /* Wait until RXNE flag is reset */
          if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE)) && (hspi->RxXferCount > 0U))
          {
            (*(uint8_t *)hspi->pRxBuffPtr) = hspi->Instance->DR;
            hspi->pRxBuffPtr++;
            hspi->RxXferCount--;
            /* Next Data is a Transmission (Tx). Tx is allowed */
            txallowed = 1U;
          }
 
        }

(Yes, here I have an even worse probe than you). Yellow is SCK, purple is the green LED, timebase is 200ns/div. So we start with a cca 220ns (40 cyc) positive pulse which indicates the write to SPI->DR, then during transmission we see the alternating polling of Tx and Rx flags (even if TXE is set after the first SCK pulse, there's no more write to SPI->DR due to txallowed being zero), cca 100 ns each, maybe 15-20cyc. After the last SCK pulse, RXNE gets set, that's why we see a long negative pulse where SPI->DR gets read (cca 300ns, 55cyc); as that sets txallowed, a long positive pulse for SPI->DR write follows (again cca 220ns/40cyc).

The compiler decided to do the green LED GPIO_BSRR writes from registers, so those are truly single-cycle and don't impact the total timing too much.

Following is disasm of the portion where RXNE is tested and SPI->DR written, that together with the test at the beginning of loop lasts cca 300ns/55cyc, it's around 25 instructions. Given there are two reads from the slow APB bus included (amounting to perhaps two dozens of cycles), I think this is entirely justified (btw. all memory reads/writes go from/to the CCM, so they *are* single-cycle):

PIN_CLR(GREEN_LED);
 8000342:	6184      	str	r4, [r0, #24]            // write to GPIO_BSRR for LED to go down
          /* Wait until RXNE flag is reset */
          if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE)) && (hspi->RxXferCount > 0U))
 8000344:	689f      	ldr	r7, [r3, #8]            // read from SPI_SR
 8000346:	07ff      	lsls	r7, r7, #31           // check bit 0
 8000348:	d50f      	bpl.n	800036a <main+0x132>  // jump to beginning of loop, if not set
 800034a:	f8b2 8012 	ldrh.w	r8, [r2, #18]       // read RxXferCount 
 800034e:	fa1f f788 	uxth.w	r7, r8              // extend to word (RxXferCount is uint16_t)
 8000352:	b157      	cbz	r7, 800036a <main+0x132>  // compare-to-zero-and-branch
            (*(uint8_t *)hspi->pRxBuffPtr) = hspi->Instance->DR;
 8000354:	68ce      	ldr	r6, [r1, #12]           // read pRxBuffPtr
 8000356:	68db      	ldr	r3, [r3, #12]           // read SPI->DR
 8000358:	7033      	strb	r3, [r6, #0]          // write to *pRxBuffPtr
            hspi->RxXferCount--;
 800035a:	8a4e      	ldrh	r6, [r1, #18]        // read RxXferCount
            hspi->pRxBuffPtr++;
 800035c:	68cb      	ldr	r3, [r1, #12]          // read pRxBuffPtr (yeah gcc is not the brightest compiler out there)
            hspi->RxXferCount--;
 800035e:	3e01      	subs	r6, #1               // decrement RxXferCount -- the reorder is here probably in hope that processor can perform math on r6 while the previous read of r3 gets finalized, but Cortex-M4 is not superscalar
 8000360:	b2b6      	uxth	r6, r6               // and extend to word (again this might be optimized out by compiler)
            hspi->pRxBuffPtr++;
 8000362:	3301      	adds	r3, #1               // increment pRxBuffPtr
            hspi->RxXferCount--;
 8000364:	824e      	strh	r6, [r1, #18]        // store RxXferCount
            hspi->pRxBuffPtr++;
 8000366:	60cb      	str	r3, [r1, #12]          // store pRxBuffPtr
            txallowed = 1U;
 8000368:	2601      	movs	r6, #1               // set txallowed (it's kept in register)
 
 
        while ((hspi->TxXferCount > 0U) || (hspi->RxXferCount > 0U))
 800036a:	8953      	ldrh	r3, [r2, #10]        // and this is the beginning of loop (compiler physically placed it at the end)
 800036c:	b29b      	uxth	r3, r3
 800036e:	b913      	cbnz	r3, 8000376 <main+0x13e>  
 8000370:	8a4b      	ldrh	r3, [r1, #18]
 8000372:	b29b      	uxth	r3, r3
 8000374:	b1bb      	cbz	r3, 80003a6 <main+0x16e>  // jump out of the loop if both counts are zero
          if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE)) && (hspi->TxXferCount > 0U) && (txallowed == 1U))
 8000376:	6813      	ldr	r3, [r2, #0]           // this prepares  hspi->Instance into r3 
PIN_SET(GREEN_LED);
 8000378:	6185      	str	r5, [r0, #24]         // and here we set LED up again

> And it would not surprise me if nobody had discovered this before, because the result works and if stuff works, nobody will be checking it with a scope

Well, nobody who expects code to perform, uses polling. Even more, this is Cube, i.e. a vehicle to sell chips to managers, certainly not a path towards producing good code. That path is long, arduous and expensive, so somebody has to walk it, spend the time and pay the price.

JW

PHolt.1 · ‎2022-01-12

Thanks for the test.

I did wonder if the APB bus speed introduces some long delay.

So what's the solution?

The TX should hold 2 bytes. One loaded initially goes straight into the shift reg, SPI_FLAG_TXE should go to 1 immediately, and the other byte can be loaded immediately afterwards (checking SPI_FLAG_TXE just in case). So it seems possible to keep TX full, and if you keep TX full, you should get solid clocking.

I tried that but it didn't work for some reason.

It does seem obvious that for whatever reason the TX is discontinuous. If that can be solved, that will be the fix. The loop needs to focus on stuffing data into the TX at every opportunity.

I don't see the point in the 1 byte count special handling if not using interrupts.

I will be away for a few days unfortunately and while I will be online I won't be able to test any code.

PHolt.1 · ‎2022-01-12

Isn't the issue in two parts?

1) The txallowed flag is preventing filling of the TX pipeline (which can hold 2 bytes at the same time). It seems they are using that flag to prevent RX overflow.

2) The APB clock speed affects the speed of reads and writes to SPI. I don't know what you ran yours at but mine are 42MHz. I didn't realise just how slow peripheral accesses can be on the 32F4.

I do need a solution which works in the presence of interrupts.