STM32F103 HAL_SPI_TransmitReceive throughput much lower than expected – non-continuous SCK

Zaeem-Ahmed · 2026-01-14

Hello everyone,

I am using SPI2 of STM32F103RCT6 to communicate with an AT45DB641E DataFlash device. I am trying to understand why the effective SPI data rate is significantly lower than what I expect based on the configured SPI clock.

Hardware setup

MCU: STM32F103RCT6 (Cortex-M3)
SPI instance: SPI2
SPI clock: 9 MHz
Data frame: 8 bits
SPI mode: 2-line unidirectional (full-duplex hardware)
NSS: Software-controlled GPIO
Core clock: 72 MHz
Custom PCB (STM32 + AT45 connected directly)
Schematics attached

Test description

I am reading the status register of the AT45DB641E.
Once the read command is sent and SS is held low, the flash continuously outputs two status bytes as long as clock pulses are provided.

I measured the execution time of HAL_SPI_TransmitReceive() by placing timestamps before and after the function call, using the DWT cycle counter (DWT->CYCCNT) available on the Cortex-M3.

Here is the code used for testing. The tx and rx buffers are global uint8_t arrays of size 5000:

  /* Enable trace and debug block */
  CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
  /* Reset the cycle counter */
  DWT->CYCCNT = 0;
  /* Enable the cycle counter */
  DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

  uint32_t previous_count=0;
  uint32_t diff=0;
  tx_buffer[0] = status_reg_read_opcode;
  // starting to receive status byte continuously
  HAL_GPIO_WritePin(GPIOB, GPIO_PIN_12, GPIO_PIN_RESET);
  previous_count = DWT->CYCCNT;
  HAL_SPI_TransmitReceive(&hspi2, tx_buffer, rx_buffer, 5000, 1000);
  diff = DWT->CYCCNT - previous_count;
  HAL_GPIO_WritePin(GPIOB, GPIO_PIN_12, GPIO_PIN_SET);


static void MX_SPI2_Init(void)
{
  /* SPI2 parameter configuration*/
  hspi2.Instance = SPI2;
  hspi2.Init.Mode = SPI_MODE_MASTER;
  hspi2.Init.Direction = SPI_DIRECTION_2LINES;
  hspi2.Init.DataSize = SPI_DATASIZE_8BIT;
  hspi2.Init.CLKPolarity = SPI_POLARITY_LOW;
  hspi2.Init.CLKPhase = SPI_PHASE_1EDGE;
  hspi2.Init.NSS = SPI_NSS_SOFT;
  hspi2.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_4;
  hspi2.Init.FirstBit = SPI_FIRSTBIT_MSB;
  hspi2.Init.TIMode = SPI_TIMODE_DISABLE;
  hspi2.Init.CRCCalculation = SPI_CRCCALCULATION_DISABLE;
  hspi2.Init.CRCPolynomial = 10;
  if (HAL_SPI_Init(&hspi2) != HAL_OK)
  {
    Error_Handler();
  }
}

I have also attached my full main.c file.

Measured results

Below is the data I collected manually from the Expressions window in STM32CubeIDE (debug mode).

Bytes	Cycles	Measured time (us)	Ideal data rate (bytes/s)	Ideal time (us)	Difference (us)	Ratio (measured / ideal)
1000	270711	3759.8	1125000	888.8	2870.9	4.229
5000	1351484	18770.6	1125000	4444.4	14326.2	4.223

From this, it is clear that the actual transfer time is ~4.2× slower than the theoretical SPI transfer time.
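For anyone who wants to re-check the table, here is a minimal sketch of the arithmetic in C. The helper names (cycles_to_us, ideal_transfer_us, slowdown) are hypothetical and not part of HAL; the sketch assumes the 72 MHz core clock and 9 MHz SCK stated above, and 8-bit frames with no gap in the ideal case.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a DWT->CYCCNT difference to microseconds for a given core clock. */
double cycles_to_us(uint32_t cycles, double core_hz)
{
    assert(core_hz > 0.0);
    return (double)cycles / core_hz * 1e6;
}

/* Ideal time for n_bytes 8-bit frames sent back to back (assumes no gap). */
double ideal_transfer_us(uint32_t n_bytes, double sck_hz)
{
    assert(sck_hz > 0.0);
    return (double)n_bytes * 8.0 / sck_hz * 1e6;
}

/* Ratio of measured time to ideal time (the last column of the table). */
double slowdown(uint32_t cycles, uint32_t n_bytes, double core_hz, double sck_hz)
{
    return cycles_to_us(cycles, core_hz) / ideal_transfer_us(n_bytes, sck_hz);
}
```

With the measured values, slowdown(270711, 1000, 72e6, 9e6) comes out at about 4.23 and slowdown(1351484, 5000, 72e6, 9e6) at about 4.22.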

Logic analyzer observations

I then captured the SPI signals using a logic analyzer (the attached .sr capture file can be opened in PulseView).

Key observations:

The SCK is not continuous
The clock appears to be generated byte-by-byte
There is a noticeable gap of almost 2.9 us between successive bytes.
It looks like software is triggering the clock for each byte rather than the SPI peripheral running continuously. (Please correct me if my interpretation is wrong.)
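The gap can also be estimated from the DWT measurement alone, as a cross-check against the logic analyzer. This is a rough sketch with hypothetical helper names; it assumes the whole measured time is spent in identical per-byte iterations (function entry and exit overhead is ignored) and that one 8-bit frame occupies 8 SCK periods.

```c
#include <assert.h>
#include <stdint.h>

/* CPU cycles one SPI frame occupies on the wire: bits * (core clock / SCK). */
double frame_cycles(double core_hz, double sck_hz, unsigned bits)
{
    assert(sck_hz > 0.0);
    return (double)bits * core_hz / sck_hz;
}

/* Average idle gap per byte in CPU cycles: measured cycles per byte minus
   the frame time. Assumes every byte costs the same. */
double gap_cycles(uint32_t total_cycles, uint32_t n_bytes,
                  double core_hz, double sck_hz)
{
    assert(n_bytes > 0u);
    return (double)total_cycles / (double)n_bytes
           - frame_cycles(core_hz, sck_hz, 8u);
}

/* The same gap expressed in microseconds. */
double gap_us(uint32_t total_cycles, uint32_t n_bytes,
              double core_hz, double sck_hz)
{
    return gap_cycles(total_cycles, n_bytes, core_hz, sck_hz) / core_hz * 1e6;
}
```

For the 1000-byte run (270711 cycles at 72 MHz with a 9 MHz SCK) this gives roughly 207 CPU cycles, or about 2.87 us, of gap per byte, which agrees with the gap seen in the capture.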

My current understanding / assumptions

SPI clock is generated only when data is written to SPI->DR
To receive data, the master must transmit dummy bytes
The slave (AT45) only shifts data when clock is present

In HAL_SPI_TransmitReceive(), the while loop:

Polls TXE
Polls RXNE
Checks timeout

while ((hspi->TxXferCount > 0U) || (hspi->RxXferCount > 0U))
{
  /* Check TXE flag */
  if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE)) && (hspi->TxXferCount > 0U) && (txallowed == 1U))
  {
    *(__IO uint8_t *)&hspi->Instance->DR = *((const uint8_t *)hspi->pTxBuffPtr);
    hspi->pTxBuffPtr++;
    hspi->TxXferCount--;
    /* Next Data is a reception (Rx). Tx not allowed */
    txallowed = 0U;
  }

  /* Wait until RXNE flag is reset */
  if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE)) && (hspi->RxXferCount > 0U))
  {
    (*(uint8_t *)hspi->pRxBuffPtr) = hspi->Instance->DR;
    hspi->pRxBuffPtr++;
    hspi->RxXferCount--;
    /* Next Data is a Transmission (Tx). Tx is allowed */
    txallowed = 1U;
  }
  if ((((HAL_GetTick() - tickstart) >=  Timeout) && ((Timeout != HAL_MAX_DELAY))) || (Timeout == 0U))
  {
    hspi->State = HAL_SPI_STATE_READY;
    __HAL_UNLOCK(hspi);
    return HAL_TIMEOUT;
  }
}

These software checks introduce delay between successive writes to SPI->DR

This delay causes gaps in SCK, reducing effective throughput.
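To picture this assumption, here is a simple timing model in C. It is only a sketch, not a measurement of HAL: the function names are hypothetical, the per-byte software cost is assumed constant, and the "overlapped" variant is an imagined loop that does its software work while the current frame is still shifting.

```c
#include <assert.h>
#include <stdint.h>

/* Interlocked loop, as in the excerpt: txallowed blocks the next DR write
   until the previous byte has been read, so frame time and software time
   add up for every byte. All times are in CPU cycles. */
double interlocked_cycles(uint32_t n_bytes, double frame, double sw)
{
    assert(frame > 0.0 && sw >= 0.0);
    return (double)n_bytes * (frame + sw);
}

/* Hypothetical overlapped loop (an assumption, not HAL behaviour): software
   work runs in parallel with shifting, so the slower of the two sets the
   pace after the first frame. */
double overlapped_cycles(uint32_t n_bytes, double frame, double sw)
{
    assert(frame > 0.0 && sw >= 0.0);
    if (n_bytes == 0u) {
        return 0.0;
    }
    double pace = (sw > frame) ? sw : frame;
    return frame + (double)(n_bytes - 1u) * pace;
}
```

With a 64-cycle frame (8 bits at 9 MHz on a 72 MHz core) and a software cost of about 207 cycles per byte (inferred from 270711 cycles for 1000 bytes), the interlocked model reproduces the measured total. Under the same model, SCK would only be continuous if the per-byte software cost dropped below the 64-cycle frame time.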

My questions

Why is the official STM32 HAL SPI API unable to keep the clock running continuously at the configured 9 MHz? If it cannot, why is 9 MHz offered as an option?
Is SPI clock generation strictly tied to writes to SPI->DR?
Is there any way (using HAL) to keep SCK running continuously while receiving data?
Where can I study the internal hardware of the STM32F103RCT6 SPI peripheral in more detail than the reference manual provides, for example a complete logic or circuit diagram showing how the clock is gated?
What standard does the SPI hardware logic follow? Perhaps I could read the relevant SPI standard to understand this behaviour.

Motivation

I understand this might seem like going into excessive detail, but this is purely for learning purposes. I want to understand how SPI really works at the hardware level — not just from an API point of view, but the actual mechanics behind clock generation, data shifting, and timing.

Any insights, corrections, or references would be greatly appreciated.

Thank you for your time.

Andrew Neil · 2026-01-15

@Ozone wrote:
The "S" stands for synchronous.

No, it stands for 'Serial' (although it is also synchronous).


Ozone · 2026-01-15

In this you are right ...

Although inter-chip connections via shift registers already existed before '79.

SPI is still synchronous, which often makes it somewhat difficult for beginners to understand.