
How to make an efficient bit-bang driver on STM32

MCU Support TD
ST Employee

Introduction

This article discusses the theory behind making a bit-bang driver as efficient as possible.

At the bottom of this article, I have attached a single-header bit-bang SPI controller (mode 0) driver.

1. Pros and cons of a bit-bang driver

1.1 Pros

Although STM32 MCUs already embed many peripherals to manage communication over standard interfaces, here are some reasons one might still need a bit-bang driver:

  • Working with legacy or nonstandard protocols: 
    Sometimes you may be placing a modern MCU into a system that uses a legacy protocol its peripherals do not support. Or perhaps the MCU can handle most of the protocols the system needs, but one particular nonstandard protocol is left over. In these cases, bit-banging can be a better option than searching for dedicated hardware.
  • You ran out of interfaces on your MCU:
    You have already used up all the interfaces available on your MCU, and you only need one more. Instead of moving to a different MCU with more interfaces, you can bit-bang it.

1.2 Cons

  • Bit-banging is not efficient in terms of processing power and timing:
    Since bit-banging is typically implemented with software loops and busy-wait delays, it consumes CPU cycles (and therefore power) that a hardware peripheral would not. It is also not easily driven from interrupts, so it generally blocks other functionality while a transfer is in progress. Another downside of bit-banging is that the timing is not perfectly predictable: because it is software based, the exact bandwidth of the protocol is hard to guarantee.

2. Making a bit-bang driver

2.1 Let the preprocessor do as much work as possible

For the best performance, you want to let the preprocessor do as much work as possible. Of course, there is always a trade-off between modularity and efficiency; since this article is concerned with efficiency, we favor it at the cost of modularity.

In my SPI driver, all the GPIO toggling, GPIO reading, and delays are handled through macros. This allows for at least some amount of modularity for the driver, while still being efficient:

/*
 * Configuration Defines
 */
#define LSB_FIRST 0
#define DELAY_CYCLES 1


/*
 *  Hardware Definitions for Pin Setting and Resetting
 *
 *  This driver does not initialize the pins.
 *  The CMSIS device header (for example, "stm32c0xx.h") must be included
 *  for the GPIO register and bit definitions used below.
 */

#define CS_LOW  GPIOB->BSRR = GPIO_BSRR_BR0
#define CS_HIGH GPIOB->BSRR = GPIO_BSRR_BS0

#define MOSI_LOW  GPIOB->BSRR = GPIO_BSRR_BR6
#define MOSI_HIGH GPIOB->BSRR = GPIO_BSRR_BS6

#define SCK_LOW  GPIOA->BSRR = GPIO_BSRR_BR10
#define SCK_HIGH GPIOA->BSRR = GPIO_BSRR_BS10

// Returns nonzero if MISO is set, 0 if not set
#define MISO_SET (GPIOB->IDR & GPIO_IDR_ID7)

#define DELAY for (uint32_t i = 0; i < DELAY_CYCLES; i++) __NOP()
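
Since the driver does not initialize the pins, they must be configured before the first transfer. As a reference, here is a minimal HAL-based setup matching the pin assignments above. This is only a sketch: initialization is not timing critical, so HAL is fine here, and the header name assumes an STM32C0 device.

#include "stm32c0xx_hal.h"  // assumed device HAL header (NUCLEO-C031 boards are used in the tests below)

static void spi_bitbang_gpio_init(void)
{
  GPIO_InitTypeDef gpio = {0};

  __HAL_RCC_GPIOA_CLK_ENABLE();
  __HAL_RCC_GPIOB_CLK_ENABLE();

  // CS (PB0) and MOSI (PB6) as push-pull outputs
  gpio.Pin   = GPIO_PIN_0 | GPIO_PIN_6;
  gpio.Mode  = GPIO_MODE_OUTPUT_PP;
  gpio.Pull  = GPIO_NOPULL;
  gpio.Speed = GPIO_SPEED_FREQ_HIGH;
  HAL_GPIO_Init(GPIOB, &gpio);

  // SCK (PA10) as push-pull output
  gpio.Pin = GPIO_PIN_10;
  HAL_GPIO_Init(GPIOA, &gpio);

  // MISO (PB7) as input
  gpio.Pin  = GPIO_PIN_7;
  gpio.Mode = GPIO_MODE_INPUT;
  HAL_GPIO_Init(GPIOB, &gpio);

  // Idle levels for SPI mode 0: chip select high, clock low
  CS_HIGH;
  SCK_LOW;
}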

The macro definitions above should read and write the GPIOs as fast as possible; it is not recommended to use HAL library calls here. The baud rate is made configurable by letting the user set the number of delay cycles through the DELAY_CYCLES macro. I originally experimented with a runtime approach, where the user could change a static global variable containing the delay count, but it was not as efficient as letting the preprocessor handle it, so I opted for the macro instead.
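
For illustration, the runtime-variable approach would look something like the sketch below (the names are hypothetical, not part of the attached driver). Because the loop bound is no longer a compile-time constant, the compiler cannot unroll or fold the delay loop as aggressively:

// Hypothetical runtime-tunable delay (not used in the attached driver):
// the delay count lives in a variable instead of a compile-time constant.
static volatile uint32_t spi_delay_cycles = 1;
#define DELAY_RUNTIME for (uint32_t i = 0; i < spi_delay_cycles; i++) __NOP()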

You will also notice that I used the preprocessor to configure whether the protocol is LSB first or MSB first. Although using the preprocessor for initialization parameters can be visually cumbersome, it helps avoid the overhead of runtime conditionals in the most critical section of the driver:

static inline uint8_t spi_bitbang_master_tx_rx_byte(uint8_t data) {
  uint8_t rx_data = 0;
#if LSB_FIRST
  for (uint32_t i = 0; i < 8; i++) {
    if (data & (1 << i)) {
#else
  for (int i = 7; i >= 0; i--) {
    if (data & (1 << i)) {
#endif
      MOSI_HIGH;
    } else {
      MOSI_LOW;
    }
    SCK_HIGH;             // rising edge: the target samples MOSI (SPI mode 0)
    DELAY;
    if (MISO_SET)         // read MISO while SCK is high
      rx_data |= 1 << i;
    SCK_LOW;              // falling edge: data lines may change for the next bit
    DELAY;
  }
  return rx_data;
}
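
The controller test below uses a multi-byte helper from the attached driver; in essence it is just a loop over the single-byte function. A minimal sketch, assuming equal-length transmit and receive buffers and chip select handled by the caller:

static inline void spi_bitbang_master_tx_rx_multi(const uint8_t *tx_data,
                                                  uint8_t *rx_data,
                                                  uint32_t len)
{
  // Shift out each byte on MOSI while capturing the byte clocked in on MISO
  for (uint32_t i = 0; i < len; i++) {
    rx_data[i] = spi_bitbang_master_tx_rx_byte(tx_data[i]);
  }
}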

2.2 Taking full advantage of optimizations

The driver is implemented in a single header file. This allows all functions to be declared as static inline and still be used from multiple source files, which keeps things fast by removing function call overhead. However, it increases code size compared to regular functions, since every file that includes the header places a full copy of each function inline where it is called. Since these functions are short and simple, the extra code size is fairly negligible compared to the performance gain, but you can weigh this trade-off for yourself.

The optimization level of the compiler makes a significant difference in the throughput of your bit-banging. For the best performance, you want the highest optimization level enabled ("-O3").

If you find that high optimization levels make other parts of the code hard to debug, you can use compiler attributes to optimize only the bit-bang functions themselves instead of the entire project:

__attribute__((optimize("-O3")))
static inline uint8_t spi_bitbang_master_tx_rx_byte(uint8_t data) {
...
...
}

This helps you retain much of the performance even when the rest of the code is compiled without optimizations, although it is not as fast as building the entire project with "-O3".

3. Performance

These tests were performed using two NUCLEO-C031 boards, with the MCU system clock running at 48 MHz. One board is the controller, using the bit-bang SPI driver, and the other is the target, using the hardware SPI peripheral. The test performs a full-duplex, 256-byte transmit/receive between the two devices. Correct reception of the data was verified on both ends in the debugger.

3.1 Controller device

CS_LOW;
spi_bitbang_master_tx_rx_multi(tx_data, rx_data, 256);
CS_HIGH;

3.2 Target device

HAL_SPI_TransmitReceive(&hspi1, tx_data, rx_data, 256, 0xFFFF);
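
For reference, the target side is an 8-bit, MSB-first, mode-0 slave with hardware NSS; a CubeMX-style configuration sketch along those lines (the exact settings may differ from the attachment) looks like this:

// Assumed target-side configuration: SPI1 as an 8-bit, mode-0 slave with hardware NSS
extern SPI_HandleTypeDef hspi1;

static void target_spi_init(void)
{
  hspi1.Instance            = SPI1;
  hspi1.Init.Mode           = SPI_MODE_SLAVE;
  hspi1.Init.Direction      = SPI_DIRECTION_2LINES;
  hspi1.Init.DataSize       = SPI_DATASIZE_8BIT;
  hspi1.Init.CLKPolarity    = SPI_POLARITY_LOW;   // CPOL = 0
  hspi1.Init.CLKPhase       = SPI_PHASE_1EDGE;    // CPHA = 0 -> SPI mode 0
  hspi1.Init.NSS            = SPI_NSS_HARD_INPUT; // driven by the controller's CS pin
  hspi1.Init.FirstBit       = SPI_FIRSTBIT_MSB;   // matches LSB_FIRST == 0
  hspi1.Init.TIMode         = SPI_TIMODE_DISABLE;
  hspi1.Init.CRCCalculation = SPI_CRCCALCULATION_DISABLE;
  if (HAL_SPI_Init(&hspi1) != HAL_OK) {
    Error_Handler();
  }
}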

The time taken for the transaction is measured from the falling edge to the rising edge of the chip select. It is better to describe the time taken for a chunk of data, since the clock frequency varies from period to period when bit-banging.

4. Results

Function-local optimization | Project optimization | Time for a 256-byte transaction | Average clock frequency
none                        | none                 | 3.479 ms                        | 636 kHz
-O3                         | none                 | 1.538 ms                        | 1.381 MHz
-O3                         | -O3                  | 795 µs                          | 2.915 MHz

Conclusion 

When you are running short on peripherals, or are working with proprietary or legacy protocols, bit-banging is a helpful option to have. Bit-banging has the advantage of flexibility, but it struggles when it comes to performance and efficiency. Because of these drawbacks, it is important that the software behind a bit-bang driver is as efficient as possible and takes full advantage of compiler optimizations.
