This article discusses the theory behind making a bit-bang driver as efficient as possible.
At the bottom of this article, I have attached a bit-bang SPI controller (mode 0) driver contained in a single header file.
Although STM32s already embed many peripherals to manage communication over standard interfaces, there are still cases where you might need a bit-bang driver, such as running short on hardware peripherals or having to support a proprietary or legacy protocol.
For the best performance, you want to let the preprocessor do as much work as possible at compile time. Of course, there is always a trade-off between modularity and efficiency; since this article is concerned with efficiency, we accept the cost in modularity.
In my SPI driver, all the GPIO toggling, GPIO reading, and delays are handled through macros. This allows for at least some amount of modularity for the driver, while still being efficient:
/*
* Configuration Defines
*/
#define LSB_FIRST 0
#define DELAY_CYCLES 1
/*
* Hardware Definitions for Pin Setting and Resetting
*
* This driver does not initialize the pins.
*/
#define CS_LOW GPIOB->BSRR = GPIO_BSRR_BR0
#define CS_HIGH GPIOB->BSRR = GPIO_BSRR_BS0
#define MOSI_LOW GPIOB->BSRR = GPIO_BSRR_BR6
#define MOSI_HIGH GPIOB->BSRR = GPIO_BSRR_BS6
#define SCK_LOW GPIOA->BSRR = GPIO_BSRR_BR10
#define SCK_HIGH GPIOA->BSRR = GPIO_BSRR_BS10
// Returns non-zero if MISO is set, 0 if not set
#define MISO_SET (GPIOB->IDR & GPIO_IDR_ID7)
#define DELAY for (uint32_t i = 0; i < DELAY_CYCLES; i++) __NOP()
Definitions for these macros should read and write the GPIOs as fast as possible; it is not recommended to use HAL library calls here. I allow for a configurable baud rate by letting the user set the number of delay cycles through the DELAY_CYCLES macro. I originally experimented with a runtime variable that let the user change a static global containing the amount of delay, but this was not as efficient as letting the preprocessor handle it, so I opted for the macro instead.
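For contrast, the runtime-configurable delay I moved away from looked roughly like the sketch below (the names here are hypothetical, not from the attached driver). Because the loop bound is no longer a compile-time constant, the compiler cannot unroll or fold the delay away, which is why it lost out to the macro:
// Sketch only: runtime-adjustable delay, names are illustrative
static uint32_t spi_delay_cycles = 1;

static inline void spi_delay(void) {
    // The bound must be re-read at run time, so the loop cannot be fully unrolled
    for (uint32_t i = 0; i < spi_delay_cycles; i++)
        __NOP();
}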
You will also notice that I use the preprocessor to configure whether the protocol is LSB-first or MSB-first. Although using the preprocessor for initialization parameters can be visually cumbersome, it helps avoid the overhead of conditional statements in the most critical section of the driver.
static inline uint8_t spi_bitbang_master_tx_rx_byte(uint8_t data) {
    uint8_t rx_data = 0;
    // The bit order is fixed at compile time; only the loop direction changes.
#if LSB_FIRST
    for (uint32_t i = 0; i < 8; i++) {
        if (data & (1 << i)) {
#else
    for (int i = 7; i >= 0; i--) {
        if (data & (1 << i)) {
#endif
            MOSI_HIGH;
        } else {
            MOSI_LOW;
        }
        SCK_HIGH;
        DELAY;
        // Sample MISO while the clock is high
        if (MISO_SET)
            rx_data |= 1 << i;
        SCK_LOW;
        DELAY;
    }
    return rx_data;
}
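The attached header also provides a multi-byte helper, spi_bitbang_master_tx_rx_multi(), which is used in the benchmark below. A minimal sketch of such a wrapper, assuming chip select is handled by the caller, is simply a loop over the single-byte routine (the exact implementation is in the attachment):
// Sketch: full-duplex transfer of len bytes; CS is driven by the caller
static inline void spi_bitbang_master_tx_rx_multi(const uint8_t *tx_data, uint8_t *rx_data, uint32_t len) {
    for (uint32_t i = 0; i < len; i++)
        rx_data[i] = spi_bitbang_master_tx_rx_byte(tx_data[i]);
}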
The driver is implemented in one header file. This allows all functions to be declared static inline and still be used in multiple source files, which keeps things fast by removing function call overhead. The downside is increased code size compared to regular functions, since every file that includes the header gets its own copy of each function, placed inline where it is called. Because these functions are short and simple, the extra code size is fairly negligible compared to the performance gain, but you can weigh this trade-off for yourself.
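As a rough sketch, the overall shape of such a single-header driver could look like the outline below (file and guard names are placeholders, and the device header depends on your MCU):
#ifndef SPI_BITBANG_H
#define SPI_BITBANG_H

#include <stdint.h>
#include "stm32c0xx.h"   // device header assumed here; use the one for your part

// ... configuration defines and pin macros from above ...

static inline uint8_t spi_bitbang_master_tx_rx_byte(uint8_t data) {
    // ... body as shown above ...
}

// ... any multi-byte helpers, also declared static inline ...

#endif // SPI_BITBANG_H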
The compiler's optimization level makes a significant difference in the throughput of your bit-banging. For the best performance, you want the highest optimization level enabled ("-O3").
If you find that high optimization levels make other parts of the code hard to debug, you can use compiler attributes so that only the bit-bang functions themselves are optimized instead of the entire project:
__attribute__((optimize("-O3")))
static inline uint8_t spi_bitbang_master_tx_rx_byte(uint8_t data) {
...
...
}
This lets you retain most of the performance even when the rest of the code is compiled with no optimizations, although it is not as fast as the code fully compiled with "-O3."
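If you prefer not to repeat the attribute on every function, GCC-based toolchains also accept optimization pragmas around a whole region; something along these lines (assuming GCC) has the same effect for everything between the pragmas:
#pragma GCC push_options
#pragma GCC optimize ("O3")

static inline uint8_t spi_bitbang_master_tx_rx_byte(uint8_t data) {
    ...
}

#pragma GCC pop_options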
These tests were performed using two NUCLEO-C031 boards, with the MCU system clock running at 48 MHz. One board is the controller, using the bit-bang SPI driver; the other is a target, using the hardware SPI peripheral. The test performs a full-duplex, 256-byte transmit/receive between the two devices. Proper reception of the data was verified on both ends in debug.
// Controller side (bit-bang driver):
CS_LOW;
spi_bitbang_master_tx_rx_multi(tx_data, rx_data, 256);
CS_HIGH;
// Target side (hardware SPI peripheral):
HAL_SPI_TransmitReceive(&hspi1, tx_data, rx_data, 256, 0xFFFF);
The time taken for the transaction is measured from the falling edge to the rising edge of the chip select. It is better to describe the time taken for a chunk of data, since the clock frequency varies from period to period when bit-banging:
| Function-local optimization | Project optimization | Time for a 256-byte transaction | Average clock frequency |
| --- | --- | --- | --- |
| none | none | 3.479 ms | 636 kHz |
| -O3 | none | 1.538 ms | 1.381 MHz |
| -O3 | -O3 | 795 µs | 2.915 MHz |
When you are running short on peripherals or are working with proprietary and/or legacy protocols, bit-banging is a helpful option to have. Bit-banging has the advantage of flexibility, but it struggles when it comes to performance and efficiency. Because of these drawbacks, it is important that the software behind a bit-bang driver is as efficient as possible and takes full advantage of compiler optimizations.