STM32L432KC basic DSP loop. Any way to improve this?

JBeal.1 · ‎2023-12-09

I am using Nucleo STM32 L432KC board with STM32CubeIDE v1.12.1 and doing very basic DSP operations with data values read via ADC through DMA into a circular buffer. It works, but the math is slower than expected.

Below is an example loop in my code. All variables are declared type int except for adc_buffer[] which is uint16_t.

    HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_SET);   // GPIO signal flag
    for (int i = idxStart; i < idxEnd; i++) {
        x = adc_buffer[i];
        sum += x;
    }
    HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_RESET); // GPIO signal flag
    // high for 1.05 ms: 1 loop time = 262.5 ns = 21 clocks @ 80 MHz

The for loop in this case runs through 4000 iterations (half of my 8000-element buffer). Based on the GPIO 3 output pulse on my scope, 4000 iterations take 1.05 ms, so one iteration is 262.5 ns. The CPU clock is f = 80 MHz (t = 12.5 ns), so one iteration takes 21 clocks. It looks to me like I'm doing only four operations: an increment and a conditional branch for the loop, fetching a 16-bit number, and adding it to a 32-bit sum. Is there any way to do that in fewer than 21 clocks, or is this the best it can do? I did try removing the explicit intermediate variable x and writing sum += adc_buffer[i]; but the timing did not change.
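The arithmetic above can be written as a small helper, which makes it easy to redo the estimate after changing the clock or the buffer size. This is only a sketch: the function name and parameters are made up for illustration, and the result also includes the small fixed overhead of the two HAL_GPIO_WritePin calls.

```c
#include <stdint.h>

/* Convert a scope-measured pulse width into CPU cycles per loop iteration.
 * pulse_ns   : time the GPIO flag was high, in nanoseconds
 * cpu_hz     : core clock in Hz
 * iterations : number of loop passes inside the pulse
 * (hypothetical helper, not part of HAL) */
uint32_t cycles_per_iteration(uint64_t pulse_ns, uint64_t cpu_hz,
                              uint32_t iterations)
{
    /* total cycles = seconds * Hz, done in integers to avoid rounding */
    uint64_t total_cycles = (pulse_ns * cpu_hz) / 1000000000ULL;
    return (uint32_t)(total_cycles / iterations);
}
```

With the numbers from the post, cycles_per_iteration(1050000, 80000000, 4000) gives 21.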

This device has a 5 MHz ADC. Eventually I want to count analog input peaks above some threshold, and track their amplitude, but it doesn't look like I can do very much math per sample in real time.
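For scale, 80 MHz divided by 5 Msps leaves a budget of 16 CPU clocks per sample. A minimal sketch of the peak counting described above might look like the following. All names here (peak_tracker_t, peak_tracker_feed) are hypothetical, and it assumes a "peak" means a run of consecutive samples above the threshold, with the amplitude taken as the largest sample in that run.

```c
#include <stddef.h>
#include <stdint.h>

/* State is kept between calls so a peak that straddles two
 * half-buffers is still counted once (hypothetical type). */
typedef struct {
    uint32_t count;       /* number of completed peaks            */
    uint16_t last_peak;   /* amplitude of the most recent peak    */
    uint16_t current_max; /* running maximum while above threshold */
    int      above;       /* nonzero while inside a peak          */
} peak_tracker_t;

void peak_tracker_feed(peak_tracker_t *t, const uint16_t *buf, size_t n,
                       uint16_t threshold)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t x = buf[i];
        if (x > threshold) {
            if (!t->above) {            /* rising edge: a new peak starts */
                t->above = 1;
                t->current_max = x;
            } else if (x > t->current_max) {
                t->current_max = x;     /* still climbing */
            }
        } else if (t->above) {          /* falling edge: the peak is complete */
            t->above = 0;
            t->count++;
            t->last_peak = t->current_max;
        }
    }
}
```

The tracker would be zero-initialized once and then fed each half of the circular buffer as it completes. Whether this fits in the per-sample budget depends on the compiler output, so it would need to be timed the same way as the summing loop.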

JBeal.1 · ‎2023-12-09

To answer my own question: I realized I had not even looked at the default compiler optimization flags (https://community.st.com/t5/stm32cubeide-mcus/how-do-i-change-code-optimization/td-p/271208), and I discovered I was using -O0 (no optimization at all). I changed that to -Ofast and got 7 cycles per loop, or exactly 3 times faster. That makes me a lot more hopeful that I can do something useful in the time available.
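If the rest of the project should stay at -O0 for easier debugging, GCC also allows raising the optimization level for a single function. The sketch below assumes the GCC toolchain that STM32CubeIDE uses; sum_samples and HOT_LOOP are made-up names. The GCC manual describes the optimize attribute as intended for debugging rather than production code, so changing the project-wide flag as described above remains the more robust fix.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-function optimization level; expands to nothing on compilers
 * that do not support the GCC attribute (hypothetical macro name). */
#if defined(__GNUC__) && !defined(__clang__)
#define HOT_LOOP __attribute__((optimize("O3")))
#else
#define HOT_LOOP
#endif

/* Sum samples buf[start] .. buf[end-1] into a 32-bit accumulator. */
HOT_LOOP uint32_t sum_samples(const uint16_t *buf, size_t start, size_t end)
{
    uint32_t sum = 0;   /* local accumulator, can live in a register */
    for (size_t i = start; i < end; i++) {
        sum += buf[i];
    }
    return sum;
}
```

Keeping the accumulator local and passing the buffer as a const pointer gives the optimizer the best chance of holding everything in registers.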

TDK · ‎2023-12-09

Just to add:

Not all operations take a single cycle. Most of them take more.
You can view the disassembly to get a good idea of what exactly the processor is doing and how it can be improved (debug, set a breakpoint, then view Disassembly when it's at the breakpoint)
Processing ADC data at 5 Msps in real time is going to be tough regardless. Tracking min/max/avg is definitely possible. Anything more advanced, perhaps not.
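As a rough illustration of the min/max/avg case, a single pass over one half of the buffer could look like this. The type and function names are hypothetical, and the sketch assumes n is greater than zero and small enough that the 32-bit sum cannot overflow (4000 samples of 16 bits each fit comfortably).

```c
#include <stddef.h>
#include <stdint.h>

/* Result of one pass over a block of samples (hypothetical type). */
typedef struct {
    uint16_t min;
    uint16_t max;
    uint32_t sum;   /* average = sum / n, computed by the caller */
} block_stats_t;

block_stats_t block_stats(const uint16_t *buf, size_t n)
{
    block_stats_t s = { UINT16_MAX, 0, 0 };
    for (size_t i = 0; i < n; i++) {
        uint16_t x = buf[i];
        if (x < s.min) s.min = x;
        if (x > s.max) s.max = x;
        s.sum += x;
    }
    return s;
}
```

Viewing the disassembly of a loop like this, as suggested above, shows how many instructions each sample actually costs.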

gregstm · ‎2023-12-10

I got fed up with looping overheads for my FIR routines, so I created an assembly FIR routine that has all the instructions in a long table. At the start of the routine I calculate where I have to jump into the table to perform the required number of "unwrapped" loops. Works well for my applications. Using 16-bit data DSP instructions you can do a lot of processing in a few cycles. I pushed the technique to the limit, creating a FIR filter that uses less than one cycle per tap on average (but it was only for an 8-tap filter). This is what I posted a while ago: https://community.st.com/t5/analog-and-audio/here-you-go-an-fir-filter-for-the-m4-that-uses-less-than-1-cycle/td-p/257615
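The "jump into a table of unrolled instructions" idea can be approximated in plain C with a switch whose cases fall through. This is not the assembly routine from the linked post, and it does not use the packed 16-bit DSP instructions mentioned above. It is only a portable sketch of the control flow, with made-up names and an assumed limit of 8 taps.

```c
#include <stdint.h>

#define FIR_MAX_TAPS 8   /* assumed limit for this sketch */

/* Multiply-accumulate x[0..taps-1] with h[0..taps-1] without a loop.
 * The switch jumps to the entry point for the requested tap count and
 * every case falls through to the next. taps must be 0..FIR_MAX_TAPS;
 * larger values return 0 in this sketch. */
int32_t fir_dot(const int16_t *x, const int16_t *h, unsigned taps)
{
    int32_t acc = 0;
    switch (taps) {
    case 8: acc += (int32_t)x[7] * h[7]; /* fall through */
    case 7: acc += (int32_t)x[6] * h[6]; /* fall through */
    case 6: acc += (int32_t)x[5] * h[5]; /* fall through */
    case 5: acc += (int32_t)x[4] * h[4]; /* fall through */
    case 4: acc += (int32_t)x[3] * h[3]; /* fall through */
    case 3: acc += (int32_t)x[2] * h[2]; /* fall through */
    case 2: acc += (int32_t)x[1] * h[1]; /* fall through */
    case 1: acc += (int32_t)x[0] * h[0]; /* fall through */
    default: break;
    }
    return acc;
}
```

The loop counter, compare and branch disappear entirely, at the price of code size and a fixed maximum tap count. How close the compiler gets to hand-written assembly would have to be checked in the disassembly.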