2023-12-09 09:15 AM
I am using a Nucleo STM32 L432KC board with STM32CubeIDE v1.12.1 and doing very basic DSP operations on data values read from the ADC through DMA into a circular buffer. It works, but the math is slower than expected.
Below is an example loop in my code. All variables are declared type int except for adc_buffer[] which is uint16_t.
HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_SET); // GPIO signal flag
for (int i = idxStart; i < idxEnd; i++) {
x = adc_buffer[i];
sum += x;
}
HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_RESET); // GPIO signal flag
// high for 1.05 msec: 1 loop time = 262.5 ns = 21 clocks @ 80 MHz
The for loop in this case runs through 4000 iterations (half of my 8000-element buffer). Based on the GPIO 3 output pulse on my scope, 4000 iterations take 1.05 ms, so one iteration is 262.5 ns. The CPU clock here is f = 80 MHz (t = 12.5 ns), so one iteration takes 21 clocks. Looks to me like I'm doing only four operations: an increment and conditional branch for the loop, fetching a 16-bit number, and adding it to a 32-bit sum. Is there any way to do that in fewer than 21 clocks, or is this the best it can do? I did try removing the explicit intermediate variable x and writing sum += adc_buffer[i]; but the timing did not change.
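For anyone wanting to repeat the scope arithmetic above, here is a minimal sketch of it in C. The function name cycles_per_iteration is made up for illustration; it only restates the calculation from the post (pulse width divided by loop count, multiplied by the core clock).

```c
#include <stdint.h>

/* Convert a measured GPIO pulse width into CPU clocks per loop iteration.
 * pulse_seconds: width of the GPIO pulse seen on the scope
 * iterations:    number of loop iterations executed inside the pulse
 * cpu_hz:        core clock frequency in Hz
 * (hypothetical helper, not part of HAL or CMSIS) */
static double cycles_per_iteration(double pulse_seconds,
                                   uint32_t iterations,
                                   double cpu_hz)
{
    double seconds_per_iteration = pulse_seconds / (double)iterations;
    return seconds_per_iteration * cpu_hz;
}
```

With the numbers from the post, cycles_per_iteration(1.05e-3, 4000, 80e6) comes out at about 21. The pulse also includes the two HAL_GPIO_WritePin calls, but spread over 4000 iterations that overhead is negligible.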
This device has a 5 MHz ADC. Eventually I want to count analog input peaks above some threshold, and track their amplitude, but it doesn't look like I can do very much math per sample in real time.
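As a rough sketch of the peak counting described above, assuming a "peak" means the highest sample within one continuous run of samples above the threshold. The names peak_stats_t and count_peaks are hypothetical, and a run still open at the end of the buffer is simply counted there; a real half-buffer scheme would probably carry that state into the next call instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Result of one scan (hypothetical type for illustration). */
typedef struct {
    uint32_t count;     /* number of excursions above the threshold */
    uint16_t max_peak;  /* largest peak amplitude seen in this scan */
} peak_stats_t;

/* Count excursions above `threshold` and track the largest peak. */
static peak_stats_t count_peaks(const uint16_t *buf, size_t n,
                                uint16_t threshold)
{
    peak_stats_t s = {0, 0};
    uint16_t current = 0;  /* highest sample of the excursion in progress */
    int above = 0;         /* nonzero while inside an excursion */

    for (size_t i = 0; i < n; i++) {
        uint16_t x = buf[i];
        if (x > threshold) {
            if (!above || x > current) {
                current = x;
            }
            above = 1;
        } else if (above) {
            /* Dropped back below the threshold: the excursion is complete. */
            s.count++;
            if (current > s.max_peak) {
                s.max_peak = current;
            }
            above = 0;
        }
    }
    if (above) {
        /* Assumption: an excursion still open at the buffer end counts here. */
        s.count++;
        if (current > s.max_peak) {
            s.max_peak = current;
        }
    }
    return s;
}
```

This adds a compare and a branch or two per sample on top of the load, so the per-sample cost will be higher than the plain sum loop. At an 80 MHz core clock, a 5 MHz sample rate would leave 16 clocks per sample on average.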
2023-12-09 09:56 AM
To answer my own question: I realized I had not even looked at the default compiler optimization flags (see https://community.st.com/t5/stm32cubeide-mcus/how-do-i-change-code-optimization/td-p/271208), and I discovered I was using -O0 (no optimization at all). I changed that to -Ofast and got 7 cycles per loop, or exactly 3 times faster. That makes me a lot more hopeful that I can do something useful in the time available.
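If changing the flag for the whole project is not wanted (for example, to keep a Debug build easy to step through), one possible alternative is to raise the optimization level for just the hot function. This is a sketch, not what was done above: it relies on GCC's function-level optimize attribute, which the GCC manual describes as intended mainly for debugging rather than production use, so treat it as an experiment. The names HOT_LOOP and sum_u16 are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC-only: request -O3 for one function even if the file is built at -O0.
 * On other compilers the macro expands to nothing. */
#if defined(__GNUC__) && !defined(__clang__)
#define HOT_LOOP __attribute__((optimize("O3")))
#else
#define HOT_LOOP
#endif

/* Sum buf[start] .. buf[end - 1], same shape as the loop in the question. */
HOT_LOOP
static uint32_t sum_u16(const uint16_t *buf, size_t start, size_t end)
{
    uint32_t sum = 0;
    for (size_t i = start; i < end; i++) {
        sum += buf[i];
    }
    return sum;
}
```

Calling it as sum_u16(adc_buffer, idxStart, idxEnd) would replace the inline loop. The cycle count this gives on the L432KC has not been measured here, so it needs checking on the scope in the same way as before.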
2023-12-09 11:40 AM
Just to add:
2023-12-10 01:36 AM
I got fed up with looping overheads for my FIR routines, so I created an assembly FIR routine that has all the instructions in a long table. At the start of the routine I calculate where to jump into the table to perform the required number of "unwrapped" (unrolled) loops. Works well for my applications. Using the 16-bit data DSP instructions you can do a lot of processing in a few cycles. I pushed the technique to the limit, creating a FIR filter that uses less than one cycle per tap on average (but it was only for an 8-tap filter). This is what I posted a while ago: https://community.st.com/t5/analog-and-audio/here-you-go-an-fir-filter-for-the-m4-that-uses-less-than-1-cycle/td-p/257615
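To illustrate the "jump into a table of unrolled instructions" idea in plain C, here is a sketch using switch fall-through. This is not the assembly routine from the linked post and does not use the 16-bit DSP instructions; it only shows the control-flow trick. The name fir_unrolled, the 8-tap maximum, and the sample/coefficient ordering (x[k] paired with h[k]) are assumptions for illustration.

```c
#include <stdint.h>

/* Multiply-accumulate over `taps` sample/coefficient pairs (0..8) with no
 * loop counter: the switch jumps straight to the entry point for `taps`
 * and execution falls through the remaining cases.
 * Any value above 8 hits the default case and returns 0. */
static int32_t fir_unrolled(const int16_t *x, const int16_t *h, unsigned taps)
{
    int32_t acc = 0;

    switch (taps) {
    case 8: acc += (int32_t)x[7] * h[7]; /* fall through */
    case 7: acc += (int32_t)x[6] * h[6]; /* fall through */
    case 6: acc += (int32_t)x[5] * h[5]; /* fall through */
    case 5: acc += (int32_t)x[4] * h[4]; /* fall through */
    case 4: acc += (int32_t)x[3] * h[3]; /* fall through */
    case 3: acc += (int32_t)x[2] * h[2]; /* fall through */
    case 2: acc += (int32_t)x[1] * h[1]; /* fall through */
    case 1: acc += (int32_t)x[0] * h[0]; /* fall through */
    default: break;
    }
    return acc;
}
```

The compiler decides how the switch is lowered (jump table or compare chain), so the overhead is not guaranteed to match a hand-written computed jump in assembly.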