Time consumption FFT calculation. CMSIS, STM32F401RE

felix23
Associate II

Hello,

I calculate an FFT on the STM32F401RE. It works fine so far. Here are the relevant code snippets:

arm_rfft_q31(&S1, (q31_t*)fft_input_buf1, (q31_t*)fft_complex_buf1);		//takes about 1760 µs
 
arm_cmplx_mag_q31((q31_t*)fft_complex_buf1, (q31_t*)fft_amplitude_buf1, (uint32_t)FFT_LENGTH);	// takes about 340 µs.

I use DMA to collect the ADC data (2048 points) and calculate a 1024-point FFT twice per buffer: once after the first half of the buffer is filled, and once after the complete buffer is filled.
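A minimal sketch of the half-buffer scheme described above, written so that it also compiles on a PC. The helper names, the uint16_t sample type, and the 12-bit offset-binary scaling are assumptions for illustration only; the real project may store and scale the ADC data differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FFT_LENGTH 1024u   /* the DMA buffer holds 2 * FFT_LENGTH samples */

/* Which half of the circular DMA buffer has just been filled. */
typedef enum { FIRST_HALF = 0, SECOND_HALF = 1 } adc_half_t;

/* Return a pointer to the FFT_LENGTH samples that DMA has just finished
 * writing, i.e. the half it is NOT filling right now. */
const uint16_t *completed_half(const uint16_t *adc_buf, adc_half_t half)
{
    assert(adc_buf != NULL);
    return adc_buf + ((half == SECOND_HALF) ? FFT_LENGTH : 0u);
}

/* Copy one half into the FFT input buffer as int32_t (q31-style) values.
 * ASSUMPTION: 12-bit unsigned samples centred on 2048, scaled to full range. */
void copy_half_to_fft_input(const uint16_t *src, int32_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < FFT_LENGTH; i++) {
        int32_t centred = (int32_t)(src[i] & 0x0FFFu) - 2048;
        dst[i] = centred * (int32_t)(1L << 20);   /* multiply avoids shifting a negative value */
    }
}
```

In this sketch HAL_ADC_ConvHalfCpltCallback would use FIRST_HALF and HAL_ADC_ConvCpltCallback would use SECOND_HALF, so each FFT always reads the half that DMA is not writing.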

The results look good. However, now that I am trying to optimize the code for performance, things become strange:

  • I measure the elapsed time by setting and resetting a GPIO pin and reading the pulse width on an oscilloscope.
  • I always start measuring the time at the beginning of the HAL_ADC_ConvHalfCpltCallback and the HAL_ADC_ConvCpltCallback.
  • I do the FFT calculation directly in the interrupt handler. The preemption priority is normally set to 15; for the tests, I set it to 0 so that nothing can interrupt the FFT calculation.

My observation:

  • The FFT calculation and data copying... take 2680 µs if triggered from the HAL_ADC_ConvCpltCallback
  • The FFT calculation and data copying... take 3570 µs if triggered from the HAL_ADC_ConvHalfCpltCallback
  • If I omit either of the two time-consuming steps stated above, both code executions take the same time (e.g. 2370 µs for the FFT without calculating the magnitude)
  • If I replace the FFT code shown above with a delay loop like the one below, both code executions take the same time, even with higher or lower delay values.
uint32_t delay = 20000;	// 20000/84MHz * 10=  2380 µs
while (delay > 0){
	delay --;
}
  • I also tried setting up separate variables for the two code executions, so that each runs in its own memory space. The result is the same: the two executions still take different times.
	extern int32_t fft_input_buf1 [FFT_LENGTH];
	extern int32_t fft_complex_buf1[FFT_LENGTH*2];
	extern int32_t fft_amplitude_buf1[FFT_LENGTH*1];
	extern int32_t fft_input_buf2 [FFT_LENGTH];
	extern int32_t fft_complex_buf2[FFT_LENGTH*2];
	extern int32_t fft_amplitude_buf2[FFT_LENGTH*1];
  • The compiler optimization is set to None (-O0). Changing it to speed (-Ofast) gives a similar result: 3090 µs vs 2280 µs.
  • If I trigger the FFT calculation in only one of the callbacks (the other does nothing), the time needed is the same in both cases (2280 µs when optimized for speed, or 2680 µs without optimization)

My conclusions:

  • It does not seem to be interrupt related, as the pure waiting loop behaves exactly the same in both cases. In addition, I don't know which ISR would take 900 µs to execute...
  • If the FFT calculation is called less often, it seems to always take the "short" time.
  • It looks as if the FFT calculation takes more time in some cases than in others. I have no idea why. Is some data being moved around?

Finally:

  • What can I do to get a repeatable execution time for the FFT calculation?
  • What can I do to minimize the time needed?

Thanks a lot for your help,

Cheers

3 REPLIES
ChahinezC
Lead

Hello @felix23​,

I recommend referring to the "Digital signal processing for STM32 microcontrollers using CMSIS" application note (AN4841).

It contains a typical example with explicit results that may help you; please check section 4.2.2.

I also recommend downloading X-CUBE-DSPDEMO; it contains several examples that you can check.

Chahinez.

felix23
Associate II

Hello Chahinez,

Thanks a lot for your hints. I have already read this document and it was helpful. The FFT output is reasonable, and I am happy with the numeric result. I just wonder whether I am doing something wrong, because these strange 900 µs are needed in every second FFT calculation, and only if I calculate both the FFT and the magnitude.

I forgot to mention that I sample the analog data at 102.4 kHz, so half of the ADC buffer is filled every 10 ms. This means that an FFT calculation takes place every 10000 µs. This should not lead to trouble when the calculation of the FFT including the magnitude takes 3570 µs in the worst case, right?
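The arithmetic behind this estimate can be sketched as follows. The function names are hypothetical; the numbers are the ones quoted in this thread (1024 samples per half buffer, 102.4 kHz sample rate, 3570 µs worst case).

```c
#include <assert.h>
#include <stdint.h>

/* Time in microseconds needed to collect one block of samples. */
double block_period_us(uint32_t samples_per_block, double sample_rate_hz)
{
    assert(sample_rate_hz > 0.0);
    return 1e6 * (double)samples_per_block / sample_rate_hz;
}

/* Share of each block period spent on processing, in percent. */
double cpu_load_percent(double processing_us, double period_us)
{
    assert(period_us > 0.0);
    return 100.0 * processing_us / period_us;
}
```

With the values above, block_period_us(1024, 102400.0) gives 10000 µs and cpu_load_percent(3570.0, 10000.0) gives about 35.7 %, so the worst-case FFT fits into one half-buffer period with room to spare, provided nothing else of comparable length has to run in the same window.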

BR

Felix

ChahinezC
Lead

Hello @felix23​,

I suggest the following:

  • When entering the handler, disable interrupts with the "__disable_irq()" function and re-enable them with the "__enable_irq()" function when finishing, to make sure that no other interrupt occurs in between.
  • What is your Flash configuration?

Try disabling/enabling the data cache, instruction cache and prefetch bits of the Flash access control register (FLASH_ACR); please refer to section 3.8.1 of RM0368.

If cache and prefetch are enabled, it is possible that during the half-complete handler the FFT function was not yet loaded in the cache, so the CPU fetched it from Flash. Right afterwards, in the complete handler, the FFT instructions were executed from the cache, which could explain the shorter duration.

I suggest adding a cache reset before executing the FFT functions just to confirm this hypothesis.
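A minimal sketch of such a cache reset, written against a pointer to the register so that it can also be tried on a PC. The bit positions are the ones I read from the FLASH_ACR description in RM0368 and should be checked against your device header; the macro and function names are made up for this example. As far as I understand the reference manual, the reset bits may only be written while the corresponding cache is disabled, which is why the caches are switched off first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: FLASH_ACR bit positions as described in RM0368. */
#define ACR_PRFTEN (1u << 8)    /* prefetch enable          */
#define ACR_ICEN   (1u << 9)    /* instruction cache enable */
#define ACR_DCEN   (1u << 10)   /* data cache enable        */
#define ACR_ICRST  (1u << 11)   /* instruction cache reset  */
#define ACR_DCRST  (1u << 12)   /* data cache reset         */

/* Flush both flash caches and restore the previous enable state.
 * 'acr' points to the FLASH_ACR register (or to a stand-in variable). */
void flash_flush_caches(volatile uint32_t *acr)
{
    assert(acr != NULL);
    uint32_t enabled = *acr & (ACR_ICEN | ACR_DCEN);  /* remember which caches were on */

    *acr &= ~(ACR_ICEN | ACR_DCEN);     /* caches must be off before a reset */
    *acr |= (ACR_ICRST | ACR_DCRST);    /* request the reset                 */
    *acr &= ~(ACR_ICRST | ACR_DCRST);   /* release the reset bits            */
    *acr |= enabled;                    /* re-enable what was enabled before */
}
```

On the target this would be called as flash_flush_caches(&FLASH->ACR) at the start of both callbacks, assuming the usual CMSIS device header is included. If both callbacks then take the "long" time, the cache hypothesis is confirmed.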

  • Try executing the handler, or both handlers, from SRAM.

Please keep me updated on whether any of the suggestions above helped.

Chahinez.