Skip to main content
SA V.1
Associate III
September 24, 2024
Question

Time Delay in FFT performance

  • September 24, 2024
  • 4 replies
  • 4417 views

Hello community,
The CMSIS DSP library supports a 4096-point FFT, but my project requires a 16384-point FFT. I added some extra library files (downloaded from GitHub; these files are written in C) to my project and successfully
performed a 16384-point FFT. However, the issue is that performing the FFT takes too much time, with a delay of around 27 to 30 seconds. Does anyone have a solution to make this more efficient and reduce the time?

 

4 replies

AScha.3
Super User
September 24, 2024

Hi,

16k FFT , is about N*log(N) butterfly ops -> about 230k ; assuming the cpu can do 4Mio /s , your FFT should run in about 100ms or so, not 30 s.

What we talk about : data/FFT in place? only real data ? fixed 16b or float or double data/FFT ?

 

To get best speed, you have to use (for ARM ) optimized FFT, CMSIS DSP library  should be this exactly.

Then your data : in RAM, all caches ON. (I+D)

And code with optimizer on -O2 or -Ofast .

"If you feel a post has answered your question, please click ""Accept as Solution""."
SA V.1
SA V.1Author
Associate III
September 24, 2024

16k FFT , is about N*log(N) butterfly ops -> about 230k ; assuming the cpu can do 4Mio /s , your FFT should run in about 100ms or so, not 30 s.  ---> i didn't understand this calculation ?  

FFT input is float, am using ARM(STM32H745xx muc CortexM7),CMSIS DSP library limit is 4096 points my project requrement 16384 points .

 

Andrew Neil
Super User
September 24, 2024

@SA V.1 wrote:

FFT input is float,


Beware that standard C libraries often use double - does the H7's floating point unit support double ?

If not, the calculations will be done in software...

 


@SA V.1 wrote:

 

I added some extra library files (downloaded from GitHub...


So give a link.

Are those libraries intended for embedded microcontrollers?

A complex system that works is invariably found to have evolved from a simple system that worked.A complex system designed from scratch never works and cannot be patched up to make it work.
SA V.1
SA V.1Author
Associate III
September 24, 2024

H7's floating point unit support double ?

Yes  --->The Arm® Cortex®-M7 with double-precision FPU processor is the latest generation of Arm processors for embedded systems.

I add these library files and peforming the FFT and got the result also but the problen is taking too much time for this any solution ? 

 link--->https://github.com/Treeed/Long_FFTs_for_CMSIS_DSP/tree/master

TDK
September 24, 2024
"If you feel a post has answered your question, please click ""Accept as Solution""."
SA V.1
SA V.1Author
Associate III
October 3, 2024

In the CMSIS DSP library, a 4K FFT takes 1.6 seconds to execute. When I enable the I+D cache, it takes 440 ms. However, when I add extra DSP library files to perform a 16,384-point FFT, it takes 27 seconds to execute. With the I+D cache enabled, it takes 7 seconds. Is there Any Solution ???

Andrew Neil
Super User
October 3, 2024

So, with cache:

  •  4,096 points takes 440 ms;
  • 16,384 points takes 7s

without cache

  •  4,096 points takes 1.6s;
  • 16,384 points takes 27s

In both cases, the difference is a factor of 16:

16K points is 4 times as many as 4K points;

4 squared is 16.

Is it a surprise that multiplying the number of points by X multiplies the execution time by X squared ?

 

A complex system that works is invariably found to have evolved from a simple system that worked.A complex system designed from scratch never works and cannot be patched up to make it work.