cancel
Showing results for 
Search instead for 
Did you mean: 

hand optimized FFT/IFFT for Cortex-M3 attached

imellen
Associate II
Posted on July 30, 2009 at 14:52

hand optimized FFT/IFFT for Cortex-M3 attached

19 REPLIES 19
lanchon
Associate II
Posted on May 17, 2011 at 12:27

hi, thanks for the contribution.

the case of flash latency 1 with coeffs in ram is benchmarked with LATENCY2 defined. why? is this the fastest setting?

the comments regarding version 2.0 are a leftover from a previous project or something like that?

imellen
Associate II
Posted on May 17, 2011 at 12:27

I could not find any reasonable fast FFT and IFFT routine for STM32 so I wrote my own in highly optimized assembly. FFT runs at around 70 - 75% of speed when compared with STR910 implementation(at same clock frequency). It requires roughly same amount of cycles as dsPIC optimized FFT. I estimate it is almost twice so fast as ARM7 FFT.

Enjoy :D

Some highlights:

- Radix 4 FFT /IFFT - supported sizes: N=4,16,64,256,1024,4096

- N>4096 possible with custom coefficients

- 16 bit complex arithmetic, 1Q15 coefficients

- input data remains unmodified

- decimation-in-time with auto scale after each stage - no overflow

- GCC version (Code Sourcery G++ Lite 2007q3-53), requires C preprocessor

- hand optimized THUMB2 assembly for Cortex-M3 (tested with STM32)

- code optimized with 64 bit flash prefetch and branch speculation in mind

- single function for multiple FFT sizes

- functions are both ''C'' and assembly callable

STM32 FFT benchmarks in CPU cycles based on real hardware measurements:

N - FFT size

L - Flash latency

F,R - coefficients in Flash or RAM

....N....L=0 F/R....L=1 F....L=1 r*....L=2 F*....L=2 r*...ARM7TDMI...ARM9E...dsPIC

....64.....3575......3797......3636.......4588......4007..........-............2701....3739

...256....19425.....20685....19743....25144.....21685..38461-43920..13740..19055

.1024....98541....105113...100074...128070....109634.......-...........68534....N/A.

Notes:ARM9E - DSP enhanced arm9 core, based on STR910 @96 MHz, RAM coefficients

dsPIC - dsPIC30x / dsPIC33x - results based on Microchip's DSP library

ARM7TDMI - 3rd party, web based benchmark results

*STM32 results for latency L=2 or - source compiled with LATENCY2 defined

IFFT benchmarks (calculated and verified by measurement):

add 6+N/4 to FFT benchmark {22,70,262} for N={64,256,1024}

Code size:

FFT only: 480 bytes

FFT+IFFT: 682 bytes

Coefficient size (Flash or RAM):

N: 16 64 256 1024 4096

size: 48 240 1008 4080 16368 bytes

ivan239955_stm1_st
Associate II
Posted on May 17, 2011 at 12:27

>the case of flash latency 1 with coeffs in ram is benchmarked with LATENCY2 defined. why? is this the fastest setting?

Exactly. Speed improvement is around 1% for this case (Latency1, ram coeff), nevertheless it is faster. Optimizing for flash latency above 0 is tricky, there are always trade-offs (e.g. wider form of instruction aligns branch target below to 64 prefetch boundary, but cost is extra pre-fetch cycle in certain conditions)

>the comments regarding version 2.0 are a leftover from a previous project or something like that?

Version 2.0 is not finished yet, other projects have higher priority. I'll post it when its finished and fully tested.

Ivan

zhongwei
Associate
Posted on May 17, 2011 at 12:27

Wonderful efforts! I'm waiting for Version 2.0. Let me know when it is ready.

Thanks!

cbargagli9
Associate II
Posted on May 17, 2011 at 12:27

Hi everybody,

I'm interested in developing a FFT application, and recently i got a IAR STM32-SK developement board.

I'm quite young with this kind of family, so my question is: has anyone got an example project so that I can ''move on it'' doing my tests?

Thanks in advance,

Claudio

bargaclau@libero.it

cvrc
Associate
Posted on May 17, 2011 at 12:27

Hi Ivan,

I see your complex FFT for STM32. I'm quite young with this and I have four questions:

1. In which order must be data in input sequence, normal or digit-order?

2. In which order iz data in output sequence, normal or digit-inverse odred?

3. If the input stream is real what I must done to get good fft result?

4. When Version 2.0 will be finished?

my email:

mailto:cvrc@ptt.rs

Best regards,

Stojan

imellen
Associate II
Posted on May 17, 2011 at 12:27

Quick answers:

>In which order must be data in input sequence, normal or digit-order?

input is in normal ordel e.g. [r0 i0 r1 i1 r2 i2... ] where r is real part, i is imaginary part

>In which order iz data in output sequence, normal or digit-inverse odred?

output is in normal order (bit reversal performed in asm routine)

> If the input stream is real what I must done to get good fft result?

a) keep imaginary part 0 (inefficient)

b) calculate 2 real ffts of size N in the same time (real part = real input1; imag part= real input2), postprocessing required to separate outputs

c) calculate real FFT with size 2N (similar to b) )

d) for processing in the frequency domain (real input, real filter) no postprocessing required e.g. real Left channel, real Right channel, real Filter impulse response

FFT1: real part = filter impulse response, imag part =0; calaculate once as filter is constant; output - complex F1 sequence

FFT2: real part = Left channel imag part =Right channel; output complex F2 sequence

F3 = F1*F2 (N complex multiplies)

z=IFFT(F3) ; complex sequence where real is Left channel after frequency domain processing; imag is Right channel afte freq. dom. processing

left and righ channel independent.

Find some papers on FFT theory and convolution in the frequency domain as there are aliasing issues.

> When Version 2.0 will be finished?

I've finished optimized radix 2 FFT (size 8,32,126,512,2048) in June, I'll post it when I find time to clean the code 🙂

for more theory and fft of real sequences check: www.engineeringproductivitytools.com/stuff/T0001/index.html

Ivan

16-32micros
Associate III
Posted on May 17, 2011 at 12:27

Dear Ivan,

Quote:

> When Version 2.0 will be finished?

I've finished optimized radix 2 FFT (size 8,32,126,512,2048) in June, I'll post it when I find time to clean the code

Is it possible to provide just the FFT Benchmarks cycles you get with Radix-2 (size 8,32,126,512,2048) as you did with Radix-4 ?

All Forums users Thank you a lot for your FFT development for STM32 🙂

Cheers,

STOne-32.

sword_82
Associate II
Posted on May 17, 2011 at 12:27

Hi,

Good work! But I have one question, I tried to use this code with the RIDE7, while it'is using the GNU compiler, but I have met some errors. What toolchain have you used? and how I can use the C preprocessor with an assembly file.

Thanks to all

Regards

sword_82