hand optimized FFT/IFFT for Cortex-M3 attached

imellen · ‎2009-07-30

Posted on July 30, 2009 at 14:52

lanchon · ‎2011-05-17

Posted on May 17, 2011 at 12:27

hi, thanks for the contribution.

the case of flash latency 1 with coeffs in ram is benchmarked with LATENCY2 defined. why? is this the fastest setting?

the comments regarding version 2.0 are a leftover from a previous project or something like that?

imellen · ‎2011-05-17

Posted on May 17, 2011 at 12:27

I could not find any reasonable fast FFT and IFFT routine for STM32 so I wrote my own in highly optimized assembly. FFT runs at around 70 - 75% of speed when compared with STR910 implementation(at same clock frequency). It requires roughly same amount of cycles as dsPIC optimized FFT. I estimate it is almost twice so fast as ARM7 FFT.

Enjoy :D

Some highlights:

- Radix 4 FFT /IFFT - supported sizes: N=4,16,64,256,1024,4096

- N>4096 possible with custom coefficients

- 16 bit complex arithmetic, 1Q15 coefficients

- input data remains unmodified

- decimation-in-time with auto scale after each stage - no overflow

- GCC version (Code Sourcery G++ Lite 2007q3-53), requires C preprocessor

- hand optimized THUMB2 assembly for Cortex-M3 (tested with STM32)

- code optimized with 64 bit flash prefetch and branch speculation in mind

- single function for multiple FFT sizes

- functions are both ''C'' and assembly callable

STM32 FFT benchmarks in CPU cycles based on real hardware measurements:

N - FFT size

L - Flash latency

F,R - coefficients in Flash or RAM

....N....L=0 F/R....L=1 F....L=1 r*....L=2 F*....L=2 r*...ARM7TDMI...ARM9E...dsPIC

....64.....3575......3797......3636.......4588......4007..........-............2701....3739

...256....19425.....20685....19743....25144.....21685..38461-43920..13740..19055

.1024....98541....105113...100074...128070....109634.......-...........68534....N/A.

Notes:ARM9E - DSP enhanced arm9 core, based on STR910 @96 MHz, RAM coefficients

dsPIC - dsPIC30x / dsPIC33x - results based on Microchip's DSP library

ARM7TDMI - 3rd party, web based benchmark results

*STM32 results for latency L=2 or - source compiled with LATENCY2 defined

IFFT benchmarks (calculated and verified by measurement):

add 6+N/4 to FFT benchmark {22,70,262} for N={64,256,1024}

Code size:

FFT only: 480 bytes

FFT+IFFT: 682 bytes

Coefficient size (Flash or RAM):

N: 16 64 256 1024 4096

size: 48 240 1008 4080 16368 bytes

ivan239955_stm1_st · ‎2011-05-17

Posted on May 17, 2011 at 12:27

>the case of flash latency 1 with coeffs in ram is benchmarked with LATENCY2 defined. why? is this the fastest setting?

Exactly. Speed improvement is around 1% for this case (Latency1, ram coeff), nevertheless it is faster. Optimizing for flash latency above 0 is tricky, there are always trade-offs (e.g. wider form of instruction aligns branch target below to 64 prefetch boundary, but cost is extra pre-fetch cycle in certain conditions)

>the comments regarding version 2.0 are a leftover from a previous project or something like that?

Version 2.0 is not finished yet, other projects have higher priority. I'll post it when its finished and fully tested.

Ivan

zhongwei · ‎2011-05-17

Posted on May 17, 2011 at 12:27

Wonderful efforts! I'm waiting for Version 2.0. Let me know when it is ready.

Thanks!

cbargagli9 · ‎2011-05-17

Posted on May 17, 2011 at 12:27

Hi everybody,

I'm interested in developing a FFT application, and recently i got a IAR STM32-SK developement board.

I'm quite young with this kind of family, so my question is: has anyone got an example project so that I can ''move on it'' doing my tests?

Thanks in advance,

Claudio

bargaclau@libero.it

cvrc · ‎2011-05-17

Posted on May 17, 2011 at 12:27

Hi Ivan,

I see your complex FFT for STM32. I'm quite young with this and I have four questions:

1. In which order must be data in input sequence, normal or digit-order?

2. In which order iz data in output sequence, normal or digit-inverse odred?

3. If the input stream is real what I must done to get good fft result?

4. When Version 2.0 will be finished?

my email:

mailto:cvrc@ptt.rs

Best regards,

Stojan

imellen · ‎2011-05-17

Posted on May 17, 2011 at 12:27

Quick answers:

>In which order must be data in input sequence, normal or digit-order?

input is in normal ordel e.g. [r0 i0 r1 i1 r2 i2... ] where r is real part, i is imaginary part

>In which order iz data in output sequence, normal or digit-inverse odred?

output is in normal order (bit reversal performed in asm routine)

> If the input stream is real what I must done to get good fft result?

a) keep imaginary part 0 (inefficient)

b) calculate 2 real ffts of size N in the same time (real part = real input1; imag part= real input2), postprocessing required to separate outputs

c) calculate real FFT with size 2N (similar to b) )

d) for processing in the frequency domain (real input, real filter) no postprocessing required e.g. real Left channel, real Right channel, real Filter impulse response

FFT1: real part = filter impulse response, imag part =0; calaculate once as filter is constant; output - complex F1 sequence

FFT2: real part = Left channel imag part =Right channel; output complex F2 sequence

F3 = F1*F2 (N complex multiplies)

z=IFFT(F3) ; complex sequence where real is Left channel after frequency domain processing; imag is Right channel afte freq. dom. processing

left and righ channel independent.

Find some papers on FFT theory and convolution in the frequency domain as there are aliasing issues.

> When Version 2.0 will be finished?

I've finished optimized radix 2 FFT (size 8,32,126,512,2048) in June, I'll post it when I find time to clean the code :)

for more theory and fft of real sequences check: www.engineeringproductivitytools.com/stuff/T0001/index.html

Ivan

16-32micros · ‎2011-05-17

Posted on May 17, 2011 at 12:27

Dear Ivan,

Quote:

> When Version 2.0 will be finished?

I've finished optimized radix 2 FFT (size 8,32,126,512,2048) in June, I'll post it when I find time to clean the code

Is it possible to provide just the FFT Benchmarks cycles you get with Radix-2 (size 8,32,126,512,2048) as you did with Radix-4 ?

All Forums users Thank you a lot for your FFT development for STM32 :)

Cheers,

STOne-32.

sword_82 · ‎2011-05-17

Posted on May 17, 2011 at 12:27

Hi,

Good work! But I have one question, I tried to use this code with the RIDE7, while it'is using the GNU compiler, but I have met some errors. What toolchain have you used? and how I can use the C preprocessor with an assembly file.

Thanks to all

Regards

sword_82