2009-07-30 05:52 AM
hand optimized FFT/IFFT for Cortex-M3 attached
2011-05-17 03:27 AM
hi, thanks for the contribution.
the case of flash latency 1 with coeffs in ram is benchmarked with LATENCY2 defined. why? is this the fastest setting? the comments regarding version 2.0 are a leftover from a previous project or something like that?2011-05-17 03:27 AM
I could not find any reasonable fast FFT and IFFT routine for STM32 so I wrote my own in highly optimized assembly. FFT runs at around 70 - 75% of speed when compared with STR910 implementation(at same clock frequency). It requires roughly same amount of cycles as dsPIC optimized FFT. I estimate it is almost twice so fast as ARM7 FFT.
Enjoy :D Some highlights: - Radix 4 FFT /IFFT - supported sizes: N=4,16,64,256,1024,4096 - N>4096 possible with custom coefficients - 16 bit complex arithmetic, 1Q15 coefficients - input data remains unmodified - decimation-in-time with auto scale after each stage - no overflow - GCC version (Code Sourcery G++ Lite 2007q3-53), requires C preprocessor - hand optimized THUMB2 assembly for Cortex-M3 (tested with STM32) - code optimized with 64 bit flash prefetch and branch speculation in mind - single function for multiple FFT sizes - functions are both ''C'' and assembly callable STM32 FFT benchmarks in CPU cycles based on real hardware measurements: N - FFT size L - Flash latency F,R - coefficients in Flash or RAM ....N....L=0 F/R....L=1 F....L=1 r*....L=2 F*....L=2 r*...ARM7TDMI...ARM9E...dsPIC ....64.....3575......3797......3636.......4588......4007..........-............2701....3739 ...256....19425.....20685....19743....25144.....21685..38461-43920..13740..19055 .1024....98541....105113...100074...128070....109634.......-...........68534....N/A. Notes:ARM9E - DSP enhanced arm9 core, based on STR910 @96 MHz, RAM coefficients dsPIC - dsPIC30x / dsPIC33x - results based on Microchip's DSP library ARM7TDMI - 3rd party, web based benchmark results *STM32 results for latency L=2 or - source compiled with LATENCY2 defined IFFT benchmarks (calculated and verified by measurement): add 6+N/4 to FFT benchmark {22,70,262} for N={64,256,1024} Code size: FFT only: 480 bytes FFT+IFFT: 682 bytes Coefficient size (Flash or RAM): N: 16 64 256 1024 4096 size: 48 240 1008 4080 16368 bytes2011-05-17 03:27 AM
>the case of flash latency 1 with coeffs in ram is benchmarked with LATENCY2 defined. why? is this the fastest setting?
Exactly. Speed improvement is around 1% for this case (Latency1, ram coeff), nevertheless it is faster. Optimizing for flash latency above 0 is tricky, there are always trade-offs (e.g. wider form of instruction aligns branch target below to 64 prefetch boundary, but cost is extra pre-fetch cycle in certain conditions) >the comments regarding version 2.0 are a leftover from a previous project or something like that? Version 2.0 is not finished yet, other projects have higher priority. I'll post it when its finished and fully tested. Ivan2011-05-17 03:27 AM
Wonderful efforts! I'm waiting for Version 2.0. Let me know when it is ready.
Thanks!2011-05-17 03:27 AM
Hi everybody,
I'm interested in developing a FFT application, and recently i got a IAR STM32-SK developement board. I'm quite young with this kind of family, so my question is: has anyone got an example project so that I can ''move on it'' doing my tests? Thanks in advance, Claudio bargaclau@libero.it2011-05-17 03:27 AM
Hi Ivan,
I see your complex FFT for STM32. I'm quite young with this and I have four questions: 1. In which order must be data in input sequence, normal or digit-order? 2. In which order iz data in output sequence, normal or digit-inverse odred? 3. If the input stream is real what I must done to get good fft result? 4. When Version 2.0 will be finished? my email:mailto:cvrc@ptt.rs
Best regards, Stojan2011-05-17 03:27 AM
Quick answers:
>In which order must be data in input sequence, normal or digit-order? input is in normal ordel e.g. [r0 i0 r1 i1 r2 i2... ] where r is real part, i is imaginary part >In which order iz data in output sequence, normal or digit-inverse odred? output is in normal order (bit reversal performed in asm routine) > If the input stream is real what I must done to get good fft result? a) keep imaginary part 0 (inefficient) b) calculate 2 real ffts of size N in the same time (real part = real input1; imag part= real input2), postprocessing required to separate outputs c) calculate real FFT with size 2N (similar to b) ) d) for processing in the frequency domain (real input, real filter) no postprocessing required e.g. real Left channel, real Right channel, real Filter impulse response FFT1: real part = filter impulse response, imag part =0; calaculate once as filter is constant; output - complex F1 sequence FFT2: real part = Left channel imag part =Right channel; output complex F2 sequence F3 = F1*F2 (N complex multiplies) z=IFFT(F3) ; complex sequence where real is Left channel after frequency domain processing; imag is Right channel afte freq. dom. processing left and righ channel independent. Find some papers on FFT theory and convolution in the frequency domain as there are aliasing issues. > When Version 2.0 will be finished? I've finished optimized radix 2 FFT (size 8,32,126,512,2048) in June, I'll post it when I find time to clean the code :) for more theory and fft of real sequences check: www.engineeringproductivitytools.com/stuff/T0001/index.html Ivan2011-05-17 03:27 AM
Dear Ivan,
Quote:
> When Version 2.0 will be finished?
I've finished optimized radix 2 FFT (size 8,32,126,512,2048) in June, I'll post it when I find time to clean the code Is it possible to provide just the FFT Benchmarks cycles you get with Radix-2 (size 8,32,126,512,2048) as you did with Radix-4 ? All Forums users Thank you a lot for your FFT development for STM32 :) Cheers, STOne-32.2011-05-17 03:27 AM
Hi,
Good work! But I have one question, I tried to use this code with the RIDE7, while it'is using the GNU compiler, but I have met some errors. What toolchain have you used? and how I can use the C preprocessor with an assembly file. Thanks to all Regards sword_82