I could not find any reasonable fast FFT and IFFT routine for STM32 so I wrote my own in highly optimized assembly. FFT runs at around 70 - 75% of speed when compared with STR910 implementation(at same clock frequency). It requires roughly same amount of cycles as dsPIC optimized FFT. I estimate it is almost twice so fast as ARM7 FFT.
Enjoy :-D
Some highlights:
- Radix 4 FFT /IFFT - supported sizes: N=4,16,64,256,1024,4096
- N>4096 possible with custom coefficients
- 16 bit complex arithmetic, 1Q15 coefficients
- input data remains unmodified
- decimation-in-time with auto scale after each stage - no overflow
- GCC version (Code Sourcery G++ Lite 2007q3-53), requires C preprocessor
- hand optimized THUMB2 assembly for Cortex-M3 (tested with STM32)
- code optimized with 64 bit flash prefetch and branch speculation in mind
- single function for multiple FFT sizes
- functions are both "C" and assembly callable
STM32 FFT benchmarks in CPU cycles based on real hardware measurements:
N - FFT size
L - Flash latency
F,R - coefficients in Flash or RAM
Notes:ARM9E - DSP enhanced arm9 core, based on STR910 @96 MHz, RAM coefficients
dsPIC - dsPIC30x / dsPIC33x - results based on Microchip's DSP library
ARM7TDMI - 3rd party, web based benchmark results
*STM32 results for latency L=2 or - source compiled with LATENCY2 defined
IFFT benchmarks (calculated and verified by measurement):
add 6+N/4 to FFT benchmark {22,70,262} for N={64,256,1024}
>the case of flash latency 1 with coeffs in ram is benchmarked with LATENCY2 defined. why? is this the fastest setting?
Exactly. Speed improvement is around 1% for this case (Latency1, ram coeff), nevertheless it is faster. Optimizing for flash latency above 0 is tricky, there are always trade-offs (e.g. wider form of instruction aligns branch target below to 64 prefetch boundary, but cost is extra pre-fetch cycle in certain conditions)
>the comments regarding version 2.0 are a leftover from a previous project or something like that?
Version 2.0 is not finished yet, other projects have higher priority. I'll post it when its finished and fully tested.
I see your complex FFT for STM32. I'm quite young with this and I have four questions:
1. In which order must be data in input sequence, normal or digit-order?
2. In which order iz data in output sequence, normal or digit-inverse odred?
3. If the input stream is real what I must done to get good fft result?
4. When Version 2.0 will be finished?
Hi everybody,
I'm interested in developing a FFT application, and recently i got a IAR STM32-SK developement board.
I'm quite young with this kind of family, so my question is: has anyone got an example project so that I can "move on it" doing my tests?
Quick answers:
>In which order must be data in input sequence, normal or digit-order?
input is in normal ordel e.g. [r0 i0 r1 i1 r2 i2... ] where r is real part, i is imaginary part
>In which order iz data in output sequence, normal or digit-inverse odred?
output is in normal order (bit reversal performed in asm routine)
> If the input stream is real what I must done to get good fft result?
a) keep imaginary part 0 (inefficient)
b) calculate 2 real ffts of size N in the same time (real part = real input1; imag part= real input2), postprocessing required to separate outputs
c) calculate real FFT with size 2N (similar to b) )
d) for processing in the frequency domain (real input, real filter) no postprocessing required e.g. real Left channel, real Right channel, real Filter impulse response
FFT1: real part = filter impulse response, imag part =0; calaculate once as filter is constant; output - complex F1 sequence
FFT2: real part = Left channel imag part =Right channel; output complex F2 sequence
F3 = F1*F2 (N complex multiplies)
z=IFFT(F3) ; complex sequence where real is Left channel after frequency domain processing; imag is Right channel afte freq. dom. processing
left and righ channel independent.
Find some papers on FFT theory and convolution in the frequency domain as there are aliasing issues.
> When Version 2.0 will be finished?
I've finished optimized radix 2 FFT (size 8,32,126,512,2048) in June, I'll post it when I find time to clean the code :-)
for more theory and fft of real sequences check: www.engineeringproductivitytools.com/stuff/T0001/index.html
> When Version 2.0 will be finished?
I've finished optimized radix 2 FFT (size 8,32,126,512,2048) in June, I'll post it when I find time to clean the code
Is it possible to provide just the FFT Benchmarks cycles you get with Radix-2 (size 8,32,126,512,2048) as you did with Radix-4 ?
All Forums users Thank you a lot for your FFT development for STM32 :-)
I was using Crossworks for ARM. I did not have time to try RIDE7 to compile it, but it should not be big problem as tis is stand-alone assembler file. What errors were generated by RIDE7?
Good work! But I have one question, I tried to use this code with the RIDE7, while it'is using the GNU compiler, but I have met some errors. What toolchain have you used? and how I can use the C preprocessor with an assembly file.
Thanks to all
The RIDE7 (frome Raisonance, and using GNU compiler) doesn't supporting C preprocessor for the assembly code. So all the "#define"s are not supported! I think that I will replace every defined variable directly by its corresponding register ?!
Thanks for your reply and for the good work!
I really appreciate your fast answer. I have tried to open your web page at embedded Signals dot com, Can I contact you on your professional E-mail ?
I would like to ask you if you are planning or already have implemented the 32-bits data inputs as well. If yes, I would be very grateful if you can give me a rough estimation about Radix-4 and Radix-2 implementation for 512 and 1024 points where coefficients are in RAM.
Hi STOne-32,
sure, if you want to contact me, my email is in posted FFTCM3.s file.
Regarding 32 bit FFT version, I do not have it. In principle it is not that difficult to convert existing 16 bit version, as variables in registers are 32 bits already, only multiply, load/store and coefficients has to be upgraded to 32 bits. So far I needed only 16 bit precision, so I was not motivated to write 32 bit version.
Source code for complex 16 bit Radix 2 FFT (odd powers of 2) for Cortex-M3; N= 8, 32, 128, 512 and 2048 points
Hi all,
I'm getting emails regarding availability of Radix 2 FFT, so I did minor code cleaning and got it ready for posting. Please see attachment for source code.
Enjoy, if you notice some problems let me know.
the case of flash latency 1 with coeffs in ram is benchmarked with LATENCY2 defined. why? is this the fastest setting?
the comments regarding version 2.0 are a leftover from a previous project or something like that?