2015-09-03 07:17 AM
Any interest in a block FIR routine (16 bit data) that can execute (on average) 1 tap in less than 1 cycle?
The catch: It is an 8 tap filter (result not rounded).
The filter has low start and end overhead and can be easily cascaded. Calculating coefficients for cascaded sections is a bit of a mystery though. Data block size has to be a multiple of 8.
When I finally get my hands on some STM32L4 hardware I can verify/optimise the load/store cycles and tidy up the code.
Not sure how useful this thing is, but it was certainly good assembly practice.
2015-09-23 02:46 AM
Tested and tweaked with actual STM32L4 hardware now. Result: 15 cycles for 16 taps = 0.9375 cycles per tap. So the M4 can be faster than a standard DSP - at least for an 8 tap filter. Will keep researching into how to cascade these things.
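
For anyone who wants something to check an optimised routine against, here is a minimal portable C sketch of an 8 tap block FIR on 16-bit data. It is not the assembly routine discussed above, and it makes no attempt at the sub-cycle-per-tap speed. The names `fir8_block` and `cascade_response`, the 64-bit unrounded output, and the layout of the 7-sample history buffer are all assumptions made for illustration. The second function shows one known fact about cascading: two FIR sections in series have an overall impulse response equal to the convolution of their coefficient sets (15 taps for two 8 tap sections), ignoring any rounding between the sections.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NTAPS 8

/* Reference 8 tap block FIR: y[i] = sum over k of h[k] * x[i-k].
 * state[] (hypothetical layout) holds the 7 inputs that preceded this
 * block, oldest first, so state[6] is x[-1]. Output is the full-width,
 * unrounded accumulator. n is assumed to be a non-zero multiple of 8. */
void fir8_block(const int16_t h[NTAPS], int16_t state[NTAPS - 1],
                const int16_t *x, int64_t *y, size_t n)
{
    assert(n >= NTAPS && n % NTAPS == 0);

    for (size_t i = 0; i < n; i++) {
        int64_t acc = 0;
        for (size_t k = 0; k < NTAPS; k++) {
            /* Take x[i-k] from this block, or from history if i < k. */
            int16_t s = (i >= k) ? x[i - k] : state[(NTAPS - 1) + i - k];
            acc += (int64_t)h[k] * s;
        }
        y[i] = acc;
    }

    /* Keep the last 7 inputs as history for the next block. */
    for (size_t j = 0; j < NTAPS - 1; j++)
        state[j] = x[n - (NTAPS - 1) + j];
}

/* Overall impulse response of two 8 tap sections in series:
 * c = a convolved with b, which has 2*8 - 1 = 15 coefficients. */
void cascade_response(const int16_t a[NTAPS], const int16_t b[NTAPS],
                      int32_t c[2 * NTAPS - 1])
{
    for (size_t i = 0; i < 2 * NTAPS - 1; i++)
        c[i] = 0;
    for (size_t i = 0; i < NTAPS; i++)
        for (size_t j = 0; j < NTAPS; j++)
            c[i + j] += (int32_t)a[i] * b[j];
}
```

Feeding an impulse through `fir8_block` should return the coefficients in order, which is a quick sanity check before comparing against the assembly version. Going the other way (splitting a long filter into 8 tap sections) means factoring the coefficient polynomial, and a real cascade would also have to scale each section's output back to 16 bits, which this sketch does not cover.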
2015-09-23 03:21 AM
Thank you, surely it'll be interesting to see. It'll be wonderful if you could share the code on GitHub.