Fast FIR - less than one cycle per tap

Is there any interest in a block FIR routine (16-bit data) that executes, on average, one tap in less than one cycle?

The catch: it is fixed at 8 taps, and the result is not rounded.
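For anyone who wants something to compare against, here is a plain C reference model of what an 8-tap block FIR on 16-bit data computes. It is only a sketch, not the assembly routine itself: the name `fir8_block_ref`, the buffer layout (7 history samples stored in front of the new data) and the 32-bit output format are all my assumptions. It also assumes the coefficients are scaled so the unrounded sum fits in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIR_TAPS 8  /* fixed filter length */

/*
 * Reference model only (hypothetical name and signature).
 * x holds FIR_TAPS-1 history samples followed by n new samples,
 * so it must contain n + 7 entries. h holds the 8 coefficients.
 * y receives n raw accumulator values: no rounding, no scaling.
 * Assumption: coefficients are scaled so the sum fits in int32_t.
 */
void fir8_block_ref(const int16_t *x, const int16_t *h, int32_t *y, size_t n)
{
    assert(n % 8 == 0);  /* block size has to be a multiple of 8 */

    for (size_t i = 0; i < n; i++) {
        int32_t acc = 0;
        for (size_t k = 0; k < FIR_TAPS; k++) {
            /* x[i + 7] is the newest sample for output i */
            acc += (int32_t)h[k] * (int32_t)x[i + (FIR_TAPS - 1) - k];
        }
        y[i] = acc;
    }
}
```

Feeding it a unit impulse as the first new sample (with zeroed history) should return the coefficients in order, which makes a quick sanity check for an optimised version.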

The filter has low start-up and exit overhead and is easy to cascade. Calculating the coefficients for cascaded sections is still a bit of a mystery to me, though. The data block size has to be a multiple of 8.
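One part of the cascade question that is not a mystery is the forward direction: two FIR sections in series behave like a single filter whose coefficients are the convolution of the two coefficient sets, so two 8-tap sections give 15 effective taps. The sketch below shows only that step. The helper name `cascade_taps` is hypothetical, and it ignores whatever scaling or truncation back to 16 bits happens between sections. Going the other way (splitting a long target filter into 8-tap sections) means factoring its transfer function, which this does not attempt.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: combined coefficients of two cascaded FIR sections.
 * c = a convolved with b; c must have room for na + nb - 1 entries.
 * Ideal arithmetic only: inter-stage scaling/truncation is not modelled.
 */
void cascade_taps(const int16_t *a, size_t na,
                  const int16_t *b, size_t nb,
                  int64_t *c)
{
    for (size_t i = 0; i < na + nb - 1; i++) {
        c[i] = 0;
    }
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            /* tap i of section A and tap j of section B land on tap i + j */
            c[i + j] += (int64_t)a[i] * (int64_t)b[j];
        }
    }
}
```

Comparing the combined coefficients against the intended overall response is one way to check a candidate pair of sections before running them on hardware.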

When I finally get my hands on some STM32L4 hardware, I can verify and optimise the load/store cycles and tidy up the code.

Not sure how useful this thing is, but it was certainly good assembly practice.