The ARM CMSIS-DSP library shifts the contents of its buffers rather than using true circular buffers. This keeps items in bounds during loops without spending instructions on bounds-check branches and on wrapping pointers back to the beginning of the buffer. For an application I'm working on with an STM32F767ZG, I tried memcpy for this shift, and the performance was terrible regardless of the optimization setting. Instead, I wrote a circular shift function to do this on one of my buffers before running it through the processing stages.
What's interesting is that unrolling the loop beyond a certain factor actually makes the performance much worse. Also, the best performance was achieved at O1, which I find weird.
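For reference, the source and destination ranges of this shift overlap, and the C standard leaves memcpy undefined for overlapping regions; memmove is the library call defined for that case. A minimal sketch, assuming a 1024-float delay line and a 32-sample block. The function name and both constants are made up for illustration, and plain float stands in for CMSIS's float32_t so the snippet builds anywhere:

```c
#include <string.h>   /* memmove */

/* Assumed sizes, taken from the example in this post. */
#define DELAY_LEN 1024u   /* total floats in the delay line */
#define BLOCK_LEN 32u     /* floats discarded from the front on each shift */

/* Hypothetical helper: move elements BLOCK_LEN..DELAY_LEN-1 down to
 * 0..DELAY_LEN-BLOCK_LEN-1. memmove is used because the two ranges overlap. */
void delay_line_shift_memmove(float *delay_line)
{
    memmove(delay_line,
            delay_line + BLOCK_LEN,
            (DELAY_LEN - BLOCK_LEN) * sizeof(float));
}
```

Whether this performs any better than memcpy on this part depends on the toolchain's library implementation, so it would need to be timed the same way as the hand-written loop.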
Example 1 : I'm going to shift 992 floats from 32:1023, down to 0:991.
On function entry, I pre-compute the values I need...
void delay_line_circ_shift(float32_t * delay_line)
uint32_t k = 992;                           /* number of floats to move */
float32_t * delay_line1 = &delay_line[0];   /* destination: index 0 */
float32_t * delay_line2 = &delay_line[32];  /* source: index 32 */
then perform the relocation as follows:
*delay_line1++ = *delay_line2++;
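Put together, the fragments above might look like the following. This is only a sketch: the exact form of the while loop is not shown in the post, so the `while (k > 0u)` with a `k--` in the body is an assumption, as are the function name and the local typedef (CMSIS-DSP normally supplies float32_t through arm_math.h).

```c
#include <stdint.h>

/* Assumption: stand-in for the CMSIS-DSP typedef so this builds on a host. */
typedef float float32_t;

/* Hypothetical reconstruction: shift floats 32..1023 down to 0..991. */
void delay_line_circ_shift_sketch(float32_t *delay_line)
{
    uint32_t k = 992u;                          /* floats to move */
    float32_t *dst = &delay_line[0];            /* write pointer */
    const float32_t *src = &delay_line[32];     /* read pointer, 32 ahead */

    /* Copying forward is safe here because dst always trails src. */
    while (k > 0u) {
        *dst++ = *src++;
        k--;
    }
}
```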
At O0, after 1,000 tests, the worst case is ~14,000 cycles. Looking at the assembly, I see individual LDR, ADD, STR for everything, which obviously doesn't take advantage of the M7's pipeline.
At O1, the worst case is 2176 cycles, attributable to the compiler's use of LDR.W and STR.W and to interleaving them to take advantage of the pipeline.
At O2 and O3 the code blows back up and takes over 11,000 cycles; the compiler groups the loads together and the stores together rather than interleaving them.
In no case does it use a compare-and-branch instruction, even though the while loop is explicitly set up to compare against 0 and branch if necessary... that's also something I don't understand.
SO, WHY? Why does it do a "worse" job of compiling for the chip architecture when the optimization level is increased?
Example 2 : I'm going to run the same function but manually unroll the loop by various factors (adjusting my starting k and the k -= step accordingly). So O1 for all...
1x --> 2176
8x --> 4167, and a continual decrease in performance with every increase in loop unroll...
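For clarity, a manual unroll by 8 of this kind of loop could look like the sketch below. The loop shape, the function name, and the local typedef are assumptions, since the unrolled source is not shown in the post; 992 divides evenly by 8 (992 = 8 * 124), so no remainder loop is needed.

```c
#include <stdint.h>

/* Assumption: stand-in for the CMSIS-DSP typedef so this builds on a host. */
typedef float float32_t;

/* Hypothetical 8x unrolled variant: shift floats 32..1023 down to 0..991. */
void delay_line_circ_shift_x8_sketch(float32_t *delay_line)
{
    uint32_t k = 992u;                          /* 992 = 8 * 124 iterations */
    float32_t *dst = &delay_line[0];
    const float32_t *src = &delay_line[32];

    while (k > 0u) {
        /* Eight element moves per loop test instead of one. */
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        k -= 8u;
    }
}
```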
SO, WHY? Why does it blow back up when I unroll by a factor of 8, 16, 32, etc... I don't have either cache enabled, so this shouldn't affect performance negatively. Is this due to the number of special registers available or something along those lines?
With that said, what I'm realizing is that in certain cases I can't/shouldn't rely on the compiler if it's not producing the best results, especially in frequently used, small utility functions like this. So, having no real background in CS, I'm wondering if anyone can point me to good basic resources for learning how to write ASM or inline ASM. There is a semi-helpful section in Yiu's Definitive Guide to Cortex-M, but I need a more thorough, tutorial-like overview. I was able to teach myself C from K&R and van der Linden's Deep C Secrets, so something along those lines would be great.