Why things run (significantly) slower in certain optimization settings...? Resources for learning ASM.

Donald Bosley
Associate II
Posted on August 26, 2017 at 07:03

The ARM CMSIS-DSP library uses circular shifts of buffers rather than true circular buffers, so that items stay in bounds during loops without spending instructions on bounds checks and on wrapping pointers back to the start of the buffer. For an application I'm working on with the STM32F767ZG, I tried memcpy for this, and the performance is terrible regardless of the optimization setting. Instead, I wrote a circular shift function to apply to one of my buffers before running it through the processing stages.

What's interesting is that loop unrolling beyond a certain factor actually grossly decreases the performance. Also, the best performance was achieved at O1, which I find weird.

Example 1 : I'm going to shift 992 floats from 32:1023 down to 0:991.

On function entry, I pre-compute the values I need...

void delay_line_circ_shift(float32_t * delay_line)
{
   uint32_t k = 992;
   float32_t * delay_line1 = &delay_line[0];
   float32_t * delay_line2 = &delay_line[32];

then perform the relocation as follows, counting k down to 0:

   while (k--)
   {
      *delay_line1++ = *delay_line2++;
   }
}

At O0, after 1,000 tests, the worst case is ~14,000 cycles. Looking at the assembly, I see individual LDR, ADD, and STR instructions for everything, which obviously doesn't take advantage of the M7's pipeline.

At O1, the worst case is 2,176 cycles - attributable to the compiler's use of LDR.W and STR.W, interleaved to take advantage of the pipeline.

At O2 and O3 the code blows back up to over 11,000 cycles; the compiler separates the loads and stores into groups of each rather than interleaving them.

In no case does it use a compare-and-branch instruction, even though the while loop is distinctly set up to compare against 0 and branch - also something I don't understand.

SO, WHY? Why does it do a 'worse' job of compiling for the chip architecture when the optimization level is increased? 

Example 2 : I'm going to run the same function but manually unroll the loop by various factors (adjusting my starting pointer and the k -= step accordingly). O1 for all...

1x --> 2176

2x --> 1837

4x --> 1692

8x --> 4167, and performance keeps dropping with every further increase in the unroll factor...


SO, WHY? Why does it blow back up when I unroll by a factor of 8, 16, 32, etc.? I don't have either cache enabled, so that shouldn't be hurting performance. Is this due to the number of available registers, or something along those lines?

With that said, what I'm realizing is that in certain cases I can't/shouldn't rely on the compiler if it's not producing the most optimal results - especially in small, frequently used utility functions like this. So, having no real background in CS, I'm wondering if anyone can point me to good basic resources for learning how to write ASM or inline ASM. There is a semi-helpful section in Yiu's Definitive Guide to Cortex-M, but I need a more thorough, tutorial-like overview. I was able to teach myself C from K&R and van der Linden's Deep C Secrets, so something along those lines would be great.

#arm-compiler #stm32f7 #assembly #compiler-optimization
Posted on August 26, 2017 at 14:48

GCC assumes the branch instruction is among the slowest for the processor, which is partly true. When optimizing, GCC tries every way it can to get rid of branches, unrolling loops into runs of identical instructions. But GCC has no mechanism for measuring which instructions actually execute slowly; it only has heuristics for choosing one code-generation strategy or another. That is, GCC does not test the alternatives for speed - it simply selects a pre-defined code pattern.

In your case the load/store instructions are slower than the branch, and for this reason GCC does not use the post-incrementing forms ('load from address and increment', 'store to address and increment'). Those forms execute one clock slower than a plain load or store, whereas a separate increment instruction can execute in parallel with the load or store. This is a property of the pipeline, and it works.

When GCC unrolls the loop, an extra address-increment instruction is used for each memory operation. Even then the pipeline keeps working, and it speeds the code up significantly. But here a rake comes from the processor itself, more precisely from its flash memory. While the executable code fits into one complete flash prefetch line, everything flies. But as soon as a new section of code has to be fetched (outside the first prefetch), the processor has to wait for the slowest hardware on the die: the flash.

GCC cannot predict instruction-fetch timing from flash on its own - or rather, nobody has told it about that. To successfully control the placement of code you would have to write a lot of scripts for GCC itself, and that is already assembly-programmer territory.

To keep GCC from improvising, you have to write out exactly what should be done, regardless of GCC's own warnings.

void delay_line_circ_shift(uint32_t * delay_line)
{
   uint32_t *line1 = delay_line;   // the data are moved by address; the element format does not matter
   uint32_t *line2 = line1 + 32;
   uint32_t *lineoff = line1 + 992;

   do { *line1++ = *line2++; } while (line1 != lineoff);
}

Posted on August 27, 2017 at 03:08

When performance is needed, some understanding of how the generated asm works is needed in order to write the C code the intended way. Compilers are not aware of CCRAM or caches, nor will they use the DMA hardware for memory block transfers.

Posted on August 29, 2017 at 11:24

Interesting, thorough analysis - thank you.


'But here a rake comes from the processor itself, more precisely from its flash memory.' - I think that's the part that answers why, when I manually unroll the loop beyond a certain factor, performance tanks. What's weird, though, is that turning on branch prediction / prefetch should eliminate the branch penalty.

So I guess my question is: short of trial and error, which is what I've been doing to see what runs fastest, what would you recommend for learning which C compiles into the most efficient code for the architecture? Most application notes just give the old 'interleave loads and stores to take advantage of the pipeline' advice. I know some of the classic tricks, like counting down instead of up so the compare-against-0 and the branch combine, but those are very general cases.

I'm going to benchmark your version at each optimization level and see how it does...