Why does CORDIC performance vary so much with optimizations?

BPaik · ‎2023-05-18

I am using the CORDIC on an STM32G4 for a project and was curious about the difference in performance based on the compiler optimization level. With no optimizations, LL_CORDIC_FUNCTION_PHASE completed in 113 cycles. With O1 optimizations, LL_CORDIC_FUNCTION_PHASE completed in 31 cycles.

I'm a bit confused by the huge disparity in execution times with and without optimization. Since the trig calculations occur in the CORDIC hardware, I wouldn't expect optimizations to have any effect. Additionally, the LL CORDIC function calls are simply inline functions that access the peripheral registers, so I wouldn't expect there to be much to optimize there either.

I took a look at the generated assembly and while I don't fully understand it, it seems that the difference is primarily in additional debug features. This makes sense, but I would like some confirmation that this is the case. Thanks.

    g_start_time = SysTick->VAL;
 
    LL_CORDIC_WriteData(hcordic.Instance, cordic_input[0]);
    LL_CORDIC_WriteData(hcordic.Instance, cordic_input[1]);
    cordic_output = (int32_t)LL_CORDIC_ReadData(hcordic.Instance);
 
    g_stop_time = SysTick->VAL;
    g_elapsed_time = g_start_time - g_stop_time;
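
A side note on the measurement itself: SysTick->VAL is a 24-bit down-counter, so start minus stop is the right direction, but a plain subtraction can give a wrong result if the counter reloads between the two reads. A minimal sketch of a wrap-tolerant difference follows. The helper name systick_elapsed is hypothetical, and it assumes at most one reload between the two reads and that the caller passes in the value programmed into SysTick->LOAD.

```c
#include <stdint.h>

/* Elapsed ticks between two reads of a down-counter such as SysTick->VAL.
 * "reload" is the value programmed into SysTick->LOAD, so one full period
 * is reload + 1 ticks. Assumes the counter reloaded at most once between
 * the two reads (hypothetical helper, not part of the LL/HAL API). */
static inline uint32_t systick_elapsed(uint32_t start, uint32_t stop,
                                       uint32_t reload)
{
    if (start >= stop) {
        /* No reload happened: the counter simply counted down. */
        return start - stop;
    }
    /* The counter passed zero and reloaded once. */
    return start + (reload + 1U) - stop;
}
```

With this helper the last line of the snippet would become something like g_elapsed_time = systick_elapsed(g_start_time, g_stop_time, SysTick->LOAD).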

waclawek.jan · ‎2023-05-18

The unoptimized version makes an actual function call:

bl LL_CORDIC_WriteData

This incurs function call overhead, branch-related delays (prefetch reload etc.) and extra data movement, because the ABI requires function parameters to be passed in specific registers and r0-r3 are not preserved across a call. The unoptimized code also reloads hcordic.Instance from memory before every call.

The optimized version has inlined the function.

Note that the standard C "inline" keyword (a function specifier) is only a *hint*, i.e. compilers are not bound to obey it, and gcc does not inline anything when not optimizing. gcc implements an "always_inline" attribute which forces inlining regardless of the optimization level, but I personally don't see any reason to avoid optimization.

JW
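
The difference between the "inline" hint and the gcc attribute mentioned above can be sketched as follows. This is an illustration only: fake_cordic_t, write_data_hint and write_data_forced are hypothetical names, and the struct is a stand-in for the real peripheral so that the snippet also compiles on a PC. The WDATA and RDATA offsets (4 and 8) match the str/ldr offsets in the optimized disassembly in this thread.

```c
#include <stdint.h>

/* Hypothetical stand-in for the CORDIC register block. */
typedef struct {
    volatile uint32_t CSR;    /* offset 0 */
    volatile uint32_t WDATA;  /* offset 4 */
    volatile uint32_t RDATA;  /* offset 8 */
} fake_cordic_t;

/* Plain "static inline": only a hint. Without optimization gcc
 * normally emits a real call (bl) to this function. */
static inline void write_data_hint(fake_cordic_t *c, uint32_t value)
{
    c->WDATA = value;
}

/* gcc/clang extension: the body is inlined at every call site,
 * even when compiling without optimization. */
static inline __attribute__((always_inline))
void write_data_forced(fake_cordic_t *c, uint32_t value)
{
    c->WDATA = value;
}
```

Compiling a caller of both functions with -O0 -S and comparing the output should show a bl for the first and a plain str for the second. Even so, the surrounding code stays unoptimized, so this is not a substitute for enabling optimization.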

waclawek.jan · ‎2023-05-18

Please post the disassembly of this section for both optimization settings.

JW

S.Ma · ‎2023-05-18

Hmm. If the CORDIC is a memory-mapped hardware peripheral with a done flag to poll, then timing a CORDIC operation should involve very few memory transfers.

Use the debug cycle counter (DWT->CYCCNT) to take a timestamp when writing the word that starts the CORDIC calculation, and take a second snapshot once the CORDIC is done. The difference may be less dependent on the compiler. SysTick resolution can be much coarser than that of the core cycle counter, for example when it is clocked from HCLK/8.
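
The cycle-counter approach described above could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, the target-only part is guarded so that the arithmetic helper also compiles on a PC, and the device header stm32g4xx.h is assumed to be on the include path with the usual device define set by the project. DWT, CoreDebug and the two _Msk constants are the standard CMSIS core names.

```c
#include <stdint.h>

/* Elapsed cycles between two reads of a free-running 32-bit up-counter
 * such as DWT->CYCCNT. Unsigned subtraction is modulo 2^32, so a single
 * counter wrap between the two reads is handled automatically. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t stop)
{
    return stop - start;
}

#if defined(__ARM_ARCH_7EM__)   /* only when building for a Cortex-M4 class target */
#include "stm32g4xx.h"          /* assumed device header; provides DWT and CoreDebug */

/* Enable the trace block, reset the cycle counter and start it.
 * Call once before the first measurement. */
static inline void cycle_counter_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0U;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
}

/* Snapshot of the current cycle count. */
static inline uint32_t cycle_counter_read(void)
{
    return DWT->CYCCNT;
}
#endif
```

On the target, the measurement would then be: call cycle_counter_init() once, read t0 = cycle_counter_read() just before the first LL_CORDIC_WriteData(), and compute cycles_elapsed(t0, cycle_counter_read()) right after LL_CORDIC_ReadData(). The counter reads themselves cost a few cycles, so the result is still an upper bound on the CORDIC time rather than an exact figure.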

BPaik · ‎2023-05-18

Here is the unoptimized assembly:

cordic_atan2:
.LFB164:
	.loc 2 234 1
	.cfi_startproc
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 1, uses_anonymous_args = 0
	push	{r7, lr}
	.cfi_def_cfa_offset 8
	.cfi_offset 7, -8
	.cfi_offset 14, -4
	add	r7, sp, #0
	.cfi_def_cfa_register 7
	.loc 2 235 3
	ldr	r3, .L20
	ldr	r3, [r3]
	.loc 2 235 53
	ldr	r2, .L20+4
	ldr	r2, [r2]
	.loc 2 235 3
	mov	r1, r2
	mov	r0, r3
	bl	LL_CORDIC_WriteData
	.loc 2 236 3
	ldr	r3, .L20
	ldr	r3, [r3]
	.loc 2 236 53
	ldr	r2, .L20+4
	ldr	r2, [r2, #4]
	.loc 2 236 3
	mov	r1, r2
	mov	r0, r3
	bl	LL_CORDIC_WriteData
	.loc 2 237 28
	ldr	r3, .L20
	ldr	r3, [r3]
	mov	r0, r3
	bl	LL_CORDIC_ReadData
	mov	r3, r0
	.loc 2 237 19
	mov	r2, r3
	.loc 2 237 17
	ldr	r3, .L20+8
	str	r2, [r3]
	.loc 2 238 1
	nop
	pop	{r7, pc}
.L21:
	.align	2
.L20:
	.word	hcordic
	.word	cordic_input
	.word	cordic_output
	.cfi_endproc

And here is the O1 optimized assembly:

cordic_atan2:
.LFB164:
	.loc 1 234 1 is_stmt 1 view -0
	.cfi_startproc
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	.loc 1 235 3 view .LVU154
	ldr	r3, .L21
	ldr	r3, [r3]
	.loc 1 235 53 is_stmt 0 view .LVU155
	ldr	r2, .L21+4
	.loc 1 235 3 view .LVU156
	ldr	r1, [r2]
.LVL27:
.LBB42:
.LBI42:
	.loc 3 732 20 is_stmt 1 view .LVU157
.LBB43:
	.loc 3 734 3 view .LVU158
	.loc 3 734 21 is_stmt 0 view .LVU159
	str	r1, [r3, #4]
.LVL28:
	.loc 3 734 21 view .LVU160
.LBE43:
.LBE42:
	.loc 1 236 3 is_stmt 1 view .LVU161
	ldr	r2, [r2, #4]
.LVL29:
.LBB44:
.LBI44:
	.loc 3 732 20 view .LVU162
.LBB45:
	.loc 3 734 3 view .LVU163
	.loc 3 734 21 is_stmt 0 view .LVU164
	str	r2, [r3, #4]
.LVL30:
	.loc 3 734 21 view .LVU165
.LBE45:
.LBE44:
	.loc 1 237 3 is_stmt 1 view .LVU166
.LBB46:
.LBI46:
	.loc 3 743 24 view .LVU167
.LBB47:
	.loc 3 745 3 view .LVU168
	.loc 3 745 10 is_stmt 0 view .LVU169
	ldr	r2, [r3, #8]
.LVL31:
	.loc 3 745 10 view .LVU170
.LBE47:
.LBE46:
	.loc 1 237 17 view .LVU171
	ldr	r3, .L21+8
	str	r2, [r3]
	.loc 1 238 1 view .LVU172
	bx	lr
.L22:
	.align	2
.L21:
	.word	hcordic
	.word	.LANCHOR0
	.word	.LANCHOR2
	.cfi_endproc
