Senior
May 18, 2023
Solved

Why does CORDIC performance vary so much with optimizations?


I am using the CORDIC on an STM32G4 for a project and was curious about the difference in performance based on the compiler optimization level. With no optimizations, LL_CORDIC_FUNCTION_PHASE completed in 113 cycles. With O1 optimizations, LL_CORDIC_FUNCTION_PHASE completed in 31 cycles.

I'm a bit confused by the huge disparity in execution times with and without optimization. Since the trig calculations occur in the CORDIC hardware, I wouldn't expect optimizations to have any effect. Additionally, the LL CORDIC function calls are simply inline functions that access the peripheral registers, so I wouldn't expect there to be much to optimize there either.

I took a look at the generated assembly and while I don't fully understand it, it seems that the difference is primarily in additional debug features. This makes sense, but I would like some confirmation that this is the case. Thanks.

 g_start_time = SysTick->VAL;
 
 LL_CORDIC_WriteData(hcordic.Instance, cordic_input[0]);
 LL_CORDIC_WriteData(hcordic.Instance, cordic_input[1]);
 cordic_output = (int32_t)LL_CORDIC_ReadData(hcordic.Instance);
 
 g_stop_time = SysTick->VAL;
 g_elapsed_time = g_start_time - g_stop_time;
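
One detail of the measurement above: SysTick is a 24-bit down counter, so the plain subtraction only holds if the counter did not reload between the two reads. A minimal sketch of a wrap-aware difference follows; `systick_elapsed` is a hypothetical helper name, `reload` stands for the value programmed into `SysTick->LOAD`, and at most one reload between the two reads is assumed.

```c
#include <stdint.h>

/* Hypothetical helper: ticks elapsed between two reads of a down counter
 * that reloads to `reload` after reaching zero (as SysTick does).
 * Assumes the counter reloaded at most once between the two reads. */
static inline uint32_t systick_elapsed(uint32_t start, uint32_t stop,
                                       uint32_t reload)
{
    if (start >= stop) {
        /* No reload in between: plain difference of a down counter. */
        return start - stop;
    }
    /* One reload: count down from start to zero, one tick for the
     * reload itself, then from reload down to stop. */
    return start + (reload + 1u) - stop;
}
```

The result still includes the cost of the two `SysTick->VAL` reads themselves, so it is an upper bound on the code in between.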

Best answer by waclawek.jan

The unoptimized version calls a function

bl   LL_CORDIC_WriteData

This adds function call overhead: delays associated with the jumps (prefetch reload etc.), plus moving data around, because the ABI requires function parameters to be in certain registers and does not preserve r0-r3 across calls.

The optimized version has inlined the function.

Note that the standard C "inline" keyword (a function specifier) is only a *hint*, i.e. compilers are not bound to obey it. gcc implements an "always_inline" attribute which enforces inlining, but I personally don't see any reason to avoid optimization.
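
As a rough illustration of the attribute mentioned above, here is a minimal sketch using the GCC/Clang `__attribute__((always_inline))` syntax. `reg_write` is a hypothetical accessor written for this example, not the ST LL function.

```c
#include <stdint.h>

/* Hypothetical register accessor. The always_inline attribute asks
 * GCC/Clang to inline it even when optimization is disabled (-O0),
 * whereas plain "inline" is only a hint. */
static inline __attribute__((always_inline))
void reg_write(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;  /* single store, no call/return when inlined */
}
```

With this attribute the call site should compile to a store without a `bl`, even in an unoptimized build. I believe the CMSIS compiler headers wrap the same attribute as `__STATIC_FORCEINLINE`, but check your CMSIS version.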

JW

3 replies

waclawek.jan
Super User
May 18, 2023

Post disasm for this section for both optimization settings.

JW

BPaikAuthor
Senior
May 18, 2023

Here is the unoptimized assembly:

cordic_atan2:
.LFB164:
	.loc 2 234 1
	.cfi_startproc
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 1, uses_anonymous_args = 0
	push	{r7, lr}
	.cfi_def_cfa_offset 8
	.cfi_offset 7, -8
	.cfi_offset 14, -4
	add	r7, sp, #0
	.cfi_def_cfa_register 7
	.loc 2 235 3
	ldr	r3, .L20
	ldr	r3, [r3]
	.loc 2 235 53
	ldr	r2, .L20+4
	ldr	r2, [r2]
	.loc 2 235 3
	mov	r1, r2
	mov	r0, r3
	bl	LL_CORDIC_WriteData
	.loc 2 236 3
	ldr	r3, .L20
	ldr	r3, [r3]
	.loc 2 236 53
	ldr	r2, .L20+4
	ldr	r2, [r2, #4]
	.loc 2 236 3
	mov	r1, r2
	mov	r0, r3
	bl	LL_CORDIC_WriteData
	.loc 2 237 28
	ldr	r3, .L20
	ldr	r3, [r3]
	mov	r0, r3
	bl	LL_CORDIC_ReadData
	mov	r3, r0
	.loc 2 237 19
	mov	r2, r3
	.loc 2 237 17
	ldr	r3, .L20+8
	str	r2, [r3]
	.loc 2 238 1
	nop
	pop	{r7, pc}
.L21:
	.align	2
.L20:
	.word	hcordic
	.word	cordic_input
	.word	cordic_output
	.cfi_endproc

And here is the O1 optimized assembly:

cordic_atan2:
.LFB164:
	.loc 1 234 1 is_stmt 1 view -0
	.cfi_startproc
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	.loc 1 235 3 view .LVU154
	ldr	r3, .L21
	ldr	r3, [r3]
	.loc 1 235 53 is_stmt 0 view .LVU155
	ldr	r2, .L21+4
	.loc 1 235 3 view .LVU156
	ldr	r1, [r2]
.LVL27:
.LBB42:
.LBI42:
	.loc 3 732 20 is_stmt 1 view .LVU157
.LBB43:
	.loc 3 734 3 view .LVU158
	.loc 3 734 21 is_stmt 0 view .LVU159
	str	r1, [r3, #4]
.LVL28:
	.loc 3 734 21 view .LVU160
.LBE43:
.LBE42:
	.loc 1 236 3 is_stmt 1 view .LVU161
	ldr	r2, [r2, #4]
.LVL29:
.LBB44:
.LBI44:
	.loc 3 732 20 view .LVU162
.LBB45:
	.loc 3 734 3 view .LVU163
	.loc 3 734 21 is_stmt 0 view .LVU164
	str	r2, [r3, #4]
.LVL30:
	.loc 3 734 21 view .LVU165
.LBE45:
.LBE44:
	.loc 1 237 3 is_stmt 1 view .LVU166
.LBB46:
.LBI46:
	.loc 3 743 24 view .LVU167
.LBB47:
	.loc 3 745 3 view .LVU168
	.loc 3 745 10 is_stmt 0 view .LVU169
	ldr	r2, [r3, #8]
.LVL31:
	.loc 3 745 10 view .LVU170
.LBE47:
.LBE46:
	.loc 1 237 17 view .LVU171
	ldr	r3, .L21+8
	str	r2, [r3]
	.loc 1 238 1 view .LVU172
	bx	lr
.L22:
	.align	2
.L21:
	.word	hcordic
	.word	.LANCHOR0
	.word	.LANCHOR2
	.cfi_endproc
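
To relate the optimized listing to the C source: the two `str` instructions at offset `#4` and the `ldr` at offset `#8` are what remains of the inlined write and read accessors. The sketch below mirrors that with a stand-in register block. `cordic_regs_t`, `cordic_write` and `cordic_read` are hypothetical names for illustration, and only the offsets visible in the assembly are assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the CORDIC register block. The offsets
 * match the listing: write data at +4, read data at +8. */
typedef struct {
    volatile uint32_t CSR;   /* 0x00: control/status */
    volatile uint32_t WDATA; /* 0x04: argument input  */
    volatile uint32_t RDATA; /* 0x08: result output   */
} cordic_regs_t;

/* Once inlined, this is the single "str rX, [r3, #4]". */
static inline void cordic_write(cordic_regs_t *c, uint32_t value)
{
    c->WDATA = value;
}

/* Once inlined, this is the single "ldr r2, [r3, #8]". */
static inline uint32_t cordic_read(const cordic_regs_t *c)
{
    return c->RDATA;
}
```

In the unoptimized build each of these one-line bodies is reached through a `bl`, with the arguments first copied into r0/r1.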

S.Ma
Principal
May 18, 2023

Hmm. If the CORDIC is a set of memory-mapped hardware registers with a done flag to poll, then timing a CORDIC operation should involve minimal memory transfers.

Use the debug cycle counter (DWT CYCCNT) to take a timestamp when writing the command/execution byte to the CORDIC, and take another once the CORDIC is done. The difference may be less dependent on the compiler. SysTick granularity is much coarser than that of the core cycle counter.
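
A minimal sketch of the cycle counter approach, assuming a Cortex-M4 core with the DWT unit present. The registers are accessed by their architectural addresses so the block stands alone; with the CMSIS device header these are `CoreDebug->DEMCR`, `DWT->CTRL` and `DWT->CYCCNT`. The function names are hypothetical.

```c
#include <stdint.h>

/* Cortex-M debug registers by architectural address (assumption: the
 * DWT cycle counter is implemented on the target). */
#define DEMCR_REG      (*(volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL_REG   (*(volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT_REG (*(volatile uint32_t *)(uintptr_t)0xE0001004u)

#define DEMCR_TRCENA      (1u << 24) /* enables the DWT unit     */
#define DWT_CTRL_CYCCNTEN (1u << 0)  /* starts the cycle counter */

/* Call once on the target before taking any timestamps. */
static inline void cycle_counter_enable(void)
{
    DEMCR_REG |= DEMCR_TRCENA;
    DWT_CYCCNT_REG = 0u;
    DWT_CTRL_REG |= DWT_CTRL_CYCCNTEN;
}

/* Timestamp in core clock cycles (free-running 32-bit up counter). */
static inline uint32_t cycle_counter_read(void)
{
    return DWT_CYCCNT_REG;
}

/* Unsigned subtraction stays correct across one 32-bit wrap. */
static inline uint32_t cycles_between(uint32_t start, uint32_t stop)
{
    return stop - start;
}
```

Usage would be to read the counter just before the first CORDIC write and again just after the result read, then pass both values to `cycles_between`. Only `cycles_between` is meaningful off-target; the other two functions touch hardware addresses.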
