2023-05-18 11:49 AM
I am using the CORDIC on an STM32G4 for a project and was curious how much the compiler optimization level affects performance. With no optimizations, a LL_CORDIC_FUNCTION_PHASE calculation completed in 113 cycles. With -O1, the same calculation completed in 31 cycles.
I'm a bit confused by the huge disparity in execution times with and without optimization. Since the trig calculations occur in the CORDIC hardware, I wouldn't think that optimizations would have any effect. Additionally, the LL CORDIC function calls are simply inlines that access the peripheral registers, so I wouldn't think there would be much to optimize there either.
I took a look at the generated assembly and while I don't fully understand it, it seems that the difference is primarily in additional debug features. This makes sense, but I would like some confirmation that this is the case. Thanks.
g_start_time = SysTick->VAL;
LL_CORDIC_WriteData(hcordic.Instance, cordic_input[0]);
LL_CORDIC_WriteData(hcordic.Instance, cordic_input[1]);
cordic_output = (int32_t)LL_CORDIC_ReadData(hcordic.Instance);
g_stop_time = SysTick->VAL;
g_elapsed_time = g_start_time - g_stop_time;
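A side note on the subtraction above: SysTick counts down and reloads when it reaches zero, so start - stop is only valid if the counter did not reload between the two reads. A minimal sketch of a wrap-aware version, assuming at most one reload during the measurement (systick_elapsed is a made-up helper name, not part of the LL or HAL API, and reload stands for the value in SysTick->LOAD):

```c
#include <stdint.h>

/* Elapsed ticks on a down-counter that reloads to `reload` after reaching 0.
 * Hypothetical helper: on target, pass SysTick->LOAD as `reload`.
 * Assumes the counter reloaded at most once between the two reads. */
static inline uint32_t systick_elapsed(uint32_t start, uint32_t stop,
                                       uint32_t reload)
{
    if (start >= stop) {
        return start - stop;              /* no reload between the reads */
    }
    /* start..0, one tick for the reload itself, then reload..stop */
    return start + (reload + 1u) - stop;
}
```

With this, the last line of the snippet would become g_elapsed_time = systick_elapsed(g_start_time, g_stop_time, SysTick->LOAD);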
2023-05-18 12:12 PM
Post disasm for this section for both optimization settings.
JW
2023-05-18 12:23 PM
Hmm. If the CORDIC is a set of memory-mapped hardware registers with a done flag to poll, then timing a single CORDIC operation should involve only a handful of memory transfers.
Use the debug cycle counter (DWT->CYCCNT on a Cortex-M4) to take a timestamp when writing the command/execution word to the CORDIC, and take another once the CORDIC is done. The difference may be less dependent on the compiler. SysTick is only a 24-bit counter and, depending on its clock source, can be coarser than the core cycle counter.
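For reference, a rough sketch of the cycle-counter approach. It assumes a CMSIS device header (for example stm32g4xx.h) has been included before this code, which is what makes DWT and CoreDebug visible. The helper names cyccnt_enable and cycles_between are made up for illustration. Without the device header, only the arithmetic helper is compiled.

```c
#include <stdint.h>

/* Cycles between two reads of a free-running 32-bit up-counter such as
 * DWT->CYCCNT. Unsigned subtraction stays correct across one overflow. */
static inline uint32_t cycles_between(uint32_t start, uint32_t stop)
{
    return stop - start;
}

#if defined(DWT) && defined(CoreDebug)  /* only when a CMSIS header is present */
/* Hypothetical helper: turn on the DWT cycle counter once at startup. */
static inline void cyccnt_enable(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the trace block */
    DWT->CYCCNT = 0u;                                /* start from zero */
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting cycles */
}
#endif
```

On target the measurement would then look like: start = DWT->CYCCNT; write the two inputs; read the result; stop = DWT->CYCCNT; elapsed = cycles_between(start, stop). Timing an empty section the same way gives the read overhead to subtract.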
2023-05-18 02:32 PM
Here is the unoptimized assembly:
cordic_atan2:
.LFB164:
.loc 2 234 1
.cfi_startproc
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 1, uses_anonymous_args = 0
push {r7, lr}
.cfi_def_cfa_offset 8
.cfi_offset 7, -8
.cfi_offset 14, -4
add r7, sp, #0
.cfi_def_cfa_register 7
.loc 2 235 3
ldr r3, .L20
ldr r3, [r3]
.loc 2 235 53
ldr r2, .L20+4
ldr r2, [r2]
.loc 2 235 3
mov r1, r2
mov r0, r3
bl LL_CORDIC_WriteData
.loc 2 236 3
ldr r3, .L20
ldr r3, [r3]
.loc 2 236 53
ldr r2, .L20+4
ldr r2, [r2, #4]
.loc 2 236 3
mov r1, r2
mov r0, r3
bl LL_CORDIC_WriteData
.loc 2 237 28
ldr r3, .L20
ldr r3, [r3]
mov r0, r3
bl LL_CORDIC_ReadData
mov r3, r0
.loc 2 237 19
mov r2, r3
.loc 2 237 17
ldr r3, .L20+8
str r2, [r3]
.loc 2 238 1
nop
pop {r7, pc}
.L21:
.align 2
.L20:
.word hcordic
.word cordic_input
.word cordic_output
.cfi_endproc
And here is the O1 optimized assembly:
cordic_atan2:
.LFB164:
.loc 1 234 1 is_stmt 1 view -0
.cfi_startproc
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
.loc 1 235 3 view .LVU154
ldr r3, .L21
ldr r3, [r3]
.loc 1 235 53 is_stmt 0 view .LVU155
ldr r2, .L21+4
.loc 1 235 3 view .LVU156
ldr r1, [r2]
.LVL27:
.LBB42:
.LBI42:
.loc 3 732 20 is_stmt 1 view .LVU157
.LBB43:
.loc 3 734 3 view .LVU158
.loc 3 734 21 is_stmt 0 view .LVU159
str r1, [r3, #4]
.LVL28:
.loc 3 734 21 view .LVU160
.LBE43:
.LBE42:
.loc 1 236 3 is_stmt 1 view .LVU161
ldr r2, [r2, #4]
.LVL29:
.LBB44:
.LBI44:
.loc 3 732 20 view .LVU162
.LBB45:
.loc 3 734 3 view .LVU163
.loc 3 734 21 is_stmt 0 view .LVU164
str r2, [r3, #4]
.LVL30:
.loc 3 734 21 view .LVU165
.LBE45:
.LBE44:
.loc 1 237 3 is_stmt 1 view .LVU166
.LBB46:
.LBI46:
.loc 3 743 24 view .LVU167
.LBB47:
.loc 3 745 3 view .LVU168
.loc 3 745 10 is_stmt 0 view .LVU169
ldr r2, [r3, #8]
.LVL31:
.loc 3 745 10 view .LVU170
.LBE47:
.LBE46:
.loc 1 237 17 view .LVU171
ldr r3, .L21+8
str r2, [r3]
.loc 1 238 1 view .LVU172
bx lr
.L22:
.align 2
.L21:
.word hcordic
.word .LANCHOR0
.word .LANCHOR2
.cfi_endproc
2023-05-18 03:51 PM
The unoptimized version calls a function
bl LL_CORDIC_WriteData
This brings function call overhead, branch-related delays (prefetch reload etc.), and extra data movement, because the ABI requires function parameters to be passed in specific registers and r0-r3 are not preserved across the call.
The optimized version has inlined the function.
Note that the standard C "inline" keyword (a function specifier) is only a *hint*, i.e. compilers are not bound to obey it. gcc provides an always_inline function attribute which enforces inlining, but I personally don't see any reason to avoid optimization.
JW
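To illustrate the point about the inline hint versus gcc's always_inline attribute, here is a small sketch. The register block and both function names are hypothetical, not ST's headers or LL API. The field offsets simply mirror the stores and load in the -O1 listing above (write at offset 4, read at offset 8).

```c
#include <stdint.h>

/* Hypothetical register block, not ST's CORDIC_TypeDef. */
typedef struct {
    volatile uint32_t CSR;    /* offset 0 */
    volatile uint32_t WDATA;  /* offset 4 */
    volatile uint32_t RDATA;  /* offset 8 */
} fake_cordic_t;

/* Plain "static inline": only a hint. Without optimization gcc normally
 * emits an out-of-line copy and calls it with bl. */
static inline void write_data_hint(fake_cordic_t *p, uint32_t value)
{
    p->WDATA = value;
}

/* With always_inline, gcc inlines the body even when optimization is off,
 * so the call collapses to a single store. */
static inline __attribute__((always_inline))
void write_data_forced(fake_cordic_t *p, uint32_t value)
{
    p->WDATA = value;
}
```

Both functions behave identically. The difference only shows up in the generated code when building without optimization, which can be checked by comparing the disassembly of the two call sites.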