2019-07-29 09:05 PM
First, a quick description of the scenario:
Code: X-CUBE-32F7PERF (attached) -> Project stm32f7_performances
Hardware: STM32F723 Discovery RevD01
IDE: MDK 5.27
Compiler: ARM Compiler 5.06, -O3 optimization, also optimized for time
CMSIS DSP: 1.5.2 / 1.7.0
Related Doc: AN4667, STM32F7 Series system architecture and performance (attached)
CMSIS DSP Revisioning: Change History
====================================================================
This project basically runs a 1024-point complex FFT, calculates the magnitude, then finds the maximum value, all using CMSIS DSP. Finally, it measures how many cycles these routines take using SysTick (why not simply use a 32-bit timer?). The project also provides several memory configurations so that their performance can be compared.
What I observed is that with my initial setting, code in ITCM-RAM and data in DTCM (project target 5 in the MDK settings), the cycle count is a bit worse than when executing from FLASH-ITCM! Later I found that the CMSIS DSP version has a HUGE impact on this!
With the same setting (code in ITCM-RAM, data in DTCM, project target 5 in MDK), choosing a different CMSIS DSP version changes the cycle count significantly. Here are my results:
CMSIS DSP 1.5.2 ==> 106193 cycles
CMSIS DSP 1.7.0 ==> 144520 cycles!! (36% more, what...?!)
====================================================================
So far, I've checked that the FPU is really running. I also had a quick look at the map file to make sure all math code is truly in the 0x0000 0000 ~ 0x0000 3FFF ITCM region. Map files for the different DSP versions are also attached. If the memory setting is ITCM-Flash or AXI-Flash, the results are only slightly different.
====================================================================
This is really weird. Does any F7 or DSP expert know what's going on here?
Many thanks for reading this question,
Zt.
2019-07-30 01:46 AM
Hey, after taking a further look at the map file, I just found that the build using the newer CMSIS DSP calls a sqrtf function which lies in the FLASH-ITCM region, not in ITCM-RAM!
0x0020212c 0x0020212c 0x0000003a Code RO 2750 i.__hardfp_sqrtf m_wm.l(sqrtf.o)
If I edit the .sct file so that this sqrtf is executed from ITCM, the new map file entry is:
0x00000da8 0x00202ca8 0x0000003a Code RO 2741 i.__hardfp_sqrtf m_wm.l(sqrtf.o)
The resulting cycle count drops to about 125057, which is still about 18% more than the 106193 cycles measured with CMSIS DSP 1.5.2.
I've also noticed that the older CMSIS DSP doesn't call this __hardfp_sqrtf routine, but only uses some inline FP operations.
Disassembly of __hardfp_sqrtf in the CMSIS DSP 1.7.0 build:
0x00000DA8 B510 PUSH {r4,lr}
0x00000DAA ED2D8B02 VPUSH.64 {d8-d8}
0x00000DAE EEB18AC0 VSQRT.F32 s16,s0
0x00000DB2 EE180A10 VMOV r0,s16
0x00000DB6 F0204000 BIC r0,r0,#0x80000000
0x00000DBA F1C040FF RSB r0,r0,#0x7F800000
0x00000DBE 0FC0 LSRS r0,r0,#31
0x00000DC0 D00A BEQ 0x00000DD8
0x00000DC2 EE100A10 VMOV r0,s0
0x00000DC6 F0204000 BIC r0,r0,#0x80000000
0x00000DCA F1C040FF RSB r0,r0,#0x7F800000
0x00000DCE 0FC0 LSRS r0,r0,#31
0x00000DD0 BF04 ITT EQ
0x00000DD2 2001 MOVS r0,#0x01
0x00000DD4 F200FB00 BL.W __set_errno (0x002013D8)
0x00000DD8 EEB00A48 VMOV.F32 s0,s16
0x00000DDC ECBD8B02 VPOP.64 {d8-d8}
0x00000DE0 BD10 POP {r4,pc}
0x00000DE2 0000 MOVS r0,r0
So this is probably the crucial part causing the performance difference.
Any ideas?
.zt
2019-08-16 03:00 AM
Hello @Community member ,
To get rid of this issue, you need to add sqrtf.o to the ITCM execution region in your scatter file.
For example, in M7-ITCM_rwDTCM.sct:
EXEC_REGION_ITCM 0x00000000 0x10000
{
    utilities.o (+RO)
    stm32h7xx_it.o (+RO)
    main.o (+RO-CODE)
    arm_bitreversal2.o (+RO-CODE)
    arm_cfft_f32.o (+RO-CODE)
    arm_cfft_radix8_f32.o (+RO-CODE)
    arm_cmplx_mag_f32.o (+RO-CODE)
    arm_max_f32.o (+RO-CODE)
    sqrtf.o (+RO-CODE)
    ; Place also const data in ITCM-RAM.
    main.o (+RO-DATA)
    arm_common_tables.o (+RO-DATA)
    arm_const_structs.o (+RO-DATA)
}
Please test this, and let me know if the number of cycles is reduced, as expected.
thanks,
Amel