Newer CMSIS DSP has a huge impact on ITCM-RAM execution!!

Zt Liu · ‎2019-07-29

First, a quick description of the scenario:

Code: X-CUBE-32F7PERF (attached) -> Project stm32f7_performances

Hardware: STM32F723 Discovery RevD01

IDE: MDK 5.27

Compiler: ARM Compiler 5.06, -O3 optimization, also optimized for time

CMSIS DSP: 1.5.2 / 1.7.0

Related Doc: AN 4667 STM32F7 Series system architecture and performance (attached)

CMSIS DSP Revisioning: Change History

====================================================================

This project basically runs a 1024-point complex FFT, calculates the magnitude, then finds the max value, all using CMSIS DSP. Finally, it measures how many cycles these routines take using SysTick (why not simply use a 32-bit timer?). The project also provides different memory settings so that we can compare their performance.
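On the SysTick remark: SysTick is a 24-bit down-counter, so the elapsed count has to be taken modulo 2^24. Below is a minimal sketch of that arithmetic, assuming a reload value of 0xFFFFFF and a measured interval shorter than one full counter period. The helper name `systick_elapsed` is hypothetical and is not taken from the X-CUBE-32F7PERF sources.

```c
#include <stdint.h>

/* SysTick counts DOWN and is only 24 bits wide. */
#define SYSTICK_MASK 0x00FFFFFFu

/* Cycles elapsed between two reads of the SysTick current-value register.
 * Assumption: reload value is 0xFFFFFF and the counter wrapped at most once
 * between the two reads. */
static uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    /* Down-counter: start is normally the larger value. The mask folds the
     * result back into 24 bits when the counter wrapped through zero. */
    return (start - end) & SYSTICK_MASK;
}
```

On the target, `start` and `end` would be reads of `SysTick->VAL` taken before and after the DSP calls. The cycle counts in this thread (around 10^5) fit comfortably in 24 bits; for longer measurements the 32-bit `DWT->CYCCNT` cycle counter of the Cortex-M7 would be an alternative.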

What I observed is that with my initial setting, code in ITCM-RAM and data in DTCM (project target 5 in the MDK settings), the resulting cycle count is a bit worse than when executing from FLASH-ITCM!! Later I found that the CMSIS DSP version has a HUGE impact on this!

With the same setting (code in ITCM-RAM and data in DTCM, project target 5 in MDK), choosing a different CMSIS DSP version gives significantly different cycle counts! Here are my results:

CMSIS DSP 1.5.2 ==> 106193 cycles

CMSIS DSP 1.7.0 ==> 144520 cycles!! (36% more, what...?!)
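For reference, the percentage is just the extra cycles divided by the baseline. A small sketch of that calculation (the helper name `percent_more` is made up for illustration):

```c
/* Extra cycles of `measured` relative to `baseline`, in percent. */
static double percent_more(unsigned long baseline, unsigned long measured)
{
    return 100.0 * ((double)measured - (double)baseline) / (double)baseline;
}
```

With the two counts above, `percent_more(106193UL, 144520UL)` comes out at roughly 36.1.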

====================================================================

So far, I've checked that the FPU is really running. I also had a quick look at the map file to make sure all math code is truly in the 0x0000 0000 ~ 0x0000 3FFF ITCM region. Map files for both DSP versions are attached. If the memory setting is ITCM-Flash or AXI-Flash, the results differ only slightly between the two versions.

====================================================================

This is really weird. Does any F7 or DSP expert know what's going on here?

Thanks a lot for reading this question,

Zt.

Zt Liu · ‎2019-07-30

Hey, after taking a further look at the map file, I just found that the build with the newer CMSIS DSP calls a sqrtf function which lies in the FLASH-ITCM region, not in ITCM-RAM!

    0x0020212c   0x0020212c   0x0000003a   Code   RO         2750    i.__hardfp_sqrtf    m_wm.l(sqrtf.o)

If I edit the sct file so that this sqrtf is executed from ITCM-RAM, the new map file shows:

    0x00000da8   0x00202ca8   0x0000003a   Code   RO         2741    i.__hardfp_sqrtf    m_wm.l(sqrtf.o)
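A quick way to sanity-check map entries like these is to classify the address by region. The sketch below assumes the usual STM32F7 layout (ITCM-RAM at 0x0000 0000, flash on the ITCM interface at 0x0020 0000, flash on AXI at 0x0800 0000), a 16 KB ITCM-RAM as in the range quoted earlier, and a 512 KB flash. The sizes and all names here are assumptions for illustration, so check them against the reference manual of the actual part.

```c
#include <stdint.h>

typedef enum {
    REGION_ITCM_RAM,
    REGION_FLASH_ITCM,
    REGION_FLASH_AXI,
    REGION_OTHER
} region_t;

/* Assumed memory map; sizes are assumptions, not taken from the project. */
#define ITCM_RAM_SIZE   0x00004000u  /* 16 KB: 0x00000000..0x00003FFF */
#define FLASH_ITCM_BASE 0x00200000u  /* flash seen through the ITCM interface */
#define FLASH_AXI_BASE  0x08000000u  /* flash seen through the AXI interface */
#define FLASH_SIZE      0x00080000u  /* 512 KB, assumed */

/* Tell which region an execution or load address from the map file is in. */
static region_t classify_addr(uint32_t addr)
{
    if (addr < ITCM_RAM_SIZE) {
        return REGION_ITCM_RAM;
    }
    if (addr >= FLASH_ITCM_BASE && addr - FLASH_ITCM_BASE < FLASH_SIZE) {
        return REGION_FLASH_ITCM;
    }
    if (addr >= FLASH_AXI_BASE && addr - FLASH_AXI_BASE < FLASH_SIZE) {
        return REGION_FLASH_AXI;
    }
    return REGION_OTHER;
}
```

Applied to the entries above, the original execution address 0x0020212c classifies as flash on ITCM, while after the sct change the execution address 0x00000da8 is in ITCM-RAM and only the load address 0x00202ca8 remains in flash.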

The resulting cycle count drops to about 125057, still about 18% more than with CMSIS DSP 1.5.2.

I've also noticed that the build with the older CMSIS DSP doesn't call this __hardfp_sqrtf routine, but only uses some inline FP operations.

__hardfp_sqrtf as linked with CMSIS DSP 1.7.0:

0x00000DA8 B510      PUSH          {r4,lr}
0x00000DAA ED2D8B02  VPUSH.64      {d8-d8}
0x00000DAE EEB18AC0  VSQRT.F32     s16,s0
0x00000DB2 EE180A10  VMOV          r0,s16
0x00000DB6 F0204000  BIC           r0,r0,#0x80000000
0x00000DBA F1C040FF  RSB           r0,r0,#0x7F800000
0x00000DBE 0FC0      LSRS          r0,r0,#31
0x00000DC0 D00A      BEQ           0x00000DD8
0x00000DC2 EE100A10  VMOV          r0,s0
0x00000DC6 F0204000  BIC           r0,r0,#0x80000000
0x00000DCA F1C040FF  RSB           r0,r0,#0x7F800000
0x00000DCE 0FC0      LSRS          r0,r0,#31
0x00000DD0 BF04      ITT           EQ
0x00000DD2 2001      MOVS          r0,#0x01
0x00000DD4 F200FB00  BL.W          __set_errno (0x002013D8)
0x00000DD8 EEB00A48  VMOV.F32      s0,s16
0x00000DDC ECBD8B02  VPOP.64       {d8-d8}
0x00000DE0 BD10      POP           {r4,pc}
0x00000DE2 0000      MOVS          r0,r0

So this is probably the crucial part causing the performance difference.
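Reading the listing, the routine seems to wrap the single VSQRT.F32 in errno handling: it tests whether the result is NaN, and if so whether the input was not NaN, and only then calls __set_errno with the value 1 (presumably EDOM). A rough C equivalent of that logic is sketched below; the function names are hypothetical and this is my reading of the disassembly, not the library source. The hardware square root itself is passed in as `r` so the sketch stays independent of any math library.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Build a float from its raw IEEE-754 bit pattern. */
static float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Same test as the BIC / RSB / LSRS sequence: clear the sign bit and check
 * whether the remaining bits exceed the infinity pattern 0x7F800000. */
static int is_nan_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (u & 0x7FFFFFFFu) > 0x7F800000u;
}

/* x is the argument, r is what VSQRT.F32 produced for it.
 * Sets errno only when the result is NaN although the input was not. */
static float sqrtf_epilogue(float x, float r)
{
    if (is_nan_bits(r) && !is_nan_bits(x)) {
        errno = EDOM;  /* the listing passes 1 to __set_errno */
    }
    return r;
}
```

So every call pays for the branch into the routine, the PUSH/VPUSH and VPOP/POP pairs, and the register moves and NaN test around one VSQRT, where an inlined version would presumably need little more than the VSQRT itself. Note also that the BL target __set_errno (0x002013D8) is still in flash, although it is only reached on a domain error.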

Any ideas?

.zt

Amel NASRI · ‎2019-08-16

Hello @Community member ,

To get rid of this issue, you need to add sqrtf.o to your execution region in the scatter file.

For example, in M7-ITCM_rwDTCM.sct:

  EXEC_REGION_ITCM 0x00000000 0x10000  
  { 
    utilities.o (+RO)
    stm32h7xx_it.o (+RO)	
    main.o (+RO-CODE)
    arm_bitreversal2.o (+RO-CODE)
    arm_cfft_f32.o (+RO-CODE)
    arm_cfft_radix8_f32.o (+RO-CODE)
    arm_cmplx_mag_f32.o (+RO-CODE)
    arm_max_f32.o  (+RO-CODE)
    sqrtf.o (+RO-CODE)		
	
    ; Place also const data in ITCM-RAM. 
    main.o (+RO-DATA)
    arm_common_tables.o (+RO-DATA)
    arm_const_structs.o (+RO-DATA)
  }

Please test this, and let me know if the number of cycles is reduced, as expected.

thanks,

Amel
