
big difference in execution time with data in DTCM, AXI, flash

Mr_M_from_G
Senior

Hello,

I am working with an STM32H7 Nucleo board and Atollic TrueSTUDIO.

For my project I wrote a small assembler routine that processes two arrays of 1400 int16 each. One run of this routine fetches 3 int16 from each array, so 6 memory accesses in total, and the whole routine is about 50 assembler instructions. I placed these arrays in different memories and see big differences in execution time:

arrays in flash: 377 µsec

arrays in AXI SRAM: 334 µsec

arrays in DTCM: 154 µsec

CPU clock is 400 MHz

I understand that flash accesses are slow, but I did not think that 6 flash accesses could take as long as 50 assembler instructions.

Regarding AXI vs. DTCM, I read that all SRAM accesses have 0 wait states, so I don't understand why AXI SRAM is only a little faster than flash and much slower than DTCM.

What options do I have? Did I miss some setting?

Here are the instructions that fetch data:

  ldrsh   r5, [r1, #(offset * 2)]     @ array1[i + offset]
  ldrsh   r6, [r1, #(-offset * 2)]    @ array1[i - offset]
  sub     r5, r5, r6
  add     r2, r2, r5                  @ accumulate difference for array1
  ldrsh   r5, [r3, #(offset * 2)]     @ array2[i + offset]
  ldrsh   r6, [r3, #(-offset * 2)]    @ array2[i - offset]
  sub     r5, r5, r6
  add     r4, r4, r5                  @ accumulate difference for array2
  ldrsh   r5, [r1], #2                @ array1[i], post-increment r1
  ldrsh   r6, [r3], #2                @ array2[i], post-increment r3

I know that it is not optimal to use data right after fetching it, but I have nothing to put in between, and it is the same for all three experiments.

The assembler routine runs from ITCM; the instruction cache is on, the data cache is off.

Currently the data is constant, but my next step is to fill the arrays from the ADC via DMA, so I guess DTCM is not an option, or is there a way to copy data to DTCM via DMA?

Thanks for any help and hints

Martin

6 REPLIES
Uwe Bonnes
Principal II

Your setup is not clear. Where do these µsec figures come from, and at what clock speed? Is it 2 (arrays) * 1400 runs of the assembler snippet plus control structures? An AXI RAM access probably needs a wait state to cross into the AXI domain, so double the duration of a DTCM access seems sensible. Flash access is cached, so some penalty compared to AXI SRAM access also seems sensible. So the relation between the execution times looks sensible.

By the way, look at the Cortex-M7 single-instruction, multiple-data (SIMD) instructions to speed up your task.
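
A minimal sketch of what that could look like with the CMSIS intrinsics (not your exact routine; the helper name, the even sample count and the overflow behaviour are my assumptions):

  #include <stdint.h>
  #include <string.h>
  #include "stm32h7xx.h"   /* brings in core_cm7.h and the __SSUB16 / __SMLAD intrinsics */

  /* Accumulate (a[i + off] - a[i - off]) over n samples, two int16 lanes per pass.
   * Assumes n is even; overflow of the 32-bit accumulator is not handled. */
  static int32_t diff_accumulate(const int16_t *a, int off, int n)
  {
      uint32_t acc = 0;
      for (int i = 0; i < n; i += 2) {
          uint32_t ahead, behind;
          memcpy(&ahead,  &a[i + off], 4);       /* two samples at +off (unaligned load, OK on M7) */
          memcpy(&behind, &a[i - off], 4);       /* two samples at -off                            */
          uint32_t d = __SSUB16(ahead, behind);  /* two 16-bit differences in one instruction      */
          acc = __SMLAD(d, 0x00010001u, acc);    /* acc += d_low + d_high                          */
      }
      return (int32_t)acc;
  }

Loading two samples per access also halves the number of loads, which matters more when the data sits in AXI SRAM or flash.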

Mr_M_from_G
Senior

Hello Uwe,

sorry, I forgot to mention the CPU clock; it is 400 MHz. I added it to my first post.

And yes, it is 2 arrays of 1400 int16 each. r1 points into array1 and r3 points into array2; I fetch 3 values, spaced by offset, from each array to compute some result values.

I have two IT blocks and a branch in my assembler routine, so of the 50 instructions I mentioned, about 40 are executed per run. That gives 1 / 400 MHz * 40 instructions/run * 1400 runs = 140 µs as a rough estimate of the minimum execution time (there is a loop for the 1400 runs around the asm routine, and some instructions take two cycles...).

So 154 µs with DTCM is a good value. But 334 µs is 180 µs more; spread over 1400 runs that is about 128 ns per run = 51 clock cycles per run = 8.5 extra clock cycles per memory access, only because the data is in AXI SRAM instead of DTCM (for flash it is 10.6 extra clock cycles per memory access compared to DTCM).

Meanwhile I found that it may be possible to use MDMA to copy from the ADC to DTCM. I would appreciate an application note or example code (without HAL, please) for MDMA. Some parts of the reference manual are not clear to me, specifically the role DMA1 plays when setting up a transfer with MDMA; it is mentioned several times in the MDMA chapter.
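
A simpler fallback I could imagine until the MDMA details are clear: DMA1/DMA2 cannot reach DTCM, but they can reach D2 SRAM, so the ADC DMA could fill a ping-pong buffer there and each finished half could be copied into DTCM from the DMA interrupt. All names below (buffer sections, the stream) are placeholders; MDMA, triggered by the DMA transfer-complete flag, could later take over that copy.

  #include <stdint.h>
  #include <string.h>
  #include "stm32h7xx.h"

  #define N 1400

  /* Section names are placeholders; they must match entries in the linker script. */
  static int16_t adc_buf[2 * N] __attribute__((section(".ram_d2")));  /* filled by DMA1 (circular) */
  static int16_t dtcm_buf[N]    __attribute__((section(".dtcm")));    /* processed by the CPU      */

  void DMA1_Stream0_IRQHandler(void)              /* stream 0 is an assumption */
  {
      if (DMA1->LISR & DMA_LISR_HTIF0) {          /* first half of adc_buf finished  */
          DMA1->LIFCR = DMA_LIFCR_CHTIF0;
          memcpy(dtcm_buf, &adc_buf[0], N * sizeof(int16_t));
      }
      if (DMA1->LISR & DMA_LISR_TCIF0) {          /* second half of adc_buf finished */
          DMA1->LIFCR = DMA_LIFCR_CTCIF0;
          memcpy(dtcm_buf, &adc_buf[N], N * sizeof(int16_t));
      }
  }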

Thanks again for any help and hints

Martin

Singh.Harjit
Senior II

I don't know if you have the cache turned on for the AXI SRAM. If you do, you should look at the data access pattern so that a read of one item pulls in other items you need.

For example, you might want to do a bunch of loads together and then use the data, without dependencies between them. This way, the fetching done by the loads can overlap with the use of the data. Also think about organizing your data so that it fits on cache-line boundaries.

Try writing your code in C, compiling it with optimization on, and looking at the generated assembly to get some ideas. These days, compilers are very clever and account for microarchitectural considerations.
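
For instance, a rough C version of one step (the names are made up, based on your snippet) that groups the loads before the arithmetic; compile it with -O2/-O3 and read the listing to see how the compiler schedules it:

  #include <stdint.h>

  /* One step of the inner loop, roughly matching the posted assembler: issue the
   * loads together, then consume them, then advance both pointers by one sample. */
  static inline void step(const int16_t **pa, const int16_t **pb, int off,
                          int32_t *acc_a, int32_t *acc_b)
  {
      const int16_t *a = *pa, *b = *pb;
      int16_t a_fwd = a[ off];     /* all four loads issued back to back ...       */
      int16_t a_bwd = a[-off];
      int16_t b_fwd = b[ off];
      int16_t b_bwd = b[-off];
      *acc_a += a_fwd - a_bwd;     /* ... so the arithmetic does not wait on each  */
      *acc_b += b_fwd - b_bwd;     /* individual load                              */
      *pa = a + 1;                 /* like ldrsh r5, [r1], #2                      */
      *pb = b + 1;                 /* like ldrsh r6, [r3], #2                      */
  }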

Mr_M_from_G
Senior

Hello Singh.Harjit

thanks for your answer.

Can you please explain in more detail how to set up caching for AXI SRAM? Currently I have only the instruction cache enabled; I simply call SCB_EnableICache(). Do you mean I should also use SCB_EnableDCache()?

I asked a few questions about the cache and the MPU before; there are still some things that are not clear to me:

https://community.st.com/s/question/0D50X0000A1lyJ6SQI/please-give-some-help-on-l1-cache-and-mpu

Yes, I have a number of ideas to further speed up my code, and I also derived my asm code from optimized compiler output of C code.

Still, I think it is worth taking a look at these access times, and I wonder if this is normal for the STM32H7. That would mean that any truly random access (using a lookup table, processing communication data, ...) to any memory other than DTCM stalls execution for 8 to 10 clock cycles. This seems to go deep into the hardware architecture, which is not documented in the reference manual (at least I have not found it so far). I would appreciate a statement from one of the STM32 gurus here in the forum.

Thanks for any help

Martin

Singh.Harjit
Senior II

Please search this site for how to enable the data cache.

Also, take a look at ST application notes AN4838 and AN5001.
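
Roughly, the CMSIS side looks like this (the buffer name and size are placeholders; the invalidate only becomes relevant once a DMA master writes into a cached buffer):

  #include <stdint.h>
  #include "stm32h7xx.h"   /* CMSIS device header, includes core_cm7.h */

  extern int16_t adc_buf[1400];   /* assumed DMA target in a cacheable RAM, e.g. AXI SRAM */

  void cache_setup(void)
  {
      SCB_EnableICache();   /* already used above                              */
      SCB_EnableDCache();   /* also cache data accesses; DTCM/ITCM are not
                               cached, they are already zero-wait-state         */
  }

  void on_dma_buffer_filled(void)
  {
      /* Before reading DMA-written data through the D-cache, drop stale lines.
       * The buffer should be 32-byte aligned and a multiple of 32 bytes long. */
      SCB_InvalidateDCache_by_Addr((uint32_t *)adc_buf, sizeof(int16_t) * 1400);
  }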

Mr_M_from_G
Senior

Hello,

after collecting some experience, my current conclusion is:

If you want it reliably and deterministically fast, put your code in ITCM and your data in DTCM.
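
In case it helps others, pinning the data into DTCM with GCC boils down to a section attribute plus a matching output section in the linker script (the section name below is just an example and must match your .ld file):

  #include <stdint.h>

  /* ".dtcm_data" is an example name; the linker script needs a matching output
   * section placed in the DTCMRAM memory region, e.g.
   *   .dtcm_data (NOLOAD) : { *(.dtcm_data) } > DTCMRAM
   */
  static int16_t array1[1400] __attribute__((section(".dtcm_data")));
  static int16_t array2[1400] __attribute__((section(".dtcm_data")));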

I'll be glad to hear about others' experiences.

Thanks for your help

Martin