Big difference in execution time with data in DTCM, AXI SRAM, or flash
Hello,
I am working with an STM32H7 on a Nucleo board, using Atollic TrueSTUDIO.
For my project I wrote a small assembler routine to process data (two arrays of 1400 int16 values each). One run of this routine fetches 3 int16 values from each array, so 6 memory accesses in total. The whole routine has about 50 assembler instructions. I placed the arrays in different memories and see big differences in execution time:
arrays in flash: 377 µs
arrays in AXI SRAM: 334 µs
arrays in DTCM: 154 µs
The CPU clock is 400 MHz.
I understand that flash accesses are slow, but I did not expect 6 flash accesses to take as long as 50 assembler instructions.
Regarding AXI SRAM vs. DTCM, I read that all SRAM accesses have 0 wait states, so I don't understand why AXI SRAM is only a little faster than flash and much slower than DTCM.
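To put the numbers in perspective, here is a rough conversion from the measured times to CPU cycles. It is only a sketch: the function names are made up for illustration, and it assumes the 6 loads are repeated about 1400 times (once per array element), which is an approximation because the exact count depends on the offset.

    #include <stdint.h>

    /* Hypothetical helpers, for illustration only. */

    /* Total CPU cycles for a measured time: microseconds * MHz = cycles. */
    static inline double cycles_total(double time_us, double clock_mhz)
    {
        return time_us * clock_mhz;
    }

    /* Average cycles per pass, ASSUMING `passes` repetitions of the 6 loads
       (about 1400 here; the exact count is not known). */
    static inline double cycles_per_pass(double time_us, double clock_mhz,
                                         unsigned passes)
    {
        return cycles_total(time_us, clock_mhz) / (double)passes;
    }

    /* Extra cycles per load compared with a baseline placement (e.g. DTCM). */
    static inline double extra_cycles_per_load(double time_us, double base_us,
                                               double clock_mhz, unsigned passes,
                                               unsigned loads_per_pass)
    {
        return (time_us - base_us) * clock_mhz
               / ((double)passes * (double)loads_per_pass);
    }

Under that assumption, cycles_per_pass(154, 400, 1400) is 44 cycles for DTCM, about 95 for AXI SRAM and about 108 for flash, so each load from AXI SRAM costs roughly 8 to 9 cycles more than the same load from DTCM.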
What options do I have? Did I miss some setting?
Here are the instructions that fetch data:
ldrsh r5, [r1, #(offset * 2)]
ldrsh r6, [r1, #(-offset * 2)]
sub r5, r5, r6
add r2, r2, r5
ldrsh r5, [r3, #(offset * 2)]
ldrsh r6, [r3, #(-offset * 2)]
sub r5, r5, r6
add r4, r4, r5
ldrsh r5, [r1], #2
ldrsh r6, [r3], #2
I know that it is not optimal to use data right after fetching it, but I have nothing to put in between, and it is the same for all three experiments.
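For readers who prefer C, the access pattern of the instructions above corresponds roughly to the following sketch. The struct and function names are hypothetical, the register mapping is given in the comments, and it covers only the six loads and the two accumulations, not the rest of the routine.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical state mirroring the registers used by the loads. */
    typedef struct {
        const int16_t *a;  /* r1: pointer into the first array  */
        const int16_t *b;  /* r3: pointer into the second array */
        int32_t acc_a;     /* r2: accumulator for the first array  */
        int32_t acc_b;     /* r4: accumulator for the second array */
        int16_t cur_a;     /* r5 after the final load */
        int16_t cur_b;     /* r6 after the final load */
    } fetch_state;

    /* One pass: 3 loads per array, 6 in total. `offset` counts elements;
       the assembler scales it by 2 to get a byte offset. */
    static inline void fetch_step(fetch_state *s, ptrdiff_t offset)
    {
        s->acc_a += s->a[offset] - s->a[-offset];  /* ldrsh, ldrsh, sub, add */
        s->acc_b += s->b[offset] - s->b[-offset];  /* ldrsh, ldrsh, sub, add */
        s->cur_a = *s->a++;                        /* ldrsh r5, [r1], #2 */
        s->cur_b = *s->b++;                        /* ldrsh r6, [r3], #2 */
    }
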
The assembler routines run from ITCM; the instruction cache is on and the data cache is off.
Currently the data is constant, but my next step is to fill the arrays from the ADC via DMA, so I guess DTCM is not an option. Or is there a way to copy to DTCM via DMA?
Thanks for any help and hints
Martin
