
memset() execution slower at some addresses

DApo.1
Associate II

Hello,

After some investigation I found that memset has different execution times depending on where it is placed in flash. Data and instruction caches are off! The micro used is an STM32H743XI.

The function is called with the following arguments: memset(dummy, 0, 64)

Its execution time is ~5us when function is placed at:

..., 0x8040c34, 0x8040c54, 0x8040c74, ...

Its execution time is ~1us when function is placed at:

..., 0x8040c3c, 0x8040c44, 0x8040c4c , 0x8040c5c, 0x8040c64, 0x8040c6c ...

Any ideas?

Thanks

TDK
Guru

Where is dummy located?

How are you measuring execution time?

If you feel a post has answered your question, please click "Accept as Solution".
DApo.1
Associate II

dummy is located in RAM

Time is measured via a free-running timer used as a clock. Its CNT is read before and after execution.

KnarfB
Principal III

How do you "place" a function? gcc has built-in versions of memset etc. and may decide to inline/unroll its implementation for small values of size. Take a look at the assembler code.

DApo.1
Associate II

This is the asm code. It is the original gcc byte-wise memset, nothing strange, and it is the same no matter where memset is placed:

08040c54:  add    r2, r0

08040c56:  mov    r3, r0

08040c58:  cmp    r3, r2

08040c5a:  bne.n  0x8040c5e <memset+10>

08040c5c:  bx     lr

08040c5e:  strb.w r1, [r3], #1

08040c62:  b.n    0x8040c58 <memset+4>

To manipulate the address of memset I just add dummy code somewhere else.

Interestingly, the addresses where execution is slower are spaced 0x20 apart.

> Data and instruction cash are off! 

It's spelled "cache" and I don't believe they are off.

Which RAM? How are clocks and FLASH latency set?

Try 64000 bytes.

JW

DApo.1
Associate II

>>It's spelled "cache"

Thx for the spelling, sorry for the mistake, corrected in the description.

>> I don't believe they are off.

SCB->CCR = 0x40200, read right before the memset call, clearly shows that both caches are off. If you mean something else, please be more specific.

>>Which RAM?

dummy is a locally defined array, placed in AXI SRAM. But its address is the same in both cases, so I do not see why this matters.

>>Try 64000 bytes

Here are the measurements for different sizes:

bytes (B)   slow (us)   fast (us)
64          5.11        0.99
640         45.4        5.22
6400        120.88      48.42
64000       220.72      152.74

>> clocks

The clock config can be seen in the attached picture, but it is the same in both cases, so I cannot get your point. It does not look like this could be the reason.

KnarfB
Principal III

Haven't used an H7, but the Cortex-M7 has a quite complex microarchitecture (6-stage pipeline, dual-instruction issue). You could use the DWT counters to get more info about what's going on. The ratios between your figures vary a lot, hmm.
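A configuration sketch of those DWT counters (CMSIS-style names, assuming the stm32h7xx device header; this is target-only register setup, shown as a fragment): CYCCNT counts raw cycles, while the event counters attribute extra cycles to load/store stalls and instruction overhead:

```c
#include "stm32h7xx.h"   /* assumed device header; pulls in core_cm7.h */

/* Enable the DWT profiling counters (target-only register setup). */
static void dwt_profiling_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* unlock DWT/ITM   */
    DWT->CYCCNT  = 0;
    DWT->CTRL   |= DWT_CTRL_CYCCNTENA_Msk            /* cycle counter    */
                |  DWT_CTRL_LSUEVTENA_Msk            /* LSU stall events */
                |  DWT_CTRL_CPIEVTENA_Msk;           /* extra CPI cycles */
}
```

Read DWT->CYCCNT before and after the memset call; note that DWT->LSUCNT and DWT->CPICNT are only 8 bits wide and wrap quickly, so sample them over short windows.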

>>> I don't believe they are off.

> SCB-> CCR = 0x40200, read before memset call clearly show that both the caches are off. If you mean something else pls specify more detailed.

No, I meant this.

Okay, so this may be the more complex case (as compared to the instruction cache being switched on). As there's no caching, the processor requests each instruction word from FLASH. Instructions are 16 bits wide and go through a 6-stage pipeline to the asymmetric dual-issue execution unit (as KnarfB noted above); the processor fetch is probably 32 bits wide (I am too lazy to look it up), and it goes through the 64-bit AXI bus to the FLASH controller. FLASH is 256 bits wide (one qword) and is accessed through a 3-qword read queue; see FLASH read operations / Read operation overview in the RM. Add branch prediction to the mix. The detailed behaviour of all the components mentioned above is simply undocumented.

I would expect the behaviour to vary with position within the 256-bit window, and also to depend on previous execution state, with a short sequence making this more pronounced - exactly as you've experienced:

>Here are the measurements for different sizes:

>bytes (B)   slow (us)   fast (us)
>64          5.11        0.99
>640         45.4        5.22
>6400        120.88      48.42
>64000       220.72      152.74

where you can see that not only does the relative difference decrease, but the execution time per transferred word decreases too, as the number of loop iterations increases.

>>>Which RAM?

>dummy is a locally defined array, placed in AXI SRAM. But its address is the same in both cases, so I do not see why this matters.

Writing to RAM goes through the AXI matrix too. That's by no means a passive interconnect; it's a beast, again poorly documented. It may well be that writes are delayed for some reason (e.g. to group them into dwords), that this then slows down subsequent writes, and that this is somehow dependent on the relative phase between the various clocks involved; but this is of course pure speculation and doesn't sound like the main reason for the difference here.

The "on paper" high processor speed brings higher number-crunching capability, but the real-time-control aspect generally stays the same as it used to be.

In Cortex-M7, generally, the TCM buses (and associated memories) are intended to bring down the uncertainties/jitter; but they of course have their own set of issues, and the uncertainties inherent in the processor (stemming from dual-issue, branch prediction etc.) remain.

Welcome to the wonderful world of 32-bit mcus.

JW

Just a note, this change

>6400   120.88  48.42

>64000   220.72   152.74

is suspiciously small - hasn't the timer overflowed there?

JW