Predicting the ART Accelerator Flash memory read operations on the STM32F405RGT

Patrick G · 2021-12-03

I have set up FLASH_ACR with all caches and prefetch disabled, and with 5 wait states (WAIT_CYCLES). My goal is to predict when the flash read operations happen and how many processor cycles they take.
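
For reference, a minimal sketch of that configuration, assuming the FLASH_ACR bit layout given in RM0090 for the STM32F405/407 (LATENCY in bits 2:0, PRFTEN, ICEN and DCEN in bits 8, 9 and 10). The helper name `acr_no_accel` and the macro names are made up for illustration; the function only computes the register value, so it can be checked on a host before it is used on the target.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions assumed from RM0090 (STM32F405/407 FLASH_ACR). */
#define ACR_LATENCY_MASK 0x7u        /* bits 2:0, number of wait states */
#define ACR_PRFTEN       (1u << 8)   /* prefetch enable */
#define ACR_ICEN         (1u << 9)   /* instruction cache enable */
#define ACR_DCEN         (1u << 10)  /* data cache enable */

/* Hypothetical helper: take the current FLASH_ACR value and return a new
 * value with prefetch and both caches disabled and the requested number
 * of wait states. All other bits are left unchanged. */
uint32_t acr_no_accel(uint32_t current, uint32_t wait_states)
{
    assert(wait_states <= ACR_LATENCY_MASK);
    current &= ~(ACR_LATENCY_MASK | ACR_PRFTEN | ACR_ICEN | ACR_DCEN);
    return current | wait_states;
}
```

On the target this would presumably be applied as `FLASH->ACR = acr_no_accel(FLASH->ACR, 5);` with the CMSIS device header, followed by reading FLASH_ACR back to confirm the new value.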

I provided some example code below, but my goal is to get a general understanding, not just for this specific example.

f000 fdb7    bl     <entry>
f004 0101    and.w  r1, r4, #1
1e4b         subs   r3, r1, #1
426c         negs   r4, r5
4249         negs   r1, r1
ea01 0005    and.w  r0, r1, r5
401c         ands   r4, r3
4044         eors   r4, r0
f000 fdb4    bl     <exit>

According to the docs: "Each Flash memory read operation provides 128 bits from either four instructions of 32 bits or 8 instructions of 16 bits according to the program launched."

My current understanding is that every branch results in a flash read operation. When returning from the first "bl" (line 1), the ART therefore reads 128 bits. This would cover the instructions on lines 2-7, inclusive. Between line 7 and line 8, another flash read operation would then need to happen.

In total, this would cost me an extra 2*5 cycles on top of the cycles my 7 instructions (AND.W, SUBS, NEGS, NEGS, AND.W, ANDS, EORS) take. This result is unexpected as it means that I spend over half of my cycles on flash read operations.
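
The arithmetic behind that estimate can be written down as a small sketch. It assumes one cycle per instruction and a full stall of `wait_states` cycles per flash read, which is a simplification of my model above and not something the reference manual confirms; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model (assumption): total cycles are the instruction cycles
 * plus one stall of wait_states cycles for every flash read. */
uint32_t estimate_cycles(uint32_t instr_cycles, uint32_t flash_reads,
                         uint32_t wait_states)
{
    return instr_cycles + flash_reads * wait_states;
}

/* Share of the total spent waiting on flash, in whole percent (rounded down). */
uint32_t flash_share_percent(uint32_t instr_cycles, uint32_t flash_reads,
                             uint32_t wait_states)
{
    uint32_t total = estimate_cycles(instr_cycles, flash_reads, wait_states);
    assert(total > 0u);
    return (100u * flash_reads * wait_states) / total;
}
```

With 7 instruction cycles, 2 flash reads and 5 wait states, this model gives 17 cycles in total, of which 10 (about 58%) are flash stalls.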

waclawek.jan · 2021-12-03

> When returning from the first "bl" (line 1), the ART therefore reads 128 bits. This would cover the instructions on lines 2-7, inclusive.

While the RM does not say so, it's unlikely that the 128-bit read starts at an arbitrary address; it will most probably be a naturally aligned row read, i.e. one FLASH read would probably cover all those instructions only if the first of them is 128-bit aligned.
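
To illustrate the alignment point, here is a minimal sketch that counts how many 16-byte (128-bit) rows a piece of code touches, assuming the reads really are naturally aligned (which, as said, the RM does not confirm). The function name and the addresses in the note below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FLASH_ROW_BYTES 16u  /* one 128-bit flash read */

/* Number of naturally aligned 16-byte rows touched by code that
 * occupies the byte range [start, start + len). */
uint32_t rows_touched(uint32_t start, uint32_t len)
{
    assert(len > 0u);
    uint32_t first = start / FLASH_ROW_BYTES;
    uint32_t last  = (start + len - 1u) / FLASH_ROW_BYTES;
    return last - first + 1u;
}
```

The six instructions on lines 2-7 occupy 16 bytes. If the first of them sat at a 16-byte aligned address such as 0x08000100, `rows_touched(0x08000100, 16)` gives 1; at 0x08000104 the same 16 bytes straddle two rows and the result is 2, so one more read than the original estimate.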

> This result is unexpected as it means that I spend over half of my cycles on flash read operations.

Why would that be unexpected? Your wishes won't defeat reality. It's exactly the purpose of prefetch to avoid this.

Besides prefetch on the FLASH interface (which you can control) there's also prefetch in the processor, plus its execution pipeline. Both make timing analysis more complex. There's very little benefit in making this sort of analysis, as there are literally dozens of circumstantial influences, which are almost impossible to predict.

JW

Patrick G · 2021-12-06

> While the RM does not say so, it's unlikely that the 128-bit read starts at an arbitrary address; it will most probably be a naturally aligned row read, i.e. one FLASH read would probably cover all those instructions only if the first of them is 128-bit aligned.

Do you know where I can find more accurate information on the ART accelerator? I searched for it on st.com and the web, but I could not find anything.

> Besides prefetch on the FLASH interface (which you can control) there's also prefetch in the processor, plus its execution pipeline. Both make timing analysis more complex. There's very little benefit in making this sort of analysis, as there are literally dozens of circumstantial influences, which are almost impossible to predict.

I know that timing analysis is very complex and often treated as a black art, just like compiler design. The difference is that a processor operates very deterministically under some assumptions. My use case for the timing analysis is very specific and lies in the area of power analysis.

I deactivated all caches and prefetch operations I could find. Please let me know if anyone found a good configuration for the STM32F405RGT to do a timing analysis.

Also, my code is set up to not include any branches except the entry and end. The code is typically only around 10 lines of assembly code. For this very limited scope, it should be possible to predict the behaviour of the processor, I am just not 100% sure how to.

waclawek.jan · 2021-12-06

> Do you know where I can find more accurate information on the ART accelerator?

Probably nowhere (except in internal ST design documents).

> The difference is that a processor operates very deterministically under some assumptions.

The processor operates entirely deterministically; this is a fully synchronous design.

The problem is, it's a hellishly complex deterministic synchronous design. There are too many inputs to the equation and there are too many circumstances and they are too complex. In other words, there are no simple answers, and the complex answers are too expensive to be provided generally.

I'm sure ST is absolutely willing to fire up their extremely expensive simulators to give you exact answers if you have exact enough questions, provided you give them enough incentive, as expressed in $M++ in purchases.

Sorry, it just is what it is. Try to accept it.

JW