CCM-RAM performs worse than FLASH because of PC-relative loads

Albi G. · ‎2020-12-07

Hi guys,

I am trying to optimize interrupt latency and execution speed on an STM32G474.

The natural thing is to use the CCM-RAM for what it is intended for... but while it potentially lowers the interrupt latency, my code actually performs worse when executed from CCM-RAM:

Here is why:
What you see is that GCC has stored all needed constants behind its code in the text section.

You can also see the dooming

ldr r3, [pc, #44]

Such instructions cause a pipeline stall when this function is placed in CCM-RAM because, as far as I understand, the flash with its accelerator has two separate buses (one for instructions, one for data), but when executing from CCM-RAM, instructions and data go through a single bus, which is sub-optimal (it stalls either the data or the instruction fetch).

So the correct way to go here would be to have the actual code in CCM-RAM, but the local constants would need to be in SRAM.

I cannot do that manually because.. well actually it is just really inconvenient.

Does anyone know how to configure GCC so that local constants don't end up in .text but in .data or some other section (.xyz)?

Overall my interrupt has 11 pc-relative loads - and when executed from CCM-RAM, the interrupt is around those 11 clocks slower. ***** :(

Andreas Bolsch · ‎2020-12-07

As your problem is similar to the one with PCROP, you might find this useful:

https://community.st.com/s/question/0D50X00009XkXPZSA3/gcc-and-pcrop

waclawek.jan · ‎2020-12-07

> CCM-RAM

Which address? And where are data and stack located?

I see some function calls there. Eliminate them.

JW

Albi G. · ‎2020-12-08

@Andreas Bolsch, that was kind of what I was looking for. I knew those options existed, I read about them some time ago but couldn't find the resources anymore. In case the site ever goes offline or the link becomes invalid: "-mpure-code" or "-mslow-flash-data"
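For readers who land here later, a minimal sketch of the kind of function this affects. The section name `.ccmram` and the function `scale_sample` are assumptions for illustration; the real section name depends on your linker script. Built for a Cortex-M target with `-mslow-flash-data` or `-mpure-code`, GCC is expected to build the 32-bit constant with a movw/movt pair instead of a PC-relative ldr from a literal pool behind the function.

```c
#include <stdint.h>

/* Assumption: the linker script provides a ".ccmram" output section that is
 * copied to CCM-RAM at startup. On a non-ARM host the attribute is dropped
 * so the example still compiles and runs. */
#if defined(__arm__)
#define CCM_CODE __attribute__((section(".ccmram")))
#else
#define CCM_CODE
#endif

/* Hypothetical hot-path helper. 0x12345678 does not fit a Thumb-2 immediate,
 * so without -mslow-flash-data / -mpure-code the compiler would normally
 * place it in a literal pool in .text and fetch it with ldr rX, [pc, #imm]. */
CCM_CODE uint32_t scale_sample(uint32_t sample)
{
    return sample ^ 0x12345678u;
}
```

Checking the disassembly (for example with objdump) is the only reliable way to confirm that no `ldr ... [pc, ...]` is left in the function.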

The CCM-RAM-code placement is now equal or maybe even one cycle faster than the flash version. That is overall some progress.

But I still think this is not as good as it could be. But I honestly don't know. Let's discuss:

Now, from a throughput standpoint, creating one 4-byte value with two 4-byte instructions seems wasteful. Anyway, I don't care about code size right now. The fact of the matter is that the value is complete after 2 cycles.

On the other hand, a load instruction also takes 2 cycles until the result is ready (from SRAM). However, multiple loads are pipelined, so multiple ldr instructions or even one ldm instruction would be even faster [(1+N) cycles for N constants].

The requirement would be that this constant pool needs to be in SRAM, somewhere in the .data section for example (or in its own section).

So what I imagine as optimal would be (pseudo):

movw rx, #:lower16:constant_pool
movt rx, #:upper16:constant_pool
ldm  rx, {r1, r2, r3}

Any idea how to get to this point?

The constants obviously need their own section; the pointer to it must be hard-coded, and the actual data must be moved to that section.
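One way to approximate this from C, as a rough sketch rather than a guaranteed recipe: put the constants into a non-const struct so they land in .data (SRAM) and read them through a single base pointer. The names `isr_constants`, `isr_pool`, `isr_compute` and the section names `.ccmram` and `.data.isr_pool` are made up for illustration. Whether GCC actually emits a movw/movt pair followed by one ldm depends on the optimization level and flags, so the disassembly needs checking.

```c
#include <stdint.h>

/* Assumption: ".ccmram" exists in the linker script. ".data.isr_pool" is
 * normally collected by the usual *(.data*) pattern and therefore ends up
 * in SRAM. On a non-ARM host both attributes are dropped. */
#if defined(__arm__)
#define CCM_CODE  __attribute__((section(".ccmram")))
#define SRAM_POOL __attribute__((section(".data.isr_pool")))
#else
#define CCM_CODE
#define SRAM_POOL
#endif

/* Hypothetical constants used by the interrupt handler. */
struct isr_constants {
    uint32_t gain;
    uint32_t offset;
    uint32_t limit;
};

/* Deliberately NOT const: a const object would go to .rodata (flash), and
 * the compiler could fold the values back into immediates or literals. */
SRAM_POOL struct isr_constants isr_pool = { 3u, 100u, 1000u };

/* All three constants are read relative to one base address, which gives
 * the compiler the chance to use consecutive loads or a single ldm. */
CCM_CODE uint32_t isr_compute(uint32_t sample)
{
    const struct isr_constants *p = &isr_pool;
    uint32_t v = sample * p->gain + p->offset;
    return (v > p->limit) ? p->limit : v;
}
```

The trade-off is that the values are no longer compile-time constants from the optimizer's point of view, so anything that benefits from constant folding should stay out of the pool.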

Edit: I just moved all the constants I could to memory, and that helped again. Now all the constants that are left are register addresses. Don't know how to solve that :(
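A possible direction for the remaining register addresses, offered only as an untested idea: keep the peripheral base pointer in the same kind of SRAM pool, so the address is loaded from SRAM like any other constant and the individual registers are reached by small offsets from it. The `fake_timer` layout, `isr_regs` and `isr_write_duty` below are hypothetical; on a real target the pointer would be initialised with the CMSIS peripheral pointer instead of the host stand-in.

```c
#include <stdint.h>

/* Hypothetical register block, for illustration only. */
struct fake_timer {
    volatile uint32_t CR;
    volatile uint32_t SR;
    volatile uint32_t CCR1;
};

/* Pool kept in .data (SRAM): it holds the register block address next to
 * the other constants, so no address literal is needed in the code. */
struct isr_reg_pool {
    struct fake_timer *timer;
    uint32_t duty_max;
};

/* Stand-in peripheral so the sketch runs on a PC. */
static struct fake_timer host_timer;

struct isr_reg_pool isr_regs = { &host_timer, 1000u };

void isr_write_duty(uint32_t duty)
{
    const struct isr_reg_pool *p = &isr_regs;
    struct fake_timer *t = p->timer;   /* base address read from SRAM */

    if (duty > p->duty_max) {
        duty = p->duty_max;
    }
    t->CCR1 = duty;                    /* reached via base + small offset */
    t->SR = 0u;
}
```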

@waclawek.jan: you kind of missed the point. But you have a legitimate concern: having the stack in CCM-RAM would be just as fatal to performance. I took care of that; the stack is in SRAM. Function calls are inconsequential: the worst case is that external functions have the same performance (aside from a target address load maybe). Anyway, the function calls are not in the hot execution path.