
SPI DMA fails (only) with ICache enabled, why?

waldmanns
Associate II

I have a project which uses SPI DMA, and the SPI DMA appears to fail when I enable the instruction cache (ICache! not DCache), but otherwise everything appears to work perfectly fine. I observe this on both an F722ZE and a G491RE.

I've been working on this for 3 weeks, and I really can't figure it out, nor do I even understand what could possibly be the issue, given that it is related to enabling the ICache.

more background & details:

I have a relatively complex code base (ca. 150 kB) for a gimbal controller project (STorM32), which came to life ca. 8 years ago and which I have been working on ever since (i.e., I think the code is quite mature and stable). The code is for an F103RC and was based on SPL. I recently ported it to STM32CubeIDE and HAL/LL, have had it running for several weeks now, and it works perfectly fine.

However, I have also ported the code to run on the Nucleo-F722ZE and Nucleo-G491RE, and here I observe that when I enable the ICache, the sensor connected to the SPI fails to work, from which I conclude that the issue is related to SPI DMA. This is the case for both targets, despite their different system structures.

(In the case of the F722ZE I call SCB_EnableICache() to enable the ICache, and in the case of the G491RE I call __HAL_FLASH_INSTRUCTION_CACHE_DISABLE() to disable the ICache.) (These are the same functions used/not used by HAL.)
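For reference, a minimal sketch of those two calls, guarded by the usual HAL device defines (the helper name icache_config is only for illustration; in my project the calls sit in the init code):

    #if defined(STM32F722xx)
      #include "stm32f7xx_hal.h"
    #elif defined(STM32G491xx)
      #include "stm32g4xx_hal.h"
    #endif

    static void icache_config(void)
    {
    #if defined(STM32F722xx)
      SCB_EnableICache();                       /* Cortex-M7 core instruction cache (CMSIS) */
    #elif defined(STM32G491xx)
      __HAL_FLASH_INSTRUCTION_CACHE_DISABLE();  /* G4 flash instruction cache, FLASH_ACR.ICEN */
    #endif
    }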

When the ICache is disabled, everything appears to work perfectly fine on both platforms. This is independent of other settings, like DCache enabled or disabled.

The code for all three targets, F103RC, F722ZE, and G491RE, is largely identical, except for the hardware layers, which are of course somewhat different for the three.

I am well aware of the cache coherency issue (and I think I have scanned the web thoroughly). However, in all documents I have seen it has always been related only to the DCache, never to the ICache, and this obviously makes lots of sense.

I certainly don't do any fancy things with instructions, like (re)writing code in flash and calling it, or moving code to RAM, or whatever else. All code goes into flash as normal and that's it.

So, I find it hard to even imagine what could go wrong with enabling the ICache, since coherency issues as known for DCache should not be possible.

I'm really clueless and desperate, and any hint would be much appreciated.

Many thx, Olli

6 REPLIES

Fails how? No data comes out? The wrong data comes out? The SPI/DMA simply fail to interact?

What you describe sounds more like a critical timing issue.

Does changing the optimization to the lowest level change the behaviour?

If yes, can you bisect the use of optimization to a function level to isolate probable culprits?
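If you're building with GCC (the STM32CubeIDE toolchain), a sketch of how that per-function bisecting could look; spi_start_transfer and spi_poll are placeholder names:

    /* Force a single suspect function to -O0 while the rest of the file keeps -Os/-O3 */
    #pragma GCC push_options
    #pragma GCC optimize ("O0")
    void spi_start_transfer(void)
    {
      /* ... the SPI/DMA start sequence under test ... */
    }
    #pragma GCC pop_options

    /* Alternatively, per function via an attribute */
    __attribute__((optimize("O0"))) void spi_poll(void);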

Similarly, do fencing operations alter behaviour? i.e. __ISB(), __DSB(), etc.

Use of DCache clean to similar effect?

If you build with professional tools like Keil or IAR, do these exhibit the same behaviour?

ART cache settings on the flash/prefetch sides?

Flash wait states? Perhaps try a max operating speed of 24-27 MHz.

Things volatile that should be, and compiler honouring those in generated code?
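As a sketch of the classic trap meant here (names are illustrative only): a completion flag written from an ISR and polled in the foreground; without volatile, an optimized build may keep the flag in a register and never see the update.

    #include <stdint.h>

    static volatile uint8_t spi_dma_done;    /* volatile forces the poll loop to re-read memory */

    void spi_dma_complete_callback(void)     /* called from the DMA transfer-complete IRQ */
    {
      spi_dma_done = 1;
    }

    void spi_wait_done(void)
    {
      while (!spi_dma_done) { }              /* with volatile, each pass reloads the flag */
      spi_dma_done = 0;
    }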

waldmanns
Associate II

Many thx for your response, and the many hints.

Unfortunately, some of them I'm not able to answer or try out. Let me tell you what I can say at this point in time:

Yes, I too considered a critical timing issue, i.e., that the higher frequency/speed may spoil some of my many timings. However, I think I have looked at this carefully, at least as much as I could. Delays, e.g., are now all done based on a timer (and not just sloppy loops), and so on. So, while I for sure cannot rule this out 100%, I think it's not the likeliest cause.

I have not looked at optimization effects. I use -Os except for a very few time-critical functions where I have -O3. I must admit I have difficulties understanding how that would cause the issue. And I'm also not sure what I could/should do if this showed some effect. You suggest to "only" enclose the SPI part in different O levels?

__ISB(), __DSB(), etc.: not tried. I have never used them so far in any of my code(s), so no experience with them. Where should I try them? In the places where one would do DCache maintenance?

DCache is disabled.

No Keil or IAR at my disposal.

ART & prefetch do not have any effect. I have them disabled by default.

The flash wait states are as produced by STM32CubeIDE, and I have checked that they conform to what the datasheets say. Also, I'm not sure I understand why that should have an effect; the ICache wouldn't read out the flash faster, I would argue.

Well, lower operating frequencies/speeds are not an option; I mean, even if that would make it work, there is no reason to go with an F7/G4 if they fall behind the F1, right?

volatile: I think I did, but you are right, I'd better carefully inspect this again, and maybe see what happens if I add more than strictly needed.

thx again so far, much appreciated

Optimization allows for reordering and folding; the order of completion and events may be changed subtly vs the linear flow one might anticipate. Success without optimization flags a whole class of failure mechanisms around code generation.

Writes to memory are buffered while code execution through the pipeline continues. The design is supposed to guarantee order of completion, not time of completion; this can create some hazards with peripherals clocking at slower speeds, and where internal states might expect a clock, or two, between interactions, or a read-back. One of the biggest here is clearing an interrupt in a peripheral, and the NVIC tail-chaining back into your IRQ handler.
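A sketch of that last hazard, with placeholder names (SENSOR_TIM standing in for whatever instance is involved): the flag-clearing write sits in the write buffer, the handler exits, and the still-pending line lets the NVIC tail-chain straight back in.

    #include "stm32g4xx_hal.h"               /* or the F7 equivalent */
    #define SENSOR_TIM TIM3                  /* placeholder timer instance */

    void SENSOR_TIM_IRQHandler(void)
    {
      SENSOR_TIM->SR = ~TIM_SR_UIF;          /* clear the update flag: a buffered write */
      /* ... handler work ... */
      __DSB();                               /* or read SR back: ensure the clear has reached
                                                the peripheral before the handler returns */
    }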

The CM7 is superscalar; fun aspects there with how long it takes to execute instruction(s) and the ordering. Although that wouldn't explain issues on the G4 (CM4).

The ART + prefetch + flash, yeah, I'm grasping there; critical paths typically fault the micro, but the ICache-related problem intrigues me.

Optimization/volatile: a lot of compilers are going to re-read everything without optimization; less folding, less holding of values in registers.


Fencing instructions: any place where you think what you've done immediately before creates dependencies on what's going to happen next. Stop sign vs yield sign.

ARM eschews the use of transistors for interlocks and pipeline protection; Intel would use a shed-load. Be conscious of potential hazards: the cliffs have no guard rails, and there might not be signs.

Peripheral registers are not memory; they are a window onto a state machine or combinational logic that runs in parallel and potentially at a different pace, in some cases not entirely synchronously with the MCU.

waldmanns
Associate II

Concerning optimization, I think I understand what it does, and I would have argued that the order of peripheral register accesses is not changed, so what difference should it make ... but as you suggested, I did some tests with the optimization level ... and this indeed turns out to have been an excellent suggestion!!!

On the G491RE target, I simply changed the global flag through -O0, -O1, -O2, -O3, -Os, -Ofast, i.e., all levels provided in STM32CubeIDE, and all failed - except for -O0, with which it does work!!

So, next I wrapped just the SPI DMA part with -O0 ... and this again did not work ...

I've also placed lots of __ISB()/__DSB() ... did not help ...

I've ended up adding a small delay after setting the CS signal ... and THIS did the trick!! It also makes sense, and goes exactly along the lines of what you were arguing ... obviously, that's the conclusion: the time between setting CS and the write&read had become too short on these faster MCUs!
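In case it helps others, the change boils down to something like this sketch; the pin macros, the delay_us() helper, and the 1 µs value are placeholders, and the real value comes from the sensor's CS setup-time spec:

    #include "main.h"                              /* HAL + CubeMX pin defines */

    extern SPI_HandleTypeDef hspi1;                /* stand-ins for the project's real names */
    extern void delay_us(uint32_t us);             /* timer-based microsecond delay */

    void sensor_start_transfer(uint8_t *tx_buf, uint8_t *rx_buf, uint16_t len)
    {
      HAL_GPIO_WritePin(SENSOR_CS_GPIO_Port, SENSOR_CS_Pin, GPIO_PIN_RESET);  /* assert CS (low) */
      delay_us(1);                                 /* give the sensor its CS setup time */
      HAL_SPI_TransmitReceive_DMA(&hspi1, tx_buf, rx_buf, len);
    }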

Given that I have fooled around with this for 3 weeks now, I'm really glad it's solved, although I also feel a bit stupid. It obviously was very helpful to have a second pair of eyes look at things. MANY THX!!!

Since we are at it, please let me ask something I can't figure out from the sources I've found:

It's clear to me that for the F7, enabling the DCache requires implementing one of the suggested measures to counter cache incoherency, and I do see the effects of not doing so in tests. However, for the G4 I have not found any such comment. Moreover, for the G4 the STM32CubeIDE generates code with both ICache and DCache enabled but without any of these DCache coherency code pieces even when DMA is being used. And I also could not yet see any bad effect of enabling DCache without adding data coherency code in my tests. So, all this makes me wonder: For the G4, is it indeed possible to just enable the DCache without any cache coherency issues?

TDK
Guru

Perhaps look at the SPI lines in both cases to narrow down the problem to the master or the slave. Sounds like the slave might need more setup time than you're allowing. I wouldn't assume code is mature/robust just because it's old.

> Moreover, for the G4 the STM32CubeIDE generates code with both ICache and DCache enabled but without any of these DCache coherency code pieces even when DMA is being used.

The CubeMX generated code doesn't actually transmit/receive any data, it just sets things up. You need to manage the cache when you transmit or receive stuff.

> And I also could not yet see any bad effect of enabling DCache without adding data coherency code in my tests. So, all this makes me wonder: For the G4, is it indeed possible to just enable the DCache without any cache coherency issues?

If data cache is enabled, you need to handle it properly. Maybe it will work if you don't, but that doesn't mean it's correct or that bugs won't appear in the future.
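For reference, on the Cortex-M7 parts (F7/H7), where the core data cache is the one that matters for DMA buffers, the usual handling looks roughly like this sketch; hspi1, the buffer names, and the length are placeholders, and the buffers should be 32-byte aligned with sizes that are multiples of a cache line:

    SCB_CleanDCache_by_Addr((uint32_t *)tx_buf, len);        /* flush TX data to SRAM before the DMA reads it */
    HAL_SPI_TransmitReceive_DMA(&hspi1, tx_buf, rx_buf, len);
    /* ... wait for the transfer-complete callback ... */
    SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buf, len);   /* discard stale cache lines before using RX data */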
