STM32F7, QUADSPI and the data cache - odd behaviour

andy · ‎2018-06-12

Posted on June 12, 2018 at 12:41

Hi all,

I have a board using the STM32F765, which uses the QSPI interface to fetch data from an FPGA. The normal sequence of operation is:

1) A timer in the FPGA causes it to perform a sequence of events which produce a chunk of data, which is stored in internal FPGA memory.
2) The FPGA asserts an output indicating to the STM32 that data is available.
3) This assertion causes an EXTI interrupt, which kicks off a chain of events that use the QSPI interface to read data from the FPGA. (The FPGA is programmed to emulate a serial Flash memory.)
4) The QSPI interface is set to indirect read mode, and DMA2 is used to read data from the QSPI FIFO and write it into standard (not DTCM) RAM.
5) Once each DMA transfer is complete, the ISR cleans and invalidates the relevant portions of the data cache (by address).
6) There are usually several blocks of data to read at a time. Once they've all been read, the FPGA is reset (via a separate SPI interface) and a flag is set to indicate to the main application that a block of data is available to process.
7) The main application processes the data, then sits and waits for the next block, and so on.
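
For the by-address cache maintenance step in the sequence above, here is a minimal sketch of how the range might be computed. It assumes the 32-byte data cache line of the Cortex-M7; the helper name `cache_span_for` and the type `cache_span_t` are hypothetical and not taken from the original code. It only does the address arithmetic, so it can be checked on a host.

```c
#include <assert.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 data cache line size in bytes */

/* Hypothetical helper type: a cache-line-aligned span covering a DMA buffer. */
typedef struct {
    uintptr_t start;  /* buffer start rounded down to a cache line boundary */
    uint32_t  size;   /* length rounded up to a whole number of cache lines */
} cache_span_t;

/* Compute the smallest line-aligned span that covers [addr, addr + len). */
static cache_span_t cache_span_for(uintptr_t addr, uint32_t len)
{
    const uintptr_t line_mask = (uintptr_t)(DCACHE_LINE_SIZE - 1u);
    cache_span_t span;

    assert(len > 0u);

    uintptr_t end = addr + len;                           /* one past the last byte */
    uintptr_t aligned_end = (end + line_mask) & ~line_mask;

    span.start = addr & ~line_mask;
    span.size  = (uint32_t)(aligned_end - span.start);
    return span;
}
```

On the target, the span would then be handed to a CMSIS call such as `SCB_CleanInvalidateDCache_by_Addr((uint32_t *)span.start, (int32_t)span.size)`. If the DMA buffer is not itself aligned to 32 bytes, the span also covers neighbouring variables that share its first or last cache line, so aligning and padding the buffer is usually the safer choice.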

The problem I'm seeing is that, just occasionally, the QSPI chip select goes active after all the data has been read from the FPGA. The next time the QSPI interface is used, its status register indicates that it's busy and has a full FIFO, as if a read operation has been started, but no DMA has been set up to actually put the results somewhere.
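
The "busy with a full FIFO" state described above could be detected defensively before starting the next read. The following is only a sketch: the bit positions (BUSY at bit 5, FLEVEL at bits 13:8, 32-byte FIFO) are what I understand from the STM32F7 reference manual and should be checked against the device header, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed QUADSPI_SR layout for STM32F7; verify against your device header. */
#define QSPI_SR_BUSY_BIT     (1u << 5)
#define QSPI_SR_FLEVEL_POS   8u
#define QSPI_SR_FLEVEL_MASK  (0x3Fu << QSPI_SR_FLEVEL_POS)
#define QSPI_FIFO_DEPTH      32u   /* FIFO size in bytes (assumed) */

/* True if the peripheral reports an operation in progress. */
static bool qspi_sr_busy(uint32_t sr)
{
    return (sr & QSPI_SR_BUSY_BIT) != 0u;
}

/* Number of valid bytes currently held in the FIFO. */
static uint32_t qspi_sr_fifo_level(uint32_t sr)
{
    return (sr & QSPI_SR_FLEVEL_MASK) >> QSPI_SR_FLEVEL_POS;
}

/* Busy and FIFO full while no transfer is expected: the symptom in this post. */
static bool qspi_sr_looks_stuck(uint32_t sr)
{
    return qspi_sr_busy(sr) && (qspi_sr_fifo_level(sr) >= QSPI_FIFO_DEPTH);
}
```

On the target, `sr` would be a snapshot of `QUADSPI->SR` taken before a new read is set up. A check like this only reports the symptom; it does not explain or prevent the spurious transaction.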

I've spent the last day or so using a scope and some GPIO signals to determine what is happening when, and here's where it gets really interesting. I now know that the spurious QSPI activation is not caused by the code which intentionally initiates QSPI reads. Instead, three conditions must be met in order to cause it:

1) The main code must be actually accessing the data from the FPGA, which is in cacheable RAM (SRAM1);
2) The data cache must be enabled;
3) The SysTick interrupt handler must have just exited.

The SysTick handler is very simple; usually it just sets a few flags and increments some counters, and occasionally it generates some debug output (though this has no effect on whether the spurious QSPI event is triggered). Nevertheless, QSPI CS goes low within a few nanoseconds of the handler exiting.

If I turn the data cache off, then all is well and there are no spurious QSPI events.

If I leave it on while data is being fetched from the FPGA, but turn it off while processing, then that's OK too.

Calling SCB_CleanInvalidateDCache() at the end of the SysTick handler makes no difference.

Putting the data from the FPGA into DTCM RAM does fix the symptoms, but I don't know why.

My ISR normally leaves the QSPI interface enabled. If I turn QSPI off by writing 0 to the control register once it's finished with, then this does prevent spurious QSPI transactions from occurring - but since I don't know why they're occurring in the first place, I can't be sure there isn't something else bad also happening for the same reason.

I never get a spurious QSPI event when the main loop is sitting waiting for new data to arrive; only while it's actually working on that data. This is actually quite a short window of time; if I turn off hardware floating point support, which makes the processing take longer, then spurious QSPI events can occur within a wider time window after each block is received.

So, I have a few workarounds for the spurious QSPI events: turn off the data cache (at least while the data is being processed), move the data into DTCM, or turn off the QSPI interface when it's not being used. None of these really explain the problem, though they do make the symptom go away.

My best guess is that exiting the SysTick handler while the cache contains data from SRAM1 is causing a number of cache operations to occur, and one of these is writing to QUADSPI->AR, or triggering the QSPI interface in some other way.

It *almost* feels like some obscure erratum, i.e. 'QSPI interface can be triggered by data cache operations on return from interrupts', but I'd rather fix my code than blame it on something that's 'clearly' a hardware bug that nobody else seems to have noticed!

Any suggestions please, experts?

STOne-32 · ‎2019-01-17

Dears,

Thank you for spotting this complex case. We will discuss it internally if you can send me the OLS ticket number privately, for close follow-up with your local contact.

In the meantime, at the end of December our core partner Arm published a new erratum for the Cortex-M7 cache that may be the root cause (to be confirmed once we have reproduced the case on our side):

Category A - 1259864

Data corruption in a sequence of Write-Through stores and loads

Workaround

There is no direct workaround for this erratum.

Where possible, Arm recommends that you use the MPU to change the attributes on any Write-Through memory to Write-Back memory. If this is not possible, it might be necessary to disable the cache for sections of code that access Write-Through memory.
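
As an illustration of the attribute change recommended in the workaround text above, here is a sketch of the bit manipulation on an ARMv7-M `MPU_RASR` value. The helper names are hypothetical. It only recognises the simple TEX=000, C=1, B=0 Write-Through encoding, not the TEX=1xx forms with separate inner and outer policies, and it chooses TEX=001, C=1, B=1 (Write-Back, write and read allocate) as the replacement.

```c
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-M MPU_RASR memory attribute fields. */
#define RASR_B          (1u << 16)
#define RASR_C          (1u << 17)
#define RASR_TEX_POS    19u
#define RASR_TEX_MASK   (7u << RASR_TEX_POS)
#define RASR_ATTR_MASK  (RASR_TEX_MASK | RASR_C | RASR_B)

/* True for TEX=000, C=1, B=0: Write-Through, no write allocate. */
static bool rasr_is_write_through(uint32_t rasr)
{
    return (rasr & RASR_ATTR_MASK) == RASR_C;
}

/* Replace the attribute bits with TEX=001, C=1, B=1 (Write-Back, write and
   read allocate), leaving size, access permissions and enable untouched. */
static uint32_t rasr_make_write_back(uint32_t rasr)
{
    return (rasr & ~RASR_ATTR_MASK) | (1u << RASR_TEX_POS) | RASR_C | RASR_B;
}
```

The resulting value would be written back to `MPU->RASR` for the affected region with the MPU disabled during the update. Whether Write-Back is acceptable for a given region depends on who else reads that memory (DMA, for example), which this sketch does not address.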

We will soon update the STM32 errata sheets for the Cortex-M7 based series (F7/H7) as well. The direct link on our partner's web page is here (login/registration required): https://silver.arm.com/download/download.tm?pv=4462869&p=1929427

Regards,

STOne-32

andy · ‎2019-01-30

Any update please, ST?

Unfortunately the link you've posted to the ARM errata just takes me to a login page. Not useful.

Christensen.Tyler · ‎2019-01-30

After my 6th ignored "can someone please respond to this ticket?" over the last 2.5 months, yesterday someone finally responded saying they thought the issue was resolved. They are now aware the issue has not been resolved. We'll see if that avenue of help goes anywhere.

JMund · ‎2019-01-30

Do you just need the errata?

To anyone wondering: it basically just says there is an issue with the cache and that random data corruption can occur at any time, depending on some internal timing issue... There is no workaround, and the advice is to disable all caches. As an additional protection I would also recommend setting up the MPU to disable all access to anything except what you need... I’ve spent months working through this stuff. If anyone needs some advice, call me; I know how frustrating it can be: +1 250 256 8363 PST

andy · ‎2019-01-30

Wow, that's... dramatic. So basically the problem is that the cache on M7 is broken?

That sounds a whole lot more serious than I was expecting. I thought it was just something very specific to QSPI on this particular device family, since that's the peripheral which seems to be showing problems.

Guess I'll just turn the cache off and be done with it. Some official comment from ST, even if it's only an updated errata document for the 765, would be nice, though.

JMund · ‎2019-01-30

Yes, the cache is screwed... It's fixed in new silicon, but who knows how long it will take before ST starts shipping updated chips.

andy · ‎2019-01-30

Thank you, but I'm not sure I see how this specific erratum results in spurious activity on the QSPI bus.

I interpret the symptom as "CPU may read data which is out-of-date", which clearly is bad news and well worth knowing, but it doesn't explain reads occurring to locations or devices which the code isn't accessing.

Christensen.Tyler · ‎2019-01-30

I agree, I see no real reason to believe this cache errata entry is at all related to this QSPI bug.

JMund · ‎2019-01-30

In my case a pointer was corrupted, which was causing a read from an out-of-bounds area... ARM is downplaying the significance of this bug, I believe... From the sounds of it, though, you may be encountering something else entirely... After we fixed the caching stuff we have had no trouble hitting very high uptimes: out of 50 devices tested, we have not had a single crash or bus error in the 12 days of uptime since the incident (600 days cumulative).

CHead · ‎2019-01-30

The ARM erratum only talks about write-through areas, and it only says that the data read back after a write may be incorrect. So I wouldn’t say the cache is useless; it’s still useful for RAM areas (which are write-back instead of write-through) and read-only areas. It also sounds like it’s not related to the QUADSPI issue.

As for the QUADSPI issue, the CPU core is allowed to make speculative reads to any accessible, normal memory area whenever it wants (and 0x90000000, the QUADSPI memory mapped area, is of normal type in the default memory map). One might expect that, when the QUADSPI peripheral is not configured in memory mapped mode, such accesses ought to be ignored, but it seems that’s not the case. I just tried setting the address register to a nonzero value, reading from 0x90000000, and then reading back the address register, with the QUADSPI peripheral not even enabled (never mind set to memory-mapped mode). As expected the read from 0x90000000 faulted, but the address register was zeroed as a side effect! If such a read were generated speculatively, this would perfectly explain the observed behaviour. I’d suggest using the MPU to make the 0x90000000 region no-access when not using memory-mapped mode; according to the Cortex-M7 technical reference manual, speculative reads should not be generated to no-access memory regions. I’d be interested to hear if this helps solve anyone else’s issue.
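
A minimal sketch of the MPU suggestion above, assuming the ARMv7-M MPU register layout and a 256 MB QUADSPI window at 0x90000000. The helper `mpu_no_access_region`, the struct name and the choice of region number 7 are hypothetical; the limit of 8 regions is my assumption for the STM32F7 and can be checked in `MPU->TYPE`. The function only computes the `RBAR`/`RASR` values, so it can be verified off-target.

```c
#include <assert.h>
#include <stdint.h>

#define QSPI_MMAP_BASE       0x90000000u  /* QUADSPI memory-mapped window */
#define QSPI_MMAP_LOG2_SIZE  28u          /* 2^28 bytes = 256 MB */

/* Hypothetical container for the two MPU registers describing one region. */
typedef struct {
    uint32_t rbar;
    uint32_t rasr;
} mpu_region_regs_t;

/* Build RBAR/RASR for an enabled, execute-never, no-access region. */
static mpu_region_regs_t mpu_no_access_region(uint32_t region, uint32_t base,
                                              uint32_t log2_size)
{
    mpu_region_regs_t regs;

    assert(region < 8u);                              /* assumed region count */
    assert(log2_size >= 5u && log2_size <= 31u);      /* 32 bytes .. 2 GB */
    assert((base & ((1u << log2_size) - 1u)) == 0u);  /* base aligned to size */

    regs.rbar = base | (1u << 4) | region;   /* VALID=1 selects REGION */
    regs.rasr = (1u << 28)                   /* XN: no instruction fetch */
              | (0u << 24)                   /* AP=000: no access at all */
              | ((log2_size - 1u) << 1)      /* SIZE: region is 2^(SIZE+1) */
              | 1u;                          /* ENABLE */
    return regs;
}

/* On the target (CMSIS names), roughly:
 *   MPU->CTRL = 0;
 *   MPU->RBAR = regs.rbar;
 *   MPU->RASR = regs.rasr;
 *   MPU->CTRL = MPU_CTRL_PRIVDEFENA_Msk | MPU_CTRL_ENABLE_Msk;
 *   __DSB(); __ISB();
 */
```

For the QUADSPI window this yields `rbar = 0x90000017` and `rasr = 0x10000037` when region 7 is used. The region would need to be disabled or reconfigured again before switching the peripheral into memory-mapped mode.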