
STM32F7, QUADSPI and the data cache - odd behaviour

andy
Associate II
Posted on June 12, 2018 at 12:41

Hi all,

I have a board using the STM32F765, which uses the QSPI interface to fetch data from an FPGA. The normal sequence of operation is:

  • A timer in the FPGA causes it to perform a sequence of events which produce a chunk of data, which is stored in internal FPGA memory
  • FPGA asserts an output indicating to the STM32 that data is available
  • This assertion causes an EXTI interrupt, which kicks off a chain of events which use the QSPI interface to read data from the FPGA. (The FPGA is programmed to emulate a serial Flash memory).
  • The QSPI interface is set to indirect read mode, and DMA2 is used to read data from the QSPI FIFO and write it into standard (not DTCM) RAM.
  • Once each DMA transfer is complete, the ISR cleans and invalidates the relevant portions of the data cache (by address).
  • There are usually several blocks of data to read at a time. Once they've all been read, the FPGA is reset (via a separate SPI interface) and a flag is set to tell the main application that a block of data is available to process.
  • The main application processes the data, then sits and waits for the next block, and so on.
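The cache maintenance step in the sequence above works on whole cache lines, so the range passed to a by-address call should cover every line the DMA buffer touches. A minimal sketch of that arithmetic, assuming the 32-byte data cache line of the Cortex-M7; `cache_range_t` and `cache_range_for` are hypothetical names, not part of CMSIS or the HAL:

```c
#include <stddef.h>
#include <stdint.h>

/* Cortex-M7 data cache line size in bytes. */
#define DCACHE_LINE_SIZE 32u

/* Hypothetical helper type: a line-aligned range covering a buffer. */
typedef struct {
    uintptr_t start;   /* address of the first cache line touched */
    size_t    length;  /* multiple of DCACHE_LINE_SIZE, 0 for an empty buffer */
} cache_range_t;

/* Round the buffer start down and the buffer end up to line boundaries. */
static cache_range_t cache_range_for(uintptr_t addr, size_t len)
{
    cache_range_t r = { 0, 0 };
    if (len == 0) {
        return r;
    }
    const uintptr_t mask  = (uintptr_t)DCACHE_LINE_SIZE - 1u;
    const uintptr_t start = addr & ~mask;
    const uintptr_t end   = (addr + len + mask) & ~mask;
    r.start  = start;
    r.length = (size_t)(end - start);
    return r;
}
```

On the target, the resulting `start` and `length` would be what gets handed to a CMSIS call such as `SCB_InvalidateDCache_by_Addr`. If the DMA buffer itself is not aligned to 32 bytes, the rounded range also covers neighbouring variables, and invalidating it can discard their unwritten data, so aligning and padding the buffer is the safer arrangement.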

The problem I'm seeing is that, just occasionally, the QSPI chip select goes active after all the data has been read from the FPGA. The next time the QSPI interface is used, its status register indicates that it's busy and has a full FIFO, as if a read operation has been started, but no DMA has been set up to actually put the results somewhere.

I've spent the last day or so using a scope and some GPIO signals to determine what is happening when, and here's where it gets really interesting. I now know that the spurious QSPI activation is not caused by the code which intentionally initiates QSPI reads. Instead, three conditions must be met in order to cause it:

  • 1) The main code must be actually accessing the data from the FPGA, which is in cacheable RAM (SRAM1);
  • 2) The data cache must be enabled;
  • 3) The SysTick interrupt handler must have just exited.

The SysTick handler is very simple; usually it just sets a few flags and increments some counters, and occasionally it generates some debug output (though this has no effect on whether the spurious QSPI event is triggered). Nevertheless, QSPI CS goes low within a few nanoseconds of the handler exiting.

If I turn the data cache off, then all is well and there are no spurious QSPI events.

If I leave it on while data is being fetched from the FPGA, but turn it off while processing, then that's OK too.

Calling SCB_CleanInvalidateDCache() at the end of the SysTick handler makes no difference.

Putting the data from the FPGA into DTCM RAM does fix the symptoms, but I don't know why.

My ISR normally leaves the QSPI interface enabled. If I turn QSPI off by writing 0 to the control register once it's finished with, then this does prevent spurious QSPI transactions from occurring - but since I don't know why they're occurring in the first place, I can't be sure there isn't something else bad also happening for the same reason.

I never get a spurious QSPI event when the main loop is sitting waiting for new data to arrive; only while it's actually working on that data. This is actually quite a short window of time; if I turn off hardware floating point support, which makes the processing take longer, then spurious QSPI events can occur within a wider time window after each block is received.

So, I have a few workarounds for the spurious QSPI events: turn off the data cache (at least while the data is being processed), move the data into DTCM, or turn off the QSPI interface when it's not being used. None of these really explain the problem, though they do make the symptom go away.

My best guess is that exiting the SysTick handler while the cache contains data from SRAM1 is causing a number of cache operations to occur, and one of these is writing to QUADSPI->AR, or triggering the QSPI interface in some other way.

It *almost* feels like some obscure erratum, i.e. 'QSPI interface can be triggered by data cache operations on return from interrupts', but I'd rather fix my code than blame it on something that's 'clearly' a hardware bug that nobody else seems to have noticed!

Any suggestions please, experts?

Christensen.Tyler
Associate II

The best I've come up with is turning off caching and disabling all read/write access in the 256MB memory block starting at 0x90000000. The theory is that when memory-mapped QSPI is disabled, it is illegal to access this address range. Under the default MPU configuration, though, the core speculatively accesses it whenever a value in that range happens to appear in certain registers, and that leaves the QSPI peripheral in an invalid state.
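As a rough sketch of what that region setup amounts to at register level, the following computes the ARMv7-M `MPU_RBAR` and `MPU_RASR` values for a 256MB no-access, execute-never, non-cacheable region at 0x90000000. The helper names and the region number used in the example are assumptions for illustration; this is not the code from my project:

```c
#include <stdint.h>

/* ARMv7-M MPU_RASR fields used here. */
#define RASR_ENABLE   (1u << 0)                 /* region enable */
#define RASR_SIZE(n)  ((uint32_t)(n) << 1)      /* region size = 2^(n+1) bytes */
#define RASR_AP_NONE  (0u << 24)                /* no access, any privilege */
#define RASR_XN       (1u << 28)                /* instruction fetch disabled */
/* TEX, C and B are all left at 0: strongly-ordered, never cached. */

/* ARMv7-M MPU_RBAR fields. */
#define RBAR_VALID    (1u << 4)                 /* use REGION field below */

#define QSPI_MAPPED_BASE  0x90000000u           /* memory-mapped QSPI window */
#define LOG2_256MB        28u

/* RASR value: 256MB, no access, execute-never, strongly-ordered, enabled. */
static uint32_t qspi_guard_rasr(void)
{
    return RASR_XN | RASR_AP_NONE | RASR_SIZE(LOG2_256MB - 1u) | RASR_ENABLE;
}

/* RBAR value selecting the given region number (0..15). */
static uint32_t qspi_guard_rbar(uint32_t region)
{
    return QSPI_MAPPED_BASE | RBAR_VALID | (region & 0xFu);
}
```

With the ST HAL the same intent would normally be expressed through `HAL_MPU_ConfigRegion` using `MPU_REGION_NO_ACCESS` and `MPU_REGION_SIZE_256MB`. Which region number to use depends on what else the application has already configured.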

It seems to have fixed it for me, but so did half a dozen other "fake fixes", including adding random NOPs, so only time will tell if this is a long-term fix.

I reproduced this on a Discovery board and have had a support ticket open for 2.5 months hoping to get it explained. They don't even respond anymore when I ask for a schedule update, so I don't think ST really has much interest in looking into this :(

>>I don't think ST really has much interest in looking into this

These are the sort of failures that should be red-flagged and investigated. @STOne-32​ 

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

If you're not normally writing, lock down and trap writes with the MPU

MPU_InitStruct.AccessPermission = MPU_REGION_PRIV_RO; // or MPU_REGION_PRIV_RO_URO

andy
Associate II

Thankfully I was able to put this project on the back burner for a while and didn't see that anyone had posted follow-ups - but purely by coincidence I have picked the job up again recently and remembered that there was this horrible cache problem that I still have to resolve one way or another...

I know it's absolutely no help to anyone, but it's incredibly comforting to find I'm not the only one, and that the problem is one that other engineers have also encountered and struggled to explain.

Also, kudos to anyone who read through my original post in its entirety; often I find that writing out a full description of a problem is enough to show me the answer - but not this time, sadly.

A few thoughts...

  • turning off QSPI when it's not in use probably isn't a bad idea
  • it's not a guaranteed fix, though, because if a spurious QSPI read can occur when executing code that doesn't use QSPI, then presumably it can also occur when the CPU *is* executing code that uses QSPI
  • the fact that other people have also had issues specifically related to QSPI gives me some confidence that QSPI may be the ONLY thing which is breaking - and that's reassuring. I can probably perform some additional checks before initiating a QSPI transfer, and presumably, once the QSPI interface is actually in use, nothing in the CPU can fiddle with it speculatively. Transfers which are started successfully should finish successfully.
  • More importantly, there probably isn't some other horrible memory corruption going on elsewhere for the same reason, which isn't going to come back and bite me eventually.

I do have another product using the same CPU but no QSPI, and although they share a lot of common code, the other product has never shown any odd unexplained symptoms. The 765 has forced me to learn a lot about how caches are used, and how their presence has to be taken into account (especially when moving data around with DMA).

"I can probably perform some additional checks before initiating a QSPI transfer, and presumably, once the QSPI interface is actually in use, nothing in the CPU can fiddle with it speculatively. Transfers which are started successfully should finish successfully."

I spent a while trying hack fixes like this and ultimately failed to make it 100% reliable. Basically, you can't turn the peripheral off fast enough to prevent spurious transmissions. I'd run into issues where I'd set up a flash chip for a page read, then perform the page read, but between the setup and the read a spurious chip reset might get sent out, and no matter how fast I am, I can't prevent it from going out. The page read then reads from the wrong address and the system fails.

Essentially there might be some conceivable way to use peripheral disabling to accomplish success but I found it to be just about impossible when really analyzed.

What if you put your QSPI driver code into DTCM RAM, so the cache isn't needed?

Can you wait around while QSPI transfers take place? If so, your DTCM code could:

  • enable QSPI
  • initialise the QSPI transfer
  • wait until it's complete
  • disable QSPI
  • return to main code in Flash

If the cache isn't being used at any time while the QSPI interface is enabled, would that stop it causing spurious reads?
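The sequence in that list could be sketched roughly as below. This is only an illustration under stated assumptions: `qspi_regs_t` is a hypothetical two-register stand-in rather than the real QUADSPI register layout, the `start` callback stands for whatever programs the actual transfer, and `max_polls` is an invented bound so the wait cannot hang. Placing the function in TCM RAM is left out because it depends on the linker script.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the peripheral: control and status only. */
typedef struct {
    volatile uint32_t CR;
    volatile uint32_t SR;
} qspi_regs_t;

#define QSPI_CR_EN    (1u << 0)   /* peripheral enable bit */
#define QSPI_SR_BUSY  (1u << 5)   /* transfer in progress */

/* Callback that programs and kicks off one transfer (may be NULL). */
typedef void (*qspi_start_fn)(qspi_regs_t *q);

/*
 * Enable the interface, start a transfer, poll until it is no longer
 * busy (or the poll budget runs out), then disable the interface again.
 * Returns true if the busy flag cleared within max_polls checks.
 */
static bool qspi_blocking_transfer(qspi_regs_t *q, qspi_start_fn start,
                                   uint32_t max_polls)
{
    bool done = false;

    q->CR |= QSPI_CR_EN;                 /* enable QSPI */
    if (start != NULL) {
        start(q);                        /* initialise the transfer */
    }
    for (uint32_t i = 0; i < max_polls; ++i) {
        if ((q->SR & QSPI_SR_BUSY) == 0u) {
            done = true;                 /* transfer complete */
            break;
        }
    }
    q->CR &= ~QSPI_CR_EN;                /* disable QSPI on every path */
    return done;
}
```

The interface is disabled whether or not the wait succeeds, so a timeout never leaves it enabled while the main code is running from Flash.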

JMund
Associate II
  • More importantly, there probably isn't some other horrible memory corruption going on elsewhere for the same reason, which isn't going to come back and bite me eventually.

I hate to break it to you, but we have experienced memory corruption going on elsewhere. We will try with all of the caches disabled...

It's possible, though I'm not sure. My system is a low-jitter, high-bandwidth realtime control loop, so I don't really have the ability to wait around for the transfer to take place; all sorts of other loops and interrupts must have a much higher priority than downloading/uploading to the flash chip. If you basically locked up the CPU from doing anything except your highly controlled QSPI code from start to finish, I imagine you could get that working well (possibly by trial and error on exactly what that code contains, but once it works just don't touch it).

andy
Associate II

Are you using DMA? Is the corrupted data something which has been moved that way?

I had quite a few problems with corrupt or missing data, which turned out to have been moved into the cache by a DMA controller but never flushed to RAM. It was a straightforward fix (clean / invalidate data cache before accessing data that has been moved via DMA) once I knew what the problem was, though a few (more) of my hairs turned grey in the process.

Christensen.Tyler
Associate II

I partially use DMA (used for large data streams but not for flash management, address changes, etc. where each message is just a few bytes), but it's not the problem. I can disable all the DMA driven aspects and I still see things on a scope going out on the QSPI bus that I never requested be transmitted.