Help with odd STM32F765BI QSPI behavior? We're seeing what looks like an extra or spurious read transaction.

MNels · 2019-01-02

We're seeing what looks like an extra or spurious read from the QSPI block on our STM32F765BI. As yet, we're not able to pinpoint the cause and could use some ideas of where else to look. Sorry that this gets a bit long.

Background:

We're using the QSPI interface in Quad mode with a 1Gb Micron MT25QL01GB part.
We're using HAL with our own EEPROM and NVM code modules to create storage for configuration data and log files.
There's no OS, but we've written the NVM and EEPROM layers to be non-blocking during a block erase.
Reads are always blocking.
On each pass of the foreground loop we check that the NVM, EEPROM, and HAL_QSPI code and the QSPI block are all coherent: either idle and OK, or busy with an erase.
Reads and writes are from a few bytes to < 16kB.
We're using the indirect access mode, not interrupt or DMA.
The log files are by far the bulk of our NVM accesses. The sensor log reads/writes 1280 bytes at a time and has 100MB of dedicated space.
Writes have never been a problem.
About 1 in 10,000,000 reads of the sensor log causes a problem.

The problem:

All requested read transfers complete successfully and we don't see any data corruption or issues.
After the read completes the QSPI block bits show: not BUSY, TCF is high, no errors and 0 data in the FIFO (FLEVEL = 0).
About once in 10M reads, when we come back around in the foreground, the NVM, EEPROM, and HAL_QSPI code status is idle and OK, but the QSPI block indicates it is busy with a read. The FIFO is full and the read is suspended waiting for room in the FIFO.
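The foreground coherence check described above could be sketched roughly as follows. This is only an illustration, not the poster's code: the function names and the `sw_idle` flag are hypothetical, and the QUADSPI_SR bit positions are my reading of the STM32F7 reference manual, so verify them against your device header. It takes the raw status register value as a parameter so it can be exercised off-target.

```c
#include <stdbool.h>
#include <stdint.h>

/* QUADSPI_SR fields. Positions are an assumption taken from the STM32F7
   reference manual; check them against your device header. */
#define QSPI_SR_TEF        (1u << 0)   /* transfer error flag */
#define QSPI_SR_TCF        (1u << 1)   /* transfer complete flag */
#define QSPI_SR_BUSY       (1u << 5)   /* operation ongoing */
#define QSPI_SR_FLEVEL_POS 8u
#define QSPI_SR_FLEVEL_MSK (0x3Fu << QSPI_SR_FLEVEL_POS)

/* Number of bytes currently held in the FIFO. */
uint32_t qspi_sr_flevel(uint32_t sr)
{
    return (sr & QSPI_SR_FLEVEL_MSK) >> QSPI_SR_FLEVEL_POS;
}

/* Hardware looks idle: not BUSY, no transfer error, FIFO empty. */
bool qspi_hw_idle(uint32_t sr)
{
    return (sr & (QSPI_SR_BUSY | QSPI_SR_TEF)) == 0u
        && qspi_sr_flevel(sr) == 0u;
}

/* True when the software layers believe the bus is idle (hypothetical
   sw_idle flag) but the QSPI block is busy or still holds FIFO data. */
bool qspi_unexpected_activity(bool sw_idle, uint32_t sr)
{
    return sw_idle && !qspi_hw_idle(sr);
}
```

With these assumed positions, the stuck state from the post (BUSY set, FLEVEL = 0x20) corresponds to a status value of 0x2020, while the normal post-read state with only TCF set is 0x0002.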

Diagnostics:

We didn't see anything in the Errata that looked like this. Close, but not exactly.
Using the debugger we can consume data from the FIFO (each display update while viewing the QSPI registers consumes data from the FIFO). Eventually the FLEVEL count counts down 20 -> 1C -> 18 ... 8 -> 4 -> 0. At 0 bytes in the FIFO the TCF bit goes high and the BUSY bit clears.
We don't clear out the QSPI registers after a read. It seems to be repeating the last read - which already finished successfully.
We created an MPU region for the QSPI registers and enabled/disabled it as needed to access the QSPI block. It never hit and we still see the extra read. The QSPI block is already full and blocked when we come back in to NVM.
ARM Trace doesn't show any function calls that would access the QSPI block from the previous successful completion to detecting that the QSPI block is in the middle of a read.
We do have some DMA running, but it doesn't look like we're getting hosed down with a wild DMA. All of the configuration, size, read command, etc. register settings and data appear to be just as the last read.

Work Arounds:

We're able to prevent the issue in either of two ways.
First is to disable/enable the QSPI block only when we're using it. We essentially substituted the MPU enable/disable for setting/clearing the QSPI enable bit.
Second is to clear out the last read register settings. By issuing a 1-byte read to the chip status register we don't see the problem anymore.
The problems and work arounds replicate on multiple boards.
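The first workaround (keeping the QSPI block enabled only while it is in use) might look something like the sketch below. The helper names are hypothetical, and the control register is passed in as a pointer so the logic can be tried off-target; on the device it would be the address of QUADSPI->CR. The EN bit is bit 0 of QUADSPI_CR.

```c
#include <stdbool.h>
#include <stdint.h>

#define QSPI_CR_EN (1u << 0)   /* QUADSPI_CR enable bit */

/* Enable the QSPI block just before a transaction. */
void qspi_block_enable(volatile uint32_t *cr)
{
    *cr |= QSPI_CR_EN;
}

/* Disable the QSPI block once the transaction is finished.
   Only call this after BUSY has cleared; other CR bits are preserved. */
void qspi_block_disable(volatile uint32_t *cr)
{
    *cr &= ~QSPI_CR_EN;
}

bool qspi_block_enabled(const volatile uint32_t *cr)
{
    return (*cr & QSPI_CR_EN) != 0u;
}
```

Each read or write would then be bracketed by `qspi_block_enable()` and `qspi_block_disable()`, in the same places the MPU guard was previously toggled.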

Analysis:

By all appearances, it's as if the QSPI block thinks it's getting a write strobe and then repeats the last read.
We don't think it's coming from a wild FW or DMA access, but....
In theory, we think the chip SR read should also be susceptible to a spurious read and while it won't block on the FIFO full, should still leave an extra byte in the FIFO. We'd catch and flag this as an error, but all seems happy.

Our best guess at this time is that a specific set of register conditions makes the QSPI block susceptible to a glitch on the write line. We're not stuck, but would like to root cause the problem. We'd appreciate other ideas and techniques that can help us further identify the problem.

Thanks!!

Mike Nelson

waclawek.jan · 2019-01-03

I don't use the 'F7 nor the QSPI module, but some random, probably not helpful, thoughts:

any spurious reads to the memory-mapped area (even if memory-mapped mode is not enabled)?
which erratum do you think is close?
any other flags set in the status register, perhaps unexpectedly?
any timeout (hardware or software) which might interrupt an ongoing read and bypass the "busy clear" test at the end of read you've mentioned
try to change the writes-to-reads ratio artificially, does the problem occurrence follow this?
what exactly does "by issuing a 1-byte read to the chip status register" involve?
log all the commands issued to QSPI module in a circular buffer in RAM to see what was the sequence of events just before the problem occurred
observe the spurious data in FIFO to find a pattern/relationship to the last commands
instead of the MPU guard, enable/disable an interrupt on FIFO threshold (perhaps set low), and try to observe that
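The circular-buffer suggestion in the list above could be sketched along these lines. All names here are hypothetical; the idea is simply to record the values written to the command, address, and data-length registers just before each transaction, so the last few dozen commands can be inspected in the debugger when the fault is detected.

```c
#include <stddef.h>
#include <stdint.h>

#define QSPI_LOG_DEPTH 64u   /* must be a power of two */

typedef struct {
    uint32_t seq;   /* running command count */
    uint32_t ccr;   /* value written to QUADSPI_CCR */
    uint32_t ar;    /* value written to QUADSPI_AR */
    uint32_t dlr;   /* value written to QUADSPI_DLR (byte count - 1) */
} qspi_log_entry;

typedef struct {
    qspi_log_entry entries[QSPI_LOG_DEPTH];
    uint32_t next_seq;   /* sequence number of the next entry */
} qspi_log;

/* Record one command, overwriting the oldest entry once the log is full. */
void qspi_log_record(qspi_log *log, uint32_t ccr, uint32_t ar, uint32_t dlr)
{
    qspi_log_entry *e = &log->entries[log->next_seq & (QSPI_LOG_DEPTH - 1u)];
    e->seq = log->next_seq++;
    e->ccr = ccr;
    e->ar  = ar;
    e->dlr = dlr;
}

/* Entry recorded 'age' commands ago (0 = newest), or NULL if not retained. */
const qspi_log_entry *qspi_log_recent(const qspi_log *log, uint32_t age)
{
    if (age >= QSPI_LOG_DEPTH || age >= log->next_seq) {
        return NULL;
    }
    return &log->entries[(log->next_seq - 1u - age) & (QSPI_LOG_DEPTH - 1u)];
}
```

Calling `qspi_log_record()` immediately before each HAL command, and walking `qspi_log_recent(log, 0)`, `qspi_log_recent(log, 1)`, and so on at the point of failure, gives the sequence of events leading up to the spurious read.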

JW

MNels · 2019-01-04

Hi JW,

Thanks for the response. There are some really good ideas in your list. As a follow up, we've implemented the 'read chip status register' work around and downloaded 40GB worth of log files (ran all night) without a hiccup.

Point by point...

I hadn't considered the memory mapped behavior as we're not using it. It would be pretty easy to configure the MPU to check that for wild FW accesses.
Errata for the STM32F7[45]xx chips:
- SPI 2.11.2 "BSY bit may stay high at the end of a data transfer in Slave mode"
- QSPI 2.4.1 "Extra data written in the FIFO at the end of a read transfer"
There is no register bit corruption or changes that we can find.
Using trace, we're able to see from the code flow that the prior reads have all completed successfully. We also don't see any data corruption of the read data when this happens.
Changing the write/read mix is an interesting idea. We've changed the number of writes for other reasons, but not for this problem specifically.
For our EEPROM chip, the command to read its status register is 0x05. It performs a 1 byte read to the chip controller, not to its memory area (i.e. it doesn't need an address).
This is a good idea, we've just been checking against the immediate prior read.
We haven't tried this either other than to check for overall data consistency.
We considered this, but hadn't tried it. We're also considering hooking up a logic analyzer and triggering a GPIO on detecting the error to see what happens when the read starts up.

Thanks again for the ideas. Since this only happens on reads and not writes, and reads are rare, service-related events, it is not a customer-facing issue the way a write problem would be. Plus now we have some confidence in the work around.

I'll keep working on this on the side just to see if I can further identify where it's coming from. The biggest clue right now is that the main difference from the memory read to the status register read is the lack of an address. That kind of points to the address register as the potential source of the issue.
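To make the "lack of an address" difference concrete, here is a rough sketch of how the two QUADSPI_CCR values might be composed. The field positions are my reading of the STM32F7 reference manual and should be checked. The line modes (1-line instruction, 4-line address and data), the 32-bit address size, and the opcode and dummy-cycle parameters of the memory read are assumptions for illustration, not taken from the post. Only the 0x05 status read opcode comes from the thread. As I understand the reference manual, in indirect read mode a command with no address phase starts on the CCR write, while one with an address phase starts on the write to the address register.

```c
#include <stdbool.h>
#include <stdint.h>

/* QUADSPI_CCR field positions (assumed from the reference manual). */
#define CCR_IMODE_POS   8u   /* instruction phase lines */
#define CCR_ADMODE_POS 10u   /* address phase lines */
#define CCR_ADSIZE_POS 12u   /* address size */
#define CCR_DCYC_POS   18u   /* dummy cycles */
#define CCR_DMODE_POS  24u   /* data phase lines */
#define CCR_FMODE_POS  26u   /* functional mode */

enum { LINES_NONE = 0, LINES_1 = 1, LINES_2 = 2, LINES_4 = 3 };
enum { ADSIZE_32BIT = 3 };
enum { FMODE_INDIRECT_READ = 1 };

/* 1-byte status register read: instruction 0x05, no address phase. */
uint32_t qspi_ccr_read_status(void)
{
    return 0x05u
         | ((uint32_t)LINES_1 << CCR_IMODE_POS)
         | ((uint32_t)LINES_NONE << CCR_ADMODE_POS)
         | ((uint32_t)LINES_1 << CCR_DMODE_POS)
         | ((uint32_t)FMODE_INDIRECT_READ << CCR_FMODE_POS);
}

/* Memory read with an address phase; opcode and dummy cycles are
   hypothetical parameters chosen by the caller. */
uint32_t qspi_ccr_read_memory(uint8_t opcode, uint8_t dummy_cycles)
{
    return (uint32_t)opcode
         | ((uint32_t)LINES_1 << CCR_IMODE_POS)
         | ((uint32_t)LINES_4 << CCR_ADMODE_POS)
         | ((uint32_t)ADSIZE_32BIT << CCR_ADSIZE_POS)
         | ((uint32_t)(dummy_cycles & 0x1Fu) << CCR_DCYC_POS)
         | ((uint32_t)LINES_4 << CCR_DMODE_POS)
         | ((uint32_t)FMODE_INDIRECT_READ << CCR_FMODE_POS);
}

/* True when the configured command includes an address phase. */
bool qspi_ccr_has_address_phase(uint32_t ccr)
{
    return ((ccr >> CCR_ADMODE_POS) & 0x3u) != LINES_NONE;
}
```

After the status read work around, the block is left configured with no address phase, which is the one structural difference from the state left behind by a memory read.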

Kind regards

Mike Nelson

Amel NASRI · 2019-01-21

Hello @Community member ,

Thanks for sharing these details about issues you are facing and workarounds you applied.

For your information, and for other members who may face issues with QSPI on Cortex-M7 devices, there are 2 things that should be considered:

Suitable configuration for the MPU
Recently published errata by our Core/Partner ARM: Data corruption in a sequence of Write-Through stores and loads (https://silver.arm.com/download/download.tm?pv=4462869&p=1929427)

Regarding the first possible reason: some speculative read accesses may cause high latency or system errors when performed on external memories like SDRAM or Quad-SPI. This is a Cortex-M7 issue and the workaround is to properly configure the MPU.

You will find more "Special recommendations for Cortex®-M7 (STM32F7 Series)" in a dedicated section with this title in AN4861 (https://www.st.com/resource/en/application_note/dm00287603.pdf).

There is also a dedicated application note for QSPI (AN4760) entitled "Quad-SPI (QSPI) interface on STM32 microcontrollers". It is available from https://www.st.com/resource/en/application_note/dm00227538.pdf.

Regarding the second possible root cause: Where possible, Arm recommends that you use the MPU to change the attributes on any Write-Through memory to Write-Back memory. If this is not possible, it might be necessary to disable the cache for sections of code that access Write-Through memory.

Hope this brings some help.

-Amel

To give better visibility on the answered topics, please click on Accept as Solution on the reply which solved your issue or answered your question.

MNels · 2019-01-23

Hi Amel,

Thanks for the pointers. As I read back through my original post, it's definitely not as clear as it seemed when I wrote it. A summary of the problem would be:

An EEPROM read restarts without an explicit FW write to the QSPI block registers.

or

A phantom write strobe to the QSPI block registers appears to restart an EEPROM read.

That said, I don't know that the Data Corruption errata (#1259864) is a close match. We're not seeing corruption in the read data, but that an unexpected EEPROM read happens and we can't account for its origin. The EEPROM read is a side effect of something else gone wrong.

We're using the default MPU settings for the QSPI register space. It is Device memory, so it shouldn't have caching turned on. JW mentioned accesses to the QSPI memory mapped region. We're not using memory mapped mode, but I've not tried to trap wild accesses to that area to see if that's what's launching the extraneous read.

Thanks for the QSPI application note. I didn't have that one and I'll look through it. In general though we don't have any cached access to the QSPI so I don't know if it's applicable.

Thanks again

Mike Nelson

CHead · 2019-02-01

Hi Mike,

The memory mapped region is Normal memory type in the default memory map. That means even if you don’t actually access it, it could be prefetched speculatively by the CPU. Setting the memory mapped region to no access through the MPU will not generate any faults in this case, but it will prevent the speculative accesses. So bear in mind, even if you don’t see actual faults happening, making the memory mapped region no-access may actually fix the problem.
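As a rough sketch of what such a no-access region involves, the code below computes the ARMv7-M MPU_RBAR and MPU_RASR values for a region covering the Quad-SPI memory mapped window, which on STM32F7 parts is 256 MB at 0x90000000. The function names and the choice of region number are hypothetical. The code only computes the register values so it can be checked off-target; on the device they would be written to the MPU with the MPU disabled, followed by re-enabling it and the usual barriers.

```c
#include <assert.h>
#include <stdint.h>

#define QSPI_MAPPED_BASE 0x90000000u   /* Quad-SPI memory mapped window */
#define QSPI_MAPPED_SIZE 0x10000000u   /* 256 MB */

/* ARMv7-M MPU register fields. */
#define RASR_ENABLE   (1u << 0)
#define RASR_SIZE_POS 1u               /* region is 2^(SIZE+1) bytes */
#define RASR_AP_POS   24u              /* AP = 000: no access */
#define RASR_XN       (1u << 28)       /* execute never */
#define RBAR_VALID    (1u << 4)        /* take region number from RBAR */

/* Encode a power-of-two region size (>= 32 bytes) as the SIZE field. */
uint32_t mpu_size_field(uint32_t size_bytes)
{
    assert(size_bytes >= 32u && (size_bytes & (size_bytes - 1u)) == 0u);
    uint32_t log2 = 0u;
    for (uint32_t s = size_bytes; s > 1u; s >>= 1) {
        log2++;
    }
    return log2 - 1u;
}

/* RASR for an enabled no-access region. TEX, C and B are left at 0,
   which selects Strongly-ordered attributes. */
uint32_t mpu_rasr_no_access(uint32_t size_bytes)
{
    return RASR_XN
         | (0u << RASR_AP_POS)
         | (mpu_size_field(size_bytes) << RASR_SIZE_POS)
         | RASR_ENABLE;
}

/* RBAR selecting 'region' (hypothetical choice) at 'base'. */
uint32_t mpu_rbar(uint32_t base, uint32_t region)
{
    assert((base & 0x1Fu) == 0u && region < 16u);
    return base | RBAR_VALID | region;
}
```

With the ST HAL, the equivalent is presumably an `MPU_Region_InitTypeDef` using `MPU_REGION_SIZE_256MB` and `MPU_REGION_NO_ACCESS`, passed to `HAL_MPU_ConfigRegion()`. Note that the region base must also be aligned to the region size, and that on overlap the higher-numbered region takes priority.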

MNels · 2019-02-07

Hi Chead1...,

I can't begin to express enough thanks for your comments.

I've worked with ARM and ST for quite a few years and I'll have to admit your note put a new wrinkle in my forehead. When JW first mentioned this I tried blocking the memory mapped region with the MPU, but did it as a diagnostic for wild FW writes and not as a solution. Then when it didn't cause a memory fault I discarded the region settings. I should have paid better attention, but I did not know that the MPU would block a speculative fetch without causing a fault.

So this appears to solve the problem: turning off access to the QSPI memory mapped region stops the spurious QSPI reads. I've implemented this as a root cause solution and removed the other work around (enable/disable the QSPI block) we were using.

This seems like a QSPI block bug to me. If we're not configured to use memory mapped mode, I don't think memory mapped region accesses, speculative least of all, should initiate a QSPI transaction. At a minimum, it should be added to the QSPI Application Note to warn others of the side effects.

Thanks again to all for your help. I hope at some point to be able to de-wrinkle my forehead over this one. :):)

Kind regards

Mike

Amel NASRI · 2019-02-08

Hello @Community member ,

I confirm that the QSPI application note will be updated to add recommendations on how to properly configure the MPU.

-Amel


MNels · 2019-02-08

Thanks Amel!!

Mike