STM32H743: strange interactions LTDC, USB, SDRAM, optimisation

PMath.4 · ‎2020-05-21

I've programmed a "retro" computer using the H743.

https://www.thebackshed.com/forum/ViewTopic.php?FID=16&TID=12105

Development environment is CubeIDE.

The code supports a USB keyboard and VGA output using the LTDC and an RGB565 R2R DAC. It can play MP3 and FLAC audio over the on-chip DACs and runs a Basic interpreter. Until recently I have always compiled with O2 set.

I have had a longstanding issue with very minor instability of the LTDC clocking, which could cause flicker and, in extreme cases, temporary loss of sync in the monitor. This was almost completely solved by using a crystal oscillator rather than a crystal to clock the processor. Loss of sync was most pronounced when using ARGB4444 at 800x600 with the framebuffer in SDRAM and a 60Hz frame rate, while running an FPU-intensive Basic program and decoding MP3 with output driven by 44100 timer interrupts per second (yes, I know I should be using DMA).
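As a rough back-of-envelope sketch of the load described above, the figures given work out to about 57.6 MB/s of framebuffer reads and one audio interrupt roughly every 22.7 microseconds. The function names are hypothetical, and blanking intervals, SDRAM refresh and bus arbitration are ignored:

```c
#include <stdint.h>

/* Bytes the display controller must fetch per second for the visible
   area only (blanking and any second layer are ignored). */
uint32_t framebuffer_bytes_per_second(uint32_t width, uint32_t height,
                                      uint32_t bytes_per_pixel,
                                      uint32_t frames_per_second)
{
    return width * height * bytes_per_pixel * frames_per_second;
}

/* Time between interrupts, in whole nanoseconds (rounded down). */
uint32_t interrupt_period_ns(uint32_t interrupts_per_second)
{
    return 1000000000u / interrupts_per_second;
}
```

For 800x600 ARGB4444 (2 bytes per pixel) at 60Hz, the first function gives 57,600,000 bytes per second. For 44100 interrupts per second, the second gives 22675 ns.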

I've also been having issues with cache coherency in some code when pointers to pointers were used to update SDRAM. Whilst playing with solutions to the latter problem I switched to optimisation "Ofast". This didn't affect my caching issue, but the loss of sync disappeared. Unfortunately the interpreted Basic programs ran about 5% slower with the "fast" optimisation. I can live with that, so I decided to stick with Ofast.

However, a side effect was that enumeration of the USB keyboard became much less robust. Enumeration on power up stopped working and enumeration on insertion only worked occasionally. Once enumerated, the keyboard worked correctly.

I then started playing with optimisation changes at a source code level and found that by compiling stm32h7xx_ll_usb.c with "O2" USB functionality returned to normal.
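For anyone wanting to pin one translation unit at a different level from inside the source rather than through the IDE's per-file settings, GCC offers an `optimize` pragma. The following is a minimal sketch, not the poster's actual change: `sum_bytes` is a hypothetical stand-in for the functions in the file. GCC's documentation cautions that these overrides are intended mainly for debugging, so a per-file compiler flag in the build settings is usually the more robust route.

```c
/* GCC-only: compile everything between push_options and pop_options at -O2,
   regardless of the optimisation level given on the command line. */
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize ("O2")
#endif

/* Hypothetical stand-in for a function in the file being pinned at -O2. */
unsigned sum_bytes(const unsigned char *data, unsigned len)
{
    unsigned total = 0;
    for (unsigned i = 0; i < len; ++i) {
        total += data[i];
    }
    return total;
}

#if defined(__GNUC__) && !defined(__clang__)
/* Restore the command-line optimisation level for the rest of the file. */
#pragma GCC pop_options
#endif
```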

All this takes a huge amount of time and I'm certainly not going to wade through a disassembler listing to understand the differences. However, my learnings are as follows:

SDRAM and/or LTDC works better when compilation is with Ofast

USB must have stm32h7xx_ll_usb.c compiled with "O2"

A crystal oscillator provides a better clock source than a crystal

waclawek.jan · ‎2020-05-22

> I don't have experience with the host USB code.

Lucky you! ;)

JW

PMath.4 · ‎2020-05-22

"Cache coherency is usually an issue only when DMA is involved."

It certainly should be, but I am absolutely convinced there is an issue in normal code when layers of indirection are used and the RAM is managed by the FMC. This feels like a silicon bug, but I haven't got the time or energy to try to reproduce it in a simple enough form to post as a formal erratum.

It often seems to be associated with longjmp.

I've also found that in many places in the code I have had to declare variables volatile when there is absolutely no reason why this should be required. The latter is probably a compiler bug rather than silicon, but it all makes development more tedious.

waclawek.jan · ‎2020-05-22

> longjmp

Do you absolutely and perfectly understand the implications of setjmp/longjmp?

If not, get rid of it. If yes, you are probably wrong.

JW
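One documented interaction between `volatile` and `longjmp` is worth spelling out: the C standard leaves the value of a non-volatile automatic variable indeterminate after `longjmp` if it was modified between the `setjmp` call and the `longjmp`. A minimal sketch of that rule follows. The names are hypothetical, and this is not claimed to be the cause of the problem discussed in this thread.

```c
#include <setjmp.h>

static jmp_buf env;

/* Unwinds back to the setjmp() call in count_attempts(). */
static void fail(void)
{
    longjmp(env, 1);
}

int count_attempts(void)
{
    /* volatile is required here: the variable is a local that is modified
       after setjmp() and read after longjmp(). Without volatile its value
       at the return statement would be indeterminate. */
    volatile int attempts = 0;

    if (setjmp(env) == 0) {
        attempts = 1;
        fail();            /* does not return */
    }
    return attempts;       /* reached only via longjmp() */
}
```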

PMath.4 · ‎2020-05-22

I do understand setjmp/longjmp. As I say, I haven't the time or energy to try and isolate the issue, but it goes something like this:

global pointer p points to an area in SDRAM

write to *p (typically this would be something like an error message when the Basic interpreter hits a syntax problem)

longjmp back to start of program, just after H/W initialisation

print *p

WTF - that wasn't what I stored

to fix: clean the D-cache before the longjmp
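The sequence above can be sketched roughly as follows. This is a host-compilable illustration, not the poster's code: `message_area` stands in for the SDRAM region, `clean_dcache()` is a hypothetical stub marking where a Cortex-M7 build would call the CMSIS routine `SCB_CleanDCache()`, and all other names are invented. By the C standard, objects with static storage duration keep their values across `longjmp`, so on a host the stored text is read back unchanged.

```c
#include <setjmp.h>
#include <string.h>

static jmp_buf restart_point;
static char message_area[64];        /* stand-in for a region in SDRAM */
static char *p = message_area;       /* global pointer into that region */

/* Hypothetical stub. On the target this is where SCB_CleanDCache()
   would be called (an assumption about what "cleanDcache" refers to). */
static void clean_dcache(void)
{
}

/* Store an error message through p, then jump back to the restart point. */
static void report_error(const char *msg)
{
    strncpy(p, msg, sizeof message_area - 1);
    p[sizeof message_area - 1] = '\0';
    clean_dcache();                  /* the workaround described above */
    longjmp(restart_point, 1);
}

/* Returns the text found through p after the longjmp has been taken. */
const char *run_once(const char *msg)
{
    if (setjmp(restart_point) == 0) {
        report_error(msg);           /* does not return */
    }
    return p;
}
```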

berendi · ‎2020-05-22

This has nothing to do with the cache. Try a memory barrier:

asm volatile("":::"memory");

instead of cleaning the cache. Cache management functions incidentally act as such a barrier as well, but the barrier alone should fix your issue without messing with the cache.

Or just declare the pointer as volatile.
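A minimal sketch of the two suggestions, assuming GCC-style inline assembly; the variable and function names are hypothetical. The empty `asm` with a "memory" clobber emits no instruction. It only stops the compiler from keeping memory values in registers or moving memory accesses across that point. Note that "declare the pointer as volatile" can mean two different things in C, both shown.

```c
/* Compiler-only barrier: no instruction is generated. */
#define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

static char message_area[32];

/* Ordinary pointer: the compiler may cache or reorder accesses through it. */
char *plain_ptr = message_area;

/* Pointer to volatile data: every access through it is performed as written. */
volatile char *ptr_to_volatile = message_area;

/* Volatile pointer: the pointer value itself is re-read on every use. */
char *volatile volatile_ptr = message_area;

/* Write through the plain pointer, force the store to be issued before
   anything that follows, then read the byte back from memory. */
char write_then_read(char c)
{
    plain_ptr[0] = c;
    COMPILER_BARRIER();
    return plain_ptr[0];
}
```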

PMath.4 · ‎2020-05-22

Sorry, I don't agree. First, p isn't volatile and nor is the RAM it points to, and the value of the pointer is correct when checked after the jmp. The issue is that either the RAM should have been written and hasn't, or the cache should give me the data. If I use precisely the same code with p pointing to internal RAM there is no issue.

berendi · ‎2020-05-22

Does it work with asm volatile("":::"memory"); instead of CleanDCache or not?

> If I use precisely the same code with p pointing to internal RAM there is no issue.

Which internal ram? AXI, AHB or DTCM? Does it work in all three?

PMath.4 · ‎2020-05-22

The problem isn't sufficiently reproducible to answer that, any little change can cause the problem to appear or go away. In any case where would I put the barrier - just before the longjmp? Surely even GCC wouldn't change instruction order around that statement.

The point of the post was not to look for solutions but to indicate the level of interconnectedness between various aspects of hardware, optimisation and code in a large and reasonably complex program in order to warn others who may be experiencing similar.

I understand that many of the experts here have a complete aversion to HAL and CubeMX but for fast program generation they are essential tools. I have no idea why video is more stable with Ofast than O2 but the effect is solid and completely reproducible. Likewise USB host enumeration is crap with Ofast but solid with O2. All this just makes development unnecessarily hard work. I don't know if ST employees ever read these posts but these sorts of things devalue the product.

berendi · ‎2020-05-22

> The problem isn't sufficiently reproducible to answer that, any little change can cause the problem to appear or go away.

Yet you are insisting that calling CleanDCache before longjmp, or moving a buffer around in memory has fixed the problem for good. I can't follow your reasoning.

> to warn others who may be experiencing similar.

Fine. Then the answers are there to warn those people about jumping to false conclusions.

> HAL and CubeMX but for fast program generation they are essential tools

Yes, they can generate lots of code fast, full of race conditions and other bugs. Getting production-quality code out of the mess can take longer than writing it from scratch using proper coding practices.

> I have no idea why video is more stable with Ofast than O2 but the effect is solid and completely reproducible.

But I do: there are some race conditions in the code, code constructs that don't conform to the C standard, or improperly set RCC and FMC registers.

> All this just makes development unnecessarily hard work.

Agreed. I too would prefer stable and well documented libraries that would pass actual code review and test so that I could concentrate on our own software and hardware bugs.

> I don't know if ST employees ever read these posts but these sorts of things devalue the product.

They sometimes do, but they could not do anything about a post stating that a complex piece of unknown code (not showing a single line) on third party hardware does not work as expected.

waclawek.jan · ‎2020-05-22

> I understand that many of the experts here have a complete aversion to HAL and CubeMX but for fast program generation they are essential tools.

Yes of course they are, but just because there's nothing else.

If there were clearly written and well documented examples, together with similarly clear and concise documentation for the hardware, you'd be able to develop both rapidly and with understanding. ST, like other MCU makers these days, fails to see this, or maybe they even want to lock you into their tools.

At the end of the day, it's your product and your responsibility.

JW