2024-08-30 11:12 AM - last edited on 2024-09-03 03:20 AM by Andrew Neil
I am developing a product based on an STM32H743 and am writing the firmware for the first PCBA prototype. The product receives a 30 fps video signal from a 720i analog video decoder through DCMI to external SDRAM (every second byte is stored, on all lines, so 360x240 frames are captured). The STM32 does an image conversion and sends the converted data from internal RAM to a second processing unit over a parallel bus. The second processing unit displays the data. The parallel bus is a 16-bit interface with a 10 MHz clock (20 MB/s peak data rate with ~44% duty cycle: 44% reading, 56% idle). During operation, the device also captures two audio channels.
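As a rough cross-check of the figures above, the data rates can be sketched as below. This is only an illustration, not code from the project: the function names are made up, and the bytes-per-pixel value is an assumption because the post does not state the stored pixel format.

```c
#include <assert.h>
#include <stdint.h>

/* Peak throughput of a parallel bus in bytes per second. */
static uint64_t bus_peak_bytes_per_s(uint32_t width_bits, uint32_t clock_hz)
{
    assert(width_bits % 8u == 0u);
    return (uint64_t)(width_bits / 8u) * clock_hz;
}

/* Average throughput when the bus is active only duty_percent of the time. */
static uint64_t bus_avg_bytes_per_s(uint64_t peak_bytes_per_s, uint32_t duty_percent)
{
    assert(duty_percent <= 100u);
    return peak_bytes_per_s * duty_percent / 100u;
}

/* Raw video payload in bytes per second.
 * bytes_per_pixel is an ASSUMPTION: the post does not state the pixel format. */
static uint64_t video_bytes_per_s(uint32_t width, uint32_t height,
                                  uint32_t bytes_per_pixel, uint32_t fps)
{
    return (uint64_t)width * height * bytes_per_pixel * fps;
}
```

For the 16-bit bus at 10 MHz this gives 20,000,000 bytes/s peak and 8,800,000 bytes/s at 44% duty cycle. If 2 bytes per pixel were stored (an assumption), 360x240 at 30 fps would be 5,184,000 bytes/s written to SDRAM before any processing traffic is counted.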
The ideal operation flow is the following:
The whole process must stay synchronized: if something is pending or a calculation takes too long, frames will be skipped and the ideal operation will not be maintained.
And the issues that I am observing:
All the parallel bus lines have a 120 ohm series resistor now.
Some additional info:
I think I have some memory bandwidth issues. There must be some wait cycles when the CPU tries to write to internal RAM. Or latency is caused when DCMI/DMA1 writes to external SDRAM at the same time as TIM5 triggers a DMA2 transfer to the GPIO port.
I am thinking about the following options:
Is there anything else to improve the acquisition? Do you see any issue with the concept of the ideal operation implementation?
2024-08-30 11:40 AM
Seems to be a problem with the cache management. How do you handle it?
(You didn't write about it...)
2024-08-30 12:04 PM
Thanks for the comment. Indeed, I do nothing with cache management. In previous projects I used STM32F4xx and L4xx controllers, where cache was not a thing. In STM32CubeIDE I just clicked the "magic" button to enable the D-cache, and that's all I do.
Do you have any great docu about this?
Any advice for best approach for my use case?
What I am unsure about is how the data moves on the internal bus: when it is writing data to external SDRAM, when it is writing to internal RAM, and whether there are any collisions or wait cycles which could be optimized... It would be great to see a measurement of how this happens in my use case.
Thanks!
2024-08-30 01:59 PM - edited 2024-08-30 02:00 PM
OK, just think about it... the D-cache keeps data for the CPU, but if the data in memory is changed (by DMA), the cache still holds the old data.
You have to "tell" it to refresh the data...
So make a picture/diagram of which data is changed by DMA or anything other than the CPU, and when it is used and has to be real/new data, because DMA is sending it to ...somewhere.
For cache management you have :
- SCB_InvalidateDCache_by_Addr(..) -> discard cache lines, because they now hold old data
- SCB_CleanDCache_by_Addr(...) -> write cache to memory, because the cache/CPU has new data that must be written out to update memory
+
all addresses you use for cache management have to be aligned to the 32-byte cache line -> like this :
uint8_t inbuf[4096*8] __attribute__((aligned(32)));
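A minimal sketch of how the aligned range for those two calls could be computed. The helper names are hypothetical, and the CMSIS calls are left as comments so the snippet also builds on a PC. It assumes the 32-byte D-cache line of the Cortex-M7.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 D-cache line size in bytes */

/* Round an address down to the start of its cache line. */
static uintptr_t cache_align_down(uintptr_t addr)
{
    return addr & ~(uintptr_t)(DCACHE_LINE_SIZE - 1u);
}

/* Size in bytes of the line-aligned range that covers [addr, addr + len). */
static size_t cache_aligned_size(uintptr_t addr, size_t len)
{
    assert(len > 0u);
    uintptr_t start = cache_align_down(addr);
    uintptr_t end = (addr + len + DCACHE_LINE_SIZE - 1u)
                    & ~(uintptr_t)(DCACHE_LINE_SIZE - 1u);
    return (size_t)(end - start);
}

/* On the target, the aligned start and size would be passed to the CMSIS
 * functions named above (shown as comments only):
 *
 *   Before the CPU reads a buffer that DMA has just filled:
 *     SCB_InvalidateDCache_by_Addr((uint32_t *)start, (int32_t)size);
 *
 *   Before DMA reads a buffer that the CPU has just written:
 *     SCB_CleanDCache_by_Addr((uint32_t *)start, (int32_t)size);
 */
```

If a buffer is not aligned and not a multiple of 32 bytes long, the rounded range also covers neighbouring variables, and invalidating it can throw away CPU writes to them that were still only in the cache. That is why the buffer itself should be aligned and sized to whole cache lines.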
2024-09-02 06:43 AM - edited 2024-09-02 06:45 AM
Thanks, I will examine this in more detail.
Do you think question #3 is related? ("If I increase the clock rate of the parallel bus, from 10MHz to 20MHz the timings are “crashed”")
Based on my examination with a scope and logic analyzer, it seems the STM32 cannot keep up with handling the input clock as a trigger. Each clock should update the GPIOB port. At a clock rate of 10MHz it works, but at 20MHz it does not. And I think a 20MHz clock is not that high a speed. Of course, considering the lot of stuff running in the background, the cache issue could be related, but let me know your thoughts.
2024-09-02 06:57 AM - edited 2024-09-02 07:04 AM
Do you have any great docu about this?
I invite you to have a look at these application notes:
AN4839 Level 1 cache on STM32F7 Series and STM32H7 Series
AN4838 Introduction to memory protection unit management on STM32 MCUs
AN4891 STM32H72x, STM32H73x, and single-core STM32H74x/75x system architecture and performance
AN4861 Introduction to LCD-TFT display controller (LTDC) on STM32 MCUs Especially the sections 5.5 Graphic performance optimization and 5.6 Special recommendations for Cortex-M7 (STM32F7/H7)
2024-09-02 07:00 AM
Do you use an OS?
Make sure that DMA has enough time for bus access, so let the CPU sleep whenever possible.
Without using an OS, I made the mistake of not having a "sleep state" in my main state machine, which made the CPU constantly check some peripherals and variables, although it was absolutely not needed.
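A rough sketch of what such a "sleep state" could look like in a bare-metal main loop. All names here (the event flags, the counters, cpu_sleep) are hypothetical. On the target, cpu_sleep() would call the CMSIS intrinsic __WFI(); here it only counts calls so the snippet runs on a PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event flags; on the target they would be set from ISRs. */
#define EVT_FRAME_READY (1u << 0)
#define EVT_AUDIO_READY (1u << 1)

static volatile uint32_t pending_events;
static uint32_t frames_handled;
static uint32_t audio_handled;
static uint32_t sleep_count;

/* Host stand-in: on the target this would execute __WFI() instead. */
static void cpu_sleep(void)
{
    sleep_count++;
}

/* One pass of the main loop: handle pending work, otherwise sleep.
 * Returns true if any work was done. */
static bool main_loop_step(void)
{
    uint32_t events = pending_events;
    if (events == 0u) {
        cpu_sleep();            /* nothing to do: stop polling, free the bus */
        return false;
    }
    /* Not atomic against ISRs: on the target, mask interrupts around the
     * read-and-clear so that no event is lost. */
    pending_events &= ~events;
    if (events & EVT_FRAME_READY) {
        frames_handled++;       /* placeholder for the image conversion */
    }
    if (events & EVT_AUDIO_READY) {
        audio_handled++;        /* placeholder for the audio handling */
    }
    return true;
}
```

The point is only that the loop does nothing when no flag is set, so the CPU is not competing with the DMA streams for bus access while it waits.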
Hardware / 20 MHz:
- the 120 Ohms seem a little high, maybe the edges are not steep enough for some IO
- have you set all GPIOs to highest speed possible?
2024-09-02 07:01 AM
... but first start with some cache management!
I have no ideas about that, though...
2024-09-02 07:09 AM - edited 2024-09-02 07:12 AM
Thanks for the input. Makes a lot of sense. Indeed, I am not using an OS and I have an always-running while loop. I will try to add some sleep; it can fit there.
For the series resistor, I first used 27R and changed to 120R based on what I read on other forums. From the datasheet I saw 5pF input capacitance, so with 120R the cutoff frequency is still pretty high.
For all GPIOB pins the output speed is very high, as they are configured as digital outputs. For the clock input, I am not sure if I can set high speed for a timer input. In the GPIO config of CubeIDE I see this:
I do not know if it makes sense to change the "Maximum output speed". The signal is an input. Will check this.
And I have this timer config:
2024-09-02 07:27 AM
The speed register settings only apply to outputs. But mind that the data lines to a memory are usually bi-directional.
120R: It's not only about the RC low-pass corner frequency, this is also about edge steepness.
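To put numbers on both views, here is a small sketch that computes the corner frequency and the 10% to 90% rise time of a simple first-order RC stage. It is a simplification: it only models the series resistor and a lumped load capacitance, and ignores driver output impedance, trace capacitance, and reflections. The function names are made up for this example.

```c
#include <assert.h>

#define PI 3.14159265358979323846

/* -3 dB corner frequency of a first-order RC low-pass, in Hz. */
static double rc_cutoff_hz(double r_ohm, double c_farad)
{
    assert(r_ohm > 0.0 && c_farad > 0.0);
    return 1.0 / (2.0 * PI * r_ohm * c_farad);
}

/* 10% to 90% rise time of the same RC stage, in seconds.
 * The factor is ln(9), about 2.2. */
static double rc_rise_time_s(double r_ohm, double c_farad)
{
    assert(r_ohm > 0.0 && c_farad > 0.0);
    return 2.197 * r_ohm * c_farad;
}
```

With 120R and the 5pF from the datasheet, this gives a corner frequency of roughly 265 MHz and a rise time of about 1.3 ns. The real load is larger than the pin capacitance alone; with an assumed total of 20pF the rise time would already be around 5.3 ns, which is no longer negligible against the 25 ns half period of a 20MHz clock.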