2024-08-30 11:12 AM - last edited on 2024-09-03 3:20 AM by Andrew Neil
I am developing a product based on an STM32H743, and I am designing the firmware for the first PCBA prototype. The product receives a 30 fps video signal from a 720i analog video decoder through DCMI to external SDRAM (every second byte is stored, on all lines, so 360x240 frames are captured). The STM32 performs an image conversion and sends the converted data from internal RAM to a second processing unit over a parallel bus. The second processing unit displays the data. The parallel bus is a 16-bit interface with a 10 MHz clock (20 MByte/s data rate with ~44% duty cycle: 44% reading, 56% idle). During operation, the device also captures a two-channel audio signal.
The ideal operation flow is the following:
The whole process must stay synchronized: if anything is left pending or a calculation takes too long, frames will be skipped and the ideal operation will not be maintained.
The issues that I am observing:
All the parallel bus lines currently have a 120 ohm series resistor.
Some additional info:
I think I have some memory bandwidth issues. There must be some wait cycles when the CPU tries to write to internal RAM. Or latency arises when DCMI/DMA1 writes to external SDRAM at the same time as TIM5 triggers a DMA2 data transfer to the GPIO port.
I am thinking about the following options:
Is there anything else to improve the acquisition? Do you see any issue with the concept of the ideal operation implementation?
2024-09-03 12:44 AM - edited 2024-09-03 12:58 AM
It seems that SCB_CleanDCache_by_Addr() solved the cache issue with the DMA buffer. However, I am a bit confused about the MPU and RAM configuration.
I am allocating a big array as:
#define FRAME_Y_BYTE_SIZE (94800UL)
__attribute__((section(".my_buffer_section"))) uint8_t gSendBuffer[4][FRAME_Y_BYTE_SIZE];
So currently I am using 379,200 bytes here, which fits in the 512 KB AXI SRAM.
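The size arithmetic above can be checked at compile time. This is a minimal sketch, assuming a C11 GCC-style toolchain. FRAME_BUFFER_COUNT, AXI_SRAM_SIZE and DCACHE_LINE_SIZE are names introduced here for illustration, and the aligned attribute is an addition that is not in the declaration above (32 bytes is the Cortex-M7 data cache line size). The section attribute from the original declaration is left out so the snippet also builds with a host compiler.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_Y_BYTE_SIZE  (94800UL)
#define FRAME_BUFFER_COUNT (4U)             /* hypothetical name for the "4" above */
#define AXI_SRAM_SIZE      (512UL * 1024UL) /* hypothetical name: 512 KB AXI SRAM */
#define DCACHE_LINE_SIZE   (32U)            /* Cortex-M7 D-cache line size, bytes */

/* Aligning the array to a cache line keeps cache maintenance on the first
 * frame from touching unrelated data in front of the buffer. On target the
 * section(".my_buffer_section") attribute would be kept as well. */
__attribute__((aligned(DCACHE_LINE_SIZE)))
uint8_t gSendBuffer[FRAME_BUFFER_COUNT][FRAME_Y_BYTE_SIZE];

/* Fail the build if the numbers stop adding up. */
_Static_assert(sizeof(gSendBuffer) == 379200UL,
               "4 x 94800 bytes expected");
_Static_assert(sizeof(gSendBuffer) <= AXI_SRAM_SIZE,
               "gSendBuffer must fit in the 512 KB AXI SRAM");
```

Note that 94800 is not a multiple of 32, so consecutive frames inside the array share one cache line at their boundary even when the array itself is aligned.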
I also modified the .ld file. I have this MEMORY specification:
/* Specify the memory areas */
MEMORY
{
  RAM_EXEC (xrw) : ORIGIN = 0x24000000, LENGTH = 512K
  DTCMRAM (xrw)  : ORIGIN = 0x20000000, LENGTH = 128K
  RAM_D2 (xrw)   : ORIGIN = 0x30000000, LENGTH = 288K
  RAM_D3 (xrw)   : ORIGIN = 0x38000000, LENGTH = 64K
  ITCMRAM (xrw)  : ORIGIN = 0x00000000, LENGTH = 64K
}
This is basically the default. What I changed is that I moved all sections to RAM_D2 (e.g. isr_vector, program code, other data, rodata, etc.), and I defined a section for gSendBuffer:
/* Define the custom section */
.my_buffer_section :
{
  . = ALIGN(4); /* Align to 4-byte boundary */
  KEEP(*(.my_buffer_section))
  . = ALIGN(4); /* Align to 4-byte boundary */
} > RAM_EXEC
So RAM_EXEC is basically used only for gSendBuffer.
Then I configured the MPU (as described in the section "MPU setting DMA RAM buffer" of the ST document "MPU usage in STM32 with ARM Cortex M7"):
void MPU_Config(void)
{
    MPU_Region_InitTypeDef MPU_InitStruct = {0};

    /* Disables the MPU */
    HAL_MPU_Disable();

    /** Initializes and configures the Region and the memory to be protected
    */
    MPU_InitStruct.Enable = MPU_REGION_ENABLE;
    MPU_InitStruct.Number = MPU_REGION_NUMBER0;
    MPU_InitStruct.BaseAddress = 0x2400000;
    MPU_InitStruct.Size = MPU_REGION_SIZE_512KB;
    MPU_InitStruct.SubRegionDisable = 0x0;
    MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL1;
    MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
    MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
    MPU_InitStruct.IsShareable = MPU_ACCESS_SHAREABLE;
    MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE;
    MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;
    HAL_MPU_ConfigRegion(&MPU_InitStruct);

    /* Enables the MPU */
    HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
}
So here the whole AXI SRAM is excluded from the cache, if I understand correctly. Given that, I should not need SCB_CleanDCache_by_Addr() to clean the cache, because gSendBuffer should be non-cached. But without the function I had issues, and with it I did not.
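For reference, cache maintenance on the Cortex-M7 works on whole 32-byte cache lines, so the range handed to SCB_CleanDCache_by_Addr() is effectively rounded outwards to line boundaries. The sketch below makes that rounding explicit. cache_range_t and cache_line_range() are hypothetical helpers, not part of CMSIS or HAL, and some CMSIS versions already do a similar rounding internally.

```c
#include <stdint.h>

#define DCACHE_LINE_SIZE 32U /* Cortex-M7 data cache line size in bytes */

/* Hypothetical helper type: an address range aligned to cache lines. */
typedef struct {
    uintptr_t start; /* rounded down to a line boundary */
    uint32_t  size;  /* rounded up to a whole number of lines */
} cache_range_t;

/* Expand [addr, addr + len) to the smallest line-aligned range covering it. */
static cache_range_t cache_line_range(uintptr_t addr, uint32_t len)
{
    const uintptr_t mask = (uintptr_t)(DCACHE_LINE_SIZE - 1U);
    cache_range_t r;
    uintptr_t end = (addr + len + mask) & ~mask;

    r.start = addr & ~mask;
    r.size  = (uint32_t)(end - r.start);
    return r;
}

/* On target, before starting the DMA for frame n (sketch only):
 *   cache_range_t r = cache_line_range((uintptr_t)gSendBuffer[n],
 *                                      FRAME_Y_BYTE_SIZE);
 *   SCB_CleanDCache_by_Addr((uint32_t *)r.start, (int32_t)r.size);
 */
```

With a buffer starting at 0x2400016c and 94800 bytes long, the helper returns a range starting at 0x24000160 that is 94816 bytes long, so the clean also touches 12 bytes in front of the buffer and 4 bytes after the frame.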
I guess I still have some configuration issue with the MPU or the loader.
In the linker map file I see this:
.my_buffer_section
0x2400016c 0x5c940 load address 0x08018960
.my_buffer_section
0x2400016c 0x5c940 ./Core/Src/global.o
0x2400016c gSendBuffer
0x2405caac . = ALIGN (0x4)
.bss 0x2405cab0 0xedac load address 0x080752a0
0x2405cab0 _sbss = .
0x2405cab0 __bss_start__ = _sbss
Do you see any potential issue with my configuration?
------------------
Update:
There is a typo in the MPU region base address: I used 0x2400000 instead of 0x24000000, so I missed a zero. Now it works as intended. However, I got worse performance on the image conversion, so I think I will disable the MPU for this and use SCB_CleanDCache_by_Addr() instead.
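A missing digit like this is hard to spot by eye, and a plain alignment check would not catch it either, because 0x2400000 is still a multiple of 512 KB. A minimal sketch of a check that would flag it, assuming it is called once at startup with the same values that go into the MPU configuration. region_covers() is a hypothetical helper, not part of HAL.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when [addr, addr + len) lies entirely inside the region
 * [region_base, region_base + region_size). 64-bit sums avoid wrap-around
 * near the top of the 32-bit address space. */
static bool region_covers(uint32_t region_base, uint32_t region_size,
                          uint32_t addr, uint32_t len)
{
    uint64_t region_end = (uint64_t)region_base + region_size;
    uint64_t end        = (uint64_t)addr + len;

    return (addr >= region_base) && (end <= region_end);
}
```

With the addresses from the map file (gSendBuffer at 0x2400016c, 0x5c940 bytes), the check fails for a region based at 0x2400000 and passes for one based at 0x24000000. Taking the base address from a named constant or a linker symbol instead of a literal would avoid the typo in the first place.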
2024-09-03 5:50 AM
It seems the issue is not related to the series resistors, although I replaced them with 10R ones. Here is the thing: I managed to increase the parallel bus speed from 10 MHz to 25 MHz, but it works only if audio capture is disabled. Without audio capture, the video stream runs smoothly. When I enable audio capture, the DMA stream for the parallel output gets corrupted somehow. Both peripherals use DMA2 (Stream0 and Stream1). The DMA streams are configured as:
TIM5_CH1 is the parallel interface clock input; it is configured with very high priority. The analog capture has low priority. In the following picture you can see "normal operation". CH9 is the interesting one: it shows the DMA2 Stream0 interrupt, and on CH5 you can see the parallel read clock. Between two DMA interrupts there is ~1.895 ms.
And here is the corrupted one:
The third transfer does not finish properly, and the last DMA interrupt occurs at ~2.95 ms instead of 1.89 ms, although the first one arrived on time. You can also see that the pulse trains on the parallel clock are "doubled" in each read cycle, because the MCU did not signal the end of the parallel communication and so the external controller started a new read cycle.
I have already checked plenty of things in the code, but I feel lost in the woods. It seems that DMA2 Stream0 and Stream1 get into trouble somehow, or there are bandwidth issues on the internal buses. If I change the parallel read speed back to 10 MHz, everything works properly.
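To put the two clock rates in perspective, the time budget per timer-triggered DMA beat can be worked out as below. This is only a rough sketch: the 200 MHz AHB clock used in the usage note is an assumption (the clock tree is not stated here), and beat_period_ns() and cycles_per_beat() are hypothetical helper names.

```c
#include <stdint.h>

/* Time available to serve one DMA request, in nanoseconds, when one
 * request is raised per parallel-bus clock edge. */
static uint32_t beat_period_ns(uint32_t bus_clock_hz)
{
    return 1000000000UL / bus_clock_hz;
}

/* Number of AHB clock cycles that fit into one parallel-bus clock period. */
static uint32_t cycles_per_beat(uint32_t ahb_hz, uint32_t bus_clock_hz)
{
    return ahb_hz / bus_clock_hz;
}
```

At 10 MHz each beat has 100 ns, or 20 AHB cycles at an assumed 200 MHz. At 25 MHz that shrinks to 40 ns, or 8 cycles. If Stream0 has to arbitrate with Stream1 and other bus masters inside that window, an occasional late or missed request would be plausible, which would be consistent with the behaviour described above.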
Any ideas on what to look for?
2025-10-10 11:09 PM
Hi @rob-bits
I'm running into a similar issue with the H743 and the FMC bus. Were you able to achieve better performance?
thanks
Matt
2025-10-11 10:25 AM
Hello Matt,
Yeah, I would say I found a compromise solution, with plenty of debugging, use of a logic analyzer, and some tricks.
The SDRAM timing is vital: the FMC interface configuration should follow the SDRAM datasheet recommendations. The cache configuration is also important, as is cache maintenance (clean/invalidate) on the data before transmitting it with DMA. The synchronization of processes/peripherals also plays a major role. E.g. if I transfer something at high speed with DMA from internal RAM to GPIO while, in the meantime, I am processing something from FMC and running another high-speed DMA for the DCMI interface, that can introduce some lag.
Basically, I have been able to implement the core functionality that I needed, with some compromises.
What issue are you facing?
Regards,
Robert
2025-10-11 11:24 AM - edited 2025-10-11 11:33 AM
Hi @rob-bits
I'm still debugging on my side, but first... my setup...
1- FMC clock is set at 100MHz (max - according to datasheet/errata)
2- SDRAM1 seems to be running (32-bit wide).
3- I have an FPGA configured as an SRAM (32-bit wide), with a 13-bit address space.
4- While debugging, I'm not accessing the SDRAM, and I removed the DMA out of the equation, and I'm just using the processor to do the transfer.
while (c) {
    *(volatile uint32_t *)0x64000010 = i; /* 32-bit write to the FPGA */
    c--;
}
5- This code maxes out at a transfer rate of 10M transfers/sec (40 MBytes/sec), regardless of the SRAM timing parameters in the FMC controller.
6- The strange thing is... if I modify the code to:
while (c) {
    *(volatile uint64_t *)0x64000010 = i; /* 64-bit write to the FPGA */
    c--;
}
then I get a transfer rate of 20M transfers/sec (80 MBytes/sec). On the scope I see a burst of 2x32 bits for every chip select, with the chip select rate still at 10 MHz.
7- I don't think the C compiler supports 128-bit data, so instead I changed the data path from 32 to 16 bits in the FMC SRAM settings and ran the same code. This time I got a transfer rate of 40M transfers/sec, but because it is a 16-bit data path that is still 80 MBytes/sec. On the scope I see a burst of 4x16 bits for every chip select, with the chip select rate still at 10 MHz.
I can't seem to get the FMC to run faster than the 10 MHz chip select rate, or faster than 80 MBytes/sec (using either a 32-bit or a 16-bit data path).
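The three measurements above fit one formula: bytes per second = chip-select rate x beats per chip select x bytes per beat. A small sketch of that arithmetic follows. fmc_throughput_bytes() is a hypothetical helper name, and the inputs are the figures read off the scope.

```c
#include <stdint.h>

/* Throughput in bytes per second for a given chip-select rate, number of
 * data beats per chip-select assertion, and data bus width in bits. */
static uint32_t fmc_throughput_bytes(uint32_t cs_rate_hz,
                                     uint32_t beats_per_cs,
                                     uint32_t bus_width_bits)
{
    return cs_rate_hz * beats_per_cs * (bus_width_bits / 8U);
}
```

With a 10 MHz chip-select rate this gives 40 MBytes/sec for one 32-bit beat, and 80 MBytes/sec for both 2x32-bit and 4x16-bit bursts. In all three cases there is one chip-select cycle per store instruction, which would be consistent with the limit being the rate at which the core issues write transactions rather than the FMC timing or the bus width.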