Performing mathematics on data in SRAM while DMA is copying data from DCMI at maximum speed.

Linas L · ‎2022-05-21

Hello, I am making a sensor that uses a camera, and I would like to perform a simple mathematical operation on the image data while it is still being read out.

My algorithm is very linear and does not need all of the data, so in theory I should be able to do that.

(In an FPGA this is extraordinarily simple to do in real time.)

My main concern is that I will be loading the AHB bus while DMA is copying data from DCMI to SRAM. If I set DMA with highest priority, as far as I understand I could get DMA overrun, since ARM core access priority is larger than DMA?

My original idea is to use the HSYNC interrupt to count lines. Once a new line has been copied, I could start doing the mathematical operations on that line in SRAM while DMA copies the next line. I also get a bit of horizontal blanking time during which DMA is idle.
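A minimal sketch of that line-counting scheme, written as plain C so the logic can be checked without hardware. All names here (`on_hsync`, `frame_start`, `process_ready_lines`, `LINE_PIXELS`, `FRAME_LINES`) and the per-line sum are placeholders, not part of any vendor library. On the real part, `on_hsync` would be called from the line interrupt handler and `frame` would be the DMA destination.

```c
#include <stdint.h>

#define LINE_PIXELS 8u   /* assumed; tiny value for illustration only */
#define FRAME_LINES 4u   /* assumed */

/* Buffer the DMA would fill; each 10-bit pixel sits in one half-word. */
static uint16_t frame[FRAME_LINES][LINE_PIXELS];

/* Lines fully written so far; incremented from the (hypothetical) line interrupt. */
static volatile uint32_t lines_done;

/* Call at the start of every frame. */
void frame_start(void)
{
    lines_done = 0;
}

/* Call from the line-complete (HSYNC) interrupt handler. */
void on_hsync(void)
{
    if (lines_done < FRAME_LINES) {
        lines_done++;
    }
}

/* Process every line that is complete but not yet consumed.
   *next_line is the caller's cursor; returns the number of lines processed. */
uint32_t process_ready_lines(uint32_t *next_line, uint32_t line_sum[FRAME_LINES])
{
    uint32_t processed = 0;
    uint32_t ready = lines_done;          /* read the volatile counter once */

    while (*next_line < ready) {
        const uint16_t *px = frame[*next_line];
        uint32_t sum = 0;
        for (uint32_t i = 0; i < LINE_PIXELS; i++) {
            sum += px[i] & 0x03FFu;       /* keep only the 10 valid bits */
        }
        line_sum[*next_line] = sum;       /* placeholder for the real math */
        (*next_line)++;
        processed++;
    }
    return processed;
}
```

The main loop only ever touches lines below `lines_done`, so it never reads a line the DMA is still writing. It does still share the SRAM bus with the DMA while it works on the previous line.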

My Cortex-M33 will be running at 160 MHz, and the camera will be running at 60 MHz.

(The theoretical maximum is 64 MHz, given the frequency ratio DCMI_PIXCLK/f_HCLK = 0.4.) I am also capturing 10-bit data, so the DMA FIFO will pack 2 pixels into a single DMA transfer and the effective transfer frequency will be at most 32 MHz. (A word is 32 bits and each 10-bit pixel takes half a word, hence the 2x packing.)
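As a rough budget using those figures: a 60 MHz pixel clock with two pixels per 32-bit word means 30 million word transfers per second, which leaves about 5.3 HCLK cycles per DMA word at 160 MHz (exactly 5 at the 64 MHz limit). The helper names below are made up for illustration, and the calculation ignores SRAM wait states, FIFO burst behaviour and blanking time, so treat it as an estimate only.

```c
#include <stdint.h>

/* Word transfers per second the DMA has to sustain while a line is active. */
uint32_t dma_words_per_second(uint32_t pixclk_hz, uint32_t pixels_per_word)
{
    return pixclk_hz / pixels_per_word;
}

/* HCLK cycles available per DMA word, scaled by 100 to avoid floating point
   (e.g. 533 means roughly 5.33 cycles). */
uint32_t hclk_cycles_per_word_x100(uint32_t hclk_hz, uint32_t pixclk_hz,
                                   uint32_t pixels_per_word)
{
    uint64_t words = dma_words_per_second(pixclk_hz, pixels_per_word);
    return (uint32_t)(((uint64_t)hclk_hz * 100u) / words);
}
```

Calling `hclk_cycles_per_word_x100(160000000u, 60000000u, 2u)` gives the scaled figure for the planned clocks.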

Any advice on how I can make this work? The hardware is still under way, so I have no way of testing how it behaves or whether I would get corrupted data.

waclawek.jan · ‎2022-05-21

Which STM32?

> If I set DMA with highest priority, as far as I understand I could get DMA overrun, since ARM core access priority is larger than DMA?

Where do you have that information from?

There is not much information about the details of arbitration between the bus masters in STM32 bus matrices, but all the information available indicates that in AHB bus matrices it's usually simple round-robin arbitration, i.e. all bus masters stand the same chance of accessing a given slave.

Maybe you want to read AN5593. You can also benchmark on available Nucleo boards, imitating the camera by generating clocks using timers.

I'm not sure you will be able to perform any reasonable computation on a continuous stream of images. As you've said, this is a task for an FPGA.

JW

Tesla DeLorean · ‎2022-05-21

I don't think it's a priority issue as much as it's a bandwidth/saturation issue.

Doing things at wire speed would seem to suggest an FPGA/CPLD is still a better solution.

Are you able to drop the frame rate, or use a FIFO memory?

Check if your STM32 has different SRAM banks with independent bus matrix plumbing; perhaps you can ping-pong between different areas using the DMA "double buffer" modes.
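A minimal sketch of the ping-pong idea, with the buffer choice reduced to line parity so it can be checked on a PC. The names `bank_a`, `bank_b`, `dma_target_for_line` and `cpu_source_during_line` are hypothetical. Placing the two buffers in separate SRAM banks would be done with linker sections (not shown), and whether those banks really have independent paths through the bus matrix has to be confirmed in the reference manual for the specific part.

```c
#include <stdint.h>

#define LINE_PIXELS 8u   /* assumed; tiny value for illustration only */

/* On a real part these would be placed in two different SRAM banks. */
static uint16_t bank_a[LINE_PIXELS];
static uint16_t bank_b[LINE_PIXELS];

/* Buffer the DMA writes into while line `line` is being captured. */
uint16_t *dma_target_for_line(uint32_t line)
{
    return (line & 1u) ? bank_b : bank_a;
}

/* Buffer the CPU may read during that time: it holds the previous line. */
const uint16_t *cpu_source_during_line(uint32_t line)
{
    return (line & 1u) ? bank_a : bank_b;
}
```

With this split the CPU and the DMA never address the same buffer during a line, which is the property that separate banks would turn into separate bus traffic.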


Linas L · ‎2022-05-21

Hello,

Thank you for the reply.

In case of concurrent accesses from the CPU and the GPDMA, the bus matrix arbitration rules the access to the SRAM1. If the last access is from the CPU, during the next access, the GPDMA wins the bus, and accesses SRAM1. After the CPU can again access SRAM1.

Based on this, it looks like if I read data from memory into registers and perform the operation without touching SRAM, the GPDMA will get priority and write its data to memory. So while the camera is pumping data into SRAM, I need to have enough NOPs to allow the GPDMA to do its job, and after the last byte from the camera I should jump into high-performance mode without any NOPs.

In this situation, the NOPs need to be tailored to the bus load so that I never get an overrun.

Is this OK, or is it stupid thinking?

Linas L · ‎2022-05-21

Hi,

There is no space for any other chip, only the processor, and the longer I keep the processor running, the more I will spend on quiescent current.

In this device, I need to ramp up the speed, enable the camera, capture a frame, do the mathematics as fast as possible, and when I am done shut everything down and drop below 1 MHz to conserve energy.

Lattice does make FPGAs that would work, with a very small energy footprint, but again, it is literally the processor (UFBGA) on one side, the camera on the other, and barely any space for power delivery.

So the best I can do is to use the horizontal blanking time for calculation, without any effect on the GPDMA.

(Unless reading the HSYNC pin or the status in the DCMI register generates traffic on the bus? Would that be just as bad as reading data from SRAM?)