2021-07-16 07:33 PM
Hello, I have an STM32H747-DISCO Discovery kit. I am still fairly new to STM32, so forgive my ignorance, but I was wondering what the most efficient way (in terms of CPU usage) is to perform a pixel-perfect upscale of a bitmap image (see picture below).
From the link below it seems that it might be possible to offload some of the resizing work onto the DMA2D/Chrom-ART accelerator, but I am having trouble following the source code.
https://www.electro-tech-online.com/articles/stm32f429-dma2d-bilinear-bitmap-resize.735/
Does anyone have any more examples, information or alternative solutions for this problem?
2021-07-16 08:04 PM
For context, I am creating a Game Boy emulator, and my current 3x upscale implementation, which writes to my buffer 9 times for each pixel, is a bit too slow, and that's before I've implemented any sound emulation.
2021-07-17 04:21 AM
That's a very neat trick, the one used in the linked article, but I'm not sure it will help in your particular case. It's quite computationally expensive to do the blending needed for the arbitrary-zoom-factor algorithm in software, so in comparison offloading it to the DMA2D is super efficient; if you just want to plainly replicate the pixels, I'm not sure it will make such a spectacular difference.
So, that algorithm does two things: (bi)linear color interpolation based on blending, and "stretching"/"contracting" in space. Let's just concentrate on the latter, and let's just simplify the problem to the particular 2x2 stretch.
The idea is based on the fact that the DMA2D can skip a number of pixels after it has reached the end of a line, and this skip (in ST's parlance, "offset") is set separately for the source (DMA2D_FGOR.LO; the background has its own, but we don't discuss it as we don't need it if we don't do the blending) and for the destination (DMA2D_OOR.LO). One transfer can provide a stretch in one dimension only, so we have to do this in two steps.
Let's assume we have a 3x2 image and want to stretch it 2x2 into a 6x4 one:
abc  ->  aabbcc
def      aabbcc
         ddeeff
         ddeeff
which in memory is (as a simplified example; in reality both can be part of a larger image/framebuffer, so individual lines would be spaced further apart):
abcdef -> aabbccaabbccddeeffddeeff
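For reference, the target memory layout above can be produced by a plain software loop. This is only a minimal sketch to pin down what the result should look like, not the DMA2D method; the function name `upscale2x_ref` and the one-byte-per-pixel format are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference 2x2 pixel replication in software (hypothetical helper).
 * src: w*h pixels, tightly packed; dst: (2*w)*(2*h) pixels, tightly packed.
 * One byte per pixel is assumed here to keep the sketch short. */
void upscale2x_ref(const uint8_t *src, uint8_t *dst, size_t w, size_t h)
{
    for (size_t y = 0; y < 2 * h; y++) {
        for (size_t x = 0; x < 2 * w; x++) {
            /* every destination pixel maps back to source pixel (x/2, y/2) */
            dst[y * 2 * w + x] = src[(y / 2) * w + (x / 2)];
        }
    }
}
```

Called with the 6-byte source "abcdef", w = 3 and h = 2, this fills a 24-byte destination with the layout shown above.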
Let's start with the vertical stretch (into an intermediate buffer). We'll transfer 2 lines, line length of 3, source skip 0 and destination skip 3. After one run:
abc  ->  abc
def      ...
         def
         ...
and after the second run, where the starting destination address is incremented by 3:
abc  ->  abc
def      abc
         def
         def
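The two runs can be tried out on a PC with a small software model of the transfer. This is a sketch under assumptions, not HAL or register code: `dma2d_model_copy` is a hypothetical stand-in for one memory-to-memory DMA2D transfer, pixels are one byte each, and the skip parameters play the role of DMA2D_FGOR.LO and DMA2D_OOR.LO (pixels skipped after the end of each line).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one DMA2D memory-to-memory transfer.
 * lines, line_len: size of the transferred rectangle, in pixels.
 * src_skip, dst_skip: pixels skipped after each line (the "offset"). */
void dma2d_model_copy(const uint8_t *src, uint8_t *dst,
                      size_t lines, size_t line_len,
                      size_t src_skip, size_t dst_skip)
{
    for (size_t y = 0; y < lines; y++) {
        for (size_t x = 0; x < line_len; x++) {
            dst[y * (line_len + dst_skip) + x] =
                src[y * (line_len + src_skip) + x];
        }
    }
}

/* Vertical 2x stretch of a tightly packed w*h image into a w*(2*h) buffer. */
void stretch_vertical_2x(const uint8_t *src, uint8_t *dst, size_t w, size_t h)
{
    /* run 1: fills destination lines 0, 2, 4, ... */
    dma2d_model_copy(src, dst, h, w, 0, w);
    /* run 2: same settings, destination start advanced by one line */
    dma2d_model_copy(src, dst + w, h, w, 0, w);
}
```

With the 3x2 example (w = 3, h = 2) the destination ends up as "abcabcdefdef", matching the diagram.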
The horizontal stretch is somewhat trickier. We will proceed column by column: one DMA2D transfer handles just one column, so for the whole image we need to repeat it as many times as there are columns in the destination.
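So let's start with the first column, i.e. we set 4 lines, line length 1, source skip 2 and destination skip 5 (the skip is counted from the end of the 1-pixel line, so it is the row width, 3 and 6 respectively, minus 1):
abc  ->  a.....
abc      a.....
def      d.....
def      d.....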
Repeat with the same settings and the same source starting address, but with the destination starting address incremented by 1:
abc  ->  aa....
abc      aa....
def      dd....
def      dd....
Now increment both the source and the destination starting address by 1:
abc  ->  aab...
abc      aab...
def      dde...
def      dde...
etc. rinse and repeat.
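The column-by-column loop can be sketched with the same kind of software model. Again this is an assumption-laden illustration rather than register code: `dma2d_model_copy` is a hypothetical stand-in for one transfer, pixels are one byte each, and the skip is counted in pixels from the end of one transferred line to the start of the next, so for a 1-pixel line it comes out as the row width minus 1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one DMA2D memory-to-memory transfer;
 * src_skip/dst_skip are the pixels skipped after each transferred line. */
void dma2d_model_copy(const uint8_t *src, uint8_t *dst,
                      size_t lines, size_t line_len,
                      size_t src_skip, size_t dst_skip)
{
    for (size_t y = 0; y < lines; y++) {
        for (size_t x = 0; x < line_len; x++) {
            dst[y * (line_len + dst_skip) + x] =
                src[y * (line_len + src_skip) + x];
        }
    }
}

/* Horizontal 2x stretch of a tightly packed w*h image into a (2*w)*h buffer. */
void stretch_horizontal_2x(const uint8_t *src, uint8_t *dst, size_t w, size_t h)
{
    /* one transfer per destination column; each source column is used twice */
    for (size_t dx = 0; dx < 2 * w; dx++) {
        dma2d_model_copy(src + dx / 2, dst + dx,
                         h,           /* number of lines               */
                         1,           /* line length: a single pixel   */
                         w - 1,       /* source row width minus 1      */
                         2 * w - 1);  /* destination row width minus 1 */
    }
}
```

Fed with the 3x4 intermediate image "abcabcdefdef" (w = 3, h = 4), this produces "aabbccaabbccddeeffddeeff" after six column transfers.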
As the horizontal stretch is less efficient (it needs more DMA2D transfers, and the individual transfers are also less efficient because the FIFO won't be utilized), in reality you want to perform it first, while the image still has only its original number of lines.
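A quick count shows why the order matters. For a 2x2 stretch the number of column transfers is the same either way (one per destination column), but each of them is twice as tall if the vertical stretch has already been done. The helper names below are hypothetical and the count only covers pixels moved by the 1-pixel-wide column transfers; it says nothing about actual timing on the hardware.

```c
#include <stddef.h>

/* Pixels moved by 1-pixel-wide column transfers for a 2x2 stretch of a
 * w*h image, depending on which stretch is done first (illustrative only). */

/* horizontal first: 2*w destination columns, each h lines tall */
size_t column_pixels_horizontal_first(size_t w, size_t h)
{
    return (2 * w) * h;
}

/* vertical first: 2*w destination columns, each 2*h lines tall */
size_t column_pixels_vertical_first(size_t w, size_t h)
{
    return (2 * w) * (2 * h);
}
```

For a 160x144 Game Boy frame this gives 46080 pixels with the horizontal stretch first versus 92160 with the vertical stretch first.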
JW