2026-03-02 2:44 AM
Hi everyone,
I’m currently working on a camera preview pipeline on the STM32MP257F-DK using the Sony IMX335 (MIPI CSI-2) sensor.
One important point is that STM32MP2 uses a media-controller-based camera architecture (DCMIPP + CSI subdevices). Because of this, Qt Multimedia cannot directly detect the MIPI camera as a standard /dev/videoX capture device.
For that reason, I am using libcamera as the capture backend. It correctly handles the media graph configuration internally and exposes a usable video stream to GStreamer, which makes camera streaming stable and reliable.
The functional pipeline is working. However, the goal is to display the live feed inside a Qt6 QML application with proper GPU acceleration, and this is where I am facing a major performance bottleneck: CPU usage is very high during preview, while the Vivante GPU remains mostly idle.
I’d really appreciate guidance from anyone who has implemented a zero-copy GPU camera pipeline on STM32MP2.
Board: STM32MP257F-DK
SoC: STM32MP257 (Cortex-A35 + Vivante GC7000L GPU)
Camera: Sony IMX335 (5MP, MIPI CSI-2)
ISP: DCMIPP
OS: OpenSTLinux 6.0 (Scarthgap)
Kernel: 6.6.48
Qt: 6.6.3 (QML / QtMultimedia)
GStreamer: 1.22.12
libcamera: 0.3.0
Display: Wayland + EGL
My goal is simple:
Show live camera preview in Qt6 QML with the GPU doing the heavy work — not the CPU.
Ideally:
No CPU pixel format conversion
No memcpy per frame
No CPU → GPU texture upload copies
DMA-BUF zero-copy from camera to GPU
Using libcamera, the following pipeline works:
libcamerasrc
→ video/x-raw,format=RGB16,width=1280,height=720,framerate=25/1
→ videoconvert
→ video/x-raw,format=BGRA
→ appsink
→ QVideoSink
→ QML VideoOutput
Preview is stable at 25 fps, 1280x720.
So functionally everything is correct.
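Simplified, the bridge between appsink and QVideoSink looks like the sketch below (error handling omitted; the per-frame map and copy in the callback is part of the CPU cost discussed next):

#include <gst/gst.h>
#include <gst/app/gstappsink.h>
#include <QVideoSink>
#include <QVideoFrame>
#include <cstring>

// Called on a GStreamer streaming thread for every frame.
static GstFlowReturn onNewSample(GstAppSink *sink, gpointer data)
{
    auto *videoSink = static_cast<QVideoSink *>(data);

    GstSample *sample = gst_app_sink_pull_sample(sink);
    GstBuffer *buffer = gst_sample_get_buffer(sample);

    GstMapInfo map;
    gst_buffer_map(buffer, &map, GST_MAP_READ);

    // CPU-side copy into a QVideoFrame (one of the copies we want to remove).
    QVideoFrameFormat fmt(QSize(1280, 720), QVideoFrameFormat::Format_BGRA8888);
    QVideoFrame frame(fmt);
    frame.map(QVideoFrame::WriteOnly);
    memcpy(frame.bits(0), map.data, qMin<qsizetype>(map.size, frame.mappedBytes(0)));
    frame.unmap();

    gst_buffer_unmap(buffer, &map);
    gst_sample_unref(sample);

    videoSink->setVideoFrame(frame);   // picked up by the QML VideoOutput
    return GST_FLOW_OK;
}

// Setup (videoSink comes from VideoOutput.videoSink in QML):
GstElement *pipeline = gst_parse_launch(
    "libcamerasrc ! video/x-raw,format=RGB16,width=1280,height=720,framerate=25/1 "
    "! videoconvert ! video/x-raw,format=BGRA ! appsink name=sink", nullptr);
GstAppSinkCallbacks cbs = {};
cbs.new_sample = onNewSample;
gst_app_sink_set_callbacks(GST_APP_SINK(gst_bin_get_by_name(GST_BIN(pipeline), "sink")),
                           &cbs, videoSink, nullptr);
gst_element_set_state(pipeline, GST_STATE_PLAYING);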
CPU usage is between 60% and 75%, just for preview.
At the same time:
GPU usage is around 5%
GPU is almost idle
This clearly means the camera path is CPU-bound.
After profiling, I see:
RGB16 → BGRA conversion (videoconvert) consumes significant CPU
In appsink, frames are copied multiple times
Qt uploads texture from CPU to GPU
Around 250+ MB/s of memory bandwidth is being used for nothing but copying pixels
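To put rough numbers on that: one 1280 × 720 BGRA frame is 1280 × 720 × 4 ≈ 3.7 MB, so at 25 fps a single full-frame copy already costs about 92 MB/s. videoconvert reads ~46 MB/s of RGB16 and writes ~92 MB/s of BGRA, and the appsink copy and the Qt texture upload each add roughly another 92 MB/s, which is how the total climbs past 250 MB/s.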
So even though we have a GPU (Vivante GC7000L), almost the entire pipeline is CPU-based.
The hardware clearly supports:
DCMIPP ISP
DMA
Vivante GPU with EGL
Wayland + OpenGL ES
Architecturally, this should be possible:
IMX335
→ DCMIPP ISP
→ libcamerasrc (DMA-BUF)
→ glupload (EGL import)
→ glcolorconvert (shader)
→ qmlglsink
→ QML scene graph
This would keep frames on the GPU from capture to display.
But currently I cannot reach this architecture.
The correct solution seems to be:
libcamerasrc → glupload → glcolorconvert → qmlglsink
However:
gst-inspect-1.0 qmlglsink
→ No such element
It seems the GStreamer Qt QML video sink (qmlglsink for Qt5, qml6glsink for Qt6, both built from gst-plugins-good) is not packaged in OpenSTLinux 6.0.
Is there an official ST package or Yocto recipe for this?
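For context, if the element were available, wiring it into QML would look roughly like this (based on the upstream qmlglsink example; "videoItem" and the QQmlApplicationEngine named engine are just example names):

// Hypothetical setup, assuming qmlglsink is packaged.
// QML side: import org.freedesktop.gstreamer.GLVideoItem 1.0
//           GstGLVideoItem { objectName: "videoItem" }
#include <gst/gst.h>
#include <QQmlApplicationEngine>
#include <QQuickItem>

GstElement *pipeline = gst_parse_launch(
    "libcamerasrc ! glupload ! glcolorconvert ! qmlglsink name=sink", nullptr);
GstElement *sink = gst_bin_get_by_name(GST_BIN(pipeline), "sink");

QQuickItem *videoItem =
    engine.rootObjects().first()->findChild<QQuickItem *>("videoItem");
g_object_set(sink, "widget", videoItem, nullptr);   // hand the QML item to the sink

gst_element_set_state(pipeline, GST_STATE_PLAYING);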
There is no /dev/dma_heap/ on the target.
The following kernel config options appear to be missing:
CONFIG_DMABUF_HEAPS
CONFIG_DMABUF_HEAPS_SYSTEM
CONFIG_DMABUF_HEAPS_CMA
Without DMA-BUF heap support, true zero-copy EGL import may not be possible.
Is this intentionally disabled in STM32MP2 BSP?
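For reference, once those options are enabled, a buffer can be requested from a heap with a single ioctl. A minimal sketch, assuming a "system" heap exists (heap names vary per platform; CMA heaps usually appear under their reserved-memory node name):

// Sketch: allocating a DMA-BUF from a heap, assuming CONFIG_DMABUF_HEAPS*
// are enabled. The heap name ("system" here) is platform-dependent.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

int allocDmaBuf(size_t len)
{
    int heap = open("/dev/dma_heap/system", O_RDWR | O_CLOEXEC);
    if (heap < 0)
        return -1;   // heap not present: kernel support is missing

    struct dma_heap_allocation_data alloc = {};
    alloc.len = len;
    alloc.fd_flags = O_RDWR | O_CLOEXEC;

    int ret = ioctl(heap, DMA_HEAP_IOCTL_ALLOC, &alloc);
    close(heap);
    return ret < 0 ? -1 : alloc.fd;   // alloc.fd is the DMA-BUF fd
}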
libcamera outputs RGB16 (RGB565) efficiently.
But Qt6 QVideoFrameFormat does not support RGB565 anymore.
So I am forced to convert to 32-bit (BGRA/RGBx) before sending to QVideoSink.
That conversion alone costs a lot of CPU.
Is there a recommended Qt6-based approach on STM32MP2 to avoid this conversion?
Has anyone successfully implemented:
libcamera
Qt6 QML
GPU-accelerated preview
Zero-copy DMA-BUF path
on STM32MP257 or STM32MP2 family?
2026-03-03 10:31 PM
With the NEON accelerator enabled, CPU usage should drop by 10 to 15%.
2026-03-10 11:47 PM
Waiting for a reply...
2026-04-15 7:17 AM
Hi Dhanakrishna,
Did you finally find a real solution for this issue?
I am very interested because I am facing almost the same problem on STM32MP2: the camera preview works, but the CPU still does most of the work while the Vivante GPU stays mostly idle.
Were you able to make the pipeline GPU-accelerated, especially with DMA-BUF zero-copy, EGL import, glupload/glcolorconvert, or qmlglsink?
In other words, did you manage to reach something close to:
libcamerasrc -> glupload -> glcolorconvert -> qmlglsink -> QML
Or did you find another working architecture to reduce CPU load?
Any update would be very useful.
Thanks.
2026-04-16 3:37 AM
Hi Yassine_behilil,
Yes — I now have a working solution on STM32MP2 with libcamera and OpenGL.
Instead of going through GStreamer, I implemented a direct pipeline using DMA-BUF and EGL. The current flow is:
libcamera → DMA-BUF (FD) → EGLImage (EGL_LINUX_DMA_BUF_EXT) → GL texture (GL_TEXTURE_EXTERNAL_OES) → Qt/OpenGL rendering
In this approach:
Frames are never copied to CPU (zero-copy path)
I use libcamera requestCompleted() to get the DMA FD
The FD is passed to the UI thread via a callback
EGLImage is created once per FD and reused (buffer pool)
glEGLImageTargetTexture2DOES is used to bind it to a texture
This significantly reduces CPU usage.
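Simplified, the import step looks like this (error checking omitted; width, height and stride come from the libcamera stream configuration, and fd is the plane's DMA-BUF fd):

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>
#include <drm_fourcc.h>

// Wraps a camera DMA-BUF in an EGLImage and binds it to an external texture.
// Done once per fd; the texture is then reused for the buffer's lifetime.
GLuint importDmaBuf(EGLDisplay dpy, int fd, int width, int height, int stride)
{
    auto createImage = (PFNEGLCREATEIMAGEKHRPROC)
        eglGetProcAddress("eglCreateImageKHR");
    auto imageTargetTexture = (PFNGLEGLIMAGETARGETTEXTURE2DOESPROC)
        eglGetProcAddress("glEGLImageTargetTexture2DOES");

    const EGLint attrs[] = {
        EGL_WIDTH,                     width,
        EGL_HEIGHT,                    height,
        EGL_LINUX_DRM_FOURCC_EXT,      DRM_FORMAT_RGB565,
        EGL_DMA_BUF_PLANE0_FD_EXT,     fd,
        EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
        EGL_DMA_BUF_PLANE0_PITCH_EXT,  stride,
        EGL_NONE
    };

    // EGL_NO_CONTEXT: the image wraps the DMA-BUF itself, no GL source needed.
    EGLImageKHR image = createImage(dpy, EGL_NO_CONTEXT,
                                    EGL_LINUX_DMA_BUF_EXT, nullptr, attrs);

    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_EXTERNAL_OES, tex);
    glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    imageTargetTexture(GL_TEXTURE_EXTERNAL_OES, image);   // zero-copy bind
    return tex;
}

The fragment shader samples it through a samplerExternalOES (GL_OES_EGL_image_external), so no CPU-side format conversion is needed.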
However, there are some important constraints:
Format must be GPU/EGL compatible (only RGB565 works)
Proper thread handling is required (UI thread must do EGL/GL work)
EGLImage creation must be cached per FD to avoid overhead.
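The thread handoff and the per-FD cache look roughly like this (simplified; CameraWorker, Renderer and importDmaBuf() from the previous sketch are my own names):

#include <libcamera/libcamera.h>
#include <QMetaObject>

// Runs on libcamera's internal thread when a capture request finishes.
void CameraWorker::onRequestCompleted(libcamera::Request *request)
{
    for (const auto &[stream, buffer] : request->buffers()) {
        int fd = buffer->planes()[0].fd.get();   // DMA-BUF fd of plane 0

        // Queued invocation: presentFrame() runs on the UI thread,
        // which owns the EGL context.
        QMetaObject::invokeMethod(m_renderer, [r = m_renderer, fd] {
            r->presentFrame(fd);
        }, Qt::QueuedConnection);
    }
    // The request is recycled and re-queued once the frame is consumed.
}

// UI thread: create the EGLImage/texture once per fd, then reuse it.
void Renderer::presentFrame(int fd)
{
    auto it = m_textures.find(fd);               // std::unordered_map<int, GLuint>
    if (it == m_textures.end())
        it = m_textures.emplace(fd,
                 importDmaBuf(m_display, fd, m_width, m_height, m_stride)).first;
    m_currentTexture = it->second;               // drawn on the next paint pass
    update();
}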