MIPI Camera GPU Acceleration on STM32MP257F-DK – High CPU Load, GPU Mostly Idle

Dhanakrishna_Chaitanya
Associate II

 

Hi everyone,

I’m currently working on a camera preview pipeline on the STM32MP257F-DK using the Sony IMX335 (MIPI CSI-2) sensor.

One important point is that STM32MP2 uses a media-controller-based camera architecture (DCMIPP + CSI subdevices). Because of this, Qt Multimedia cannot directly detect the MIPI camera as a standard /dev/videoX capture device. 

For that reason, I am using libcamera as the capture backend. It correctly handles the media graph configuration internally and exposes a usable video stream to GStreamer, which makes camera streaming stable and reliable.

The functional pipeline is working. However, the goal is to display the live feed inside a Qt6 QML application with proper GPU acceleration, and this is where I am facing a major performance bottleneck: CPU usage is very high during preview, while the Vivante GPU remains mostly idle.

I’d really appreciate guidance from anyone who has implemented a zero-copy GPU camera pipeline on STM32MP2.


Platform Overview

  • Board: STM32MP257F-DK

  • SoC: STM32MP257 (Cortex-A35 + Vivante GC7000L GPU)

  • Camera: Sony IMX335 (5MP, MIPI CSI-2)

  • ISP: DCMIPP

  • OS: OpenSTLinux 6.0 (Scarthgap)

  • Kernel: 6.6.48

  • Qt: 6.6.3 (QML / QtMultimedia)

  • GStreamer: 1.22.12

  • libcamera: 0.3.0

  • Display: Wayland + EGL


What I’m Trying to Achieve

My goal is simple:

Show live camera preview in Qt6 QML with the GPU doing the heavy work — not the CPU.

Ideally:

  • No CPU pixel format conversion

  • No memcpy per frame

  • No CPU → GPU texture upload copies

  • DMA-BUF zero-copy from camera to GPU


What Is Currently Working

Using libcamera, the following pipeline works:

libcamerasrc
→ video/x-raw,format=RGB16,width=1280,height=720,framerate=25/1
→ videoconvert
→ video/x-raw,format=BGRA
→ appsink
→ QVideoSink
→ QML VideoOutput

Preview is stable at 25 fps, 1280x720.

So functionally everything is correct.


The Real Problem

CPU usage is between 60% and 75%, just for preview.

At the same time:

  • GPU usage is around 5%

  • GPU is almost idle

This clearly means the camera path is CPU-bound.

After profiling, I see:

  • RGB16 → BGRA conversion (videoconvert) consumes significant CPU

  • In appsink, frames are copied multiple times

  • Qt uploads texture from CPU to GPU

  • Around 250+ MB/s of memory bandwidth is being used for nothing but copying pixels

So even though we have a GPU (Vivante GC7000L), almost the entire pipeline is CPU-based.


Why This Feels Wrong

The hardware clearly supports:

  • DCMIPP ISP

  • DMA

  • Vivante GPU with EGL

  • Wayland + OpenGL ES

Architecturally, this should be possible:

IMX335
→ DCMIPP ISP
→ libcamerasrc (DMA-BUF)
→ glupload (EGL import)
→ glcolorconvert (shader)
→ qmlglsink
→ QML scene graph

This would keep frames on the GPU from capture to display.

But currently I cannot reach this architecture.


Main Blockers

1) qmlglsink Not Available

The correct solution seems to be:

libcamerasrc → glupload → glcolorconvert → qmlglsink

However:

gst-inspect-1.0 qmlglsink
→ No such element

It seems the Qt6 GStreamer QML plugin (the element is named qml6glsink, and it ships with gst-plugins-good since GStreamer 1.22; the Qt5 qmlglsink also lives in -good) is not packaged in OpenSTLinux 6.0.

Is there an official ST package or Yocto recipe for this?
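If the BSP layers expose the corresponding build option, something along these lines in local.conf might pull the plugin in. This is only a sketch: the PACKAGECONFIG flag and package names below are assumptions to verify against the gstreamer1.0-plugins-good recipe in your meta layers.

```
# Hypothetical local.conf fragment -- check the actual PACKAGECONFIG
# flag and output package names in your gstreamer1.0-plugins-good recipe.
PACKAGECONFIG:append:pn-gstreamer1.0-plugins-good = " qmlgl"
IMAGE_INSTALL:append = " gstreamer1.0-plugins-good-qmlgl"
```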


2) DMA-BUF Heap Not Enabled

There is no:

/dev/dma_heap/

The relevant kernel config options appear to be missing:

  • CONFIG_DMABUF_HEAPS

  • CONFIG_DMABUF_HEAPS_SYSTEM

  • CONFIG_DMABUF_HEAPS_CMA

Without DMA-BUF heap support, true zero-copy EGL import may not be possible.

Is this intentionally disabled in STM32MP2 BSP?


3) Qt6 Removed RGB565 Support

libcamera outputs RGB16 (RGB565) efficiently.

But Qt6 QVideoFrameFormat does not support RGB565 anymore.

So I am forced to convert to 32-bit (BGRA/RGBx) before sending to QVideoSink.

That conversion alone costs a lot of CPU.

Is there a recommended Qt6-based approach on STM32MP2 to avoid this conversion?


My Question to the Community

Has anyone successfully implemented:

  • libcamera

  • Qt6 QML

  • GPU-accelerated preview

  • Zero-copy DMA-BUF path

on STM32MP257 or STM32MP2 family?

1 ACCEPTED SOLUTION

Accepted Solutions
Dhanakrishna_Chaitanya
Associate II

Hi Yassine_behilil,

Yes — I now have a working solution on STM32MP2 with libcamera and OpenGL.

Instead of going through GStreamer, I implemented a direct pipeline using DMA-BUF and EGL. The current flow is:

libcamera → DMA-BUF (FD) → EGLImage (EGL_LINUX_DMA_BUF_EXT) → GL texture (GL_TEXTURE_EXTERNAL_OES) → Qt/OpenGL rendering

 

In this approach:

  • Frames are never copied to the CPU (zero-copy path)

  • I use the libcamera requestCompleted() callback to get the DMA-BUF FD

  • The FD is passed to the UI thread via a callback

  • An EGLImage is created once per FD and reused (buffer pool)

  • glEGLImageTargetTexture2DOES binds it to a GL texture

This significantly reduces CPU usage.

 

However, there are some important constraints:

  • The buffer format must be GPU/EGL compatible (in my setup only RGB565 works)

  • Proper thread handling is required (the UI thread must do the EGL/GL work)

  • EGLImage creation must be cached per FD to avoid per-frame overhead


4 REPLIES
jumman_JHINGA
Senior III

Using the NEON accelerator will reduce CPU usage by 10 to 15%.

Dhanakrishna_Chaitanya
Associate II

Waiting for a reply.

Yassine_behilil
Associate II

Hi Dhanakrishna,

Did you finally find a real solution for this issue?

I am very interested because I am facing almost the same problem on STM32MP2: the camera preview works, but the CPU still does most of the work while the Vivante GPU stays mostly idle.

Were you able to make the pipeline GPU-accelerated, especially with DMA-BUF zero-copy, EGL import, glupload/glcolorconvert, or qmlglsink?

In other words, did you manage to reach something close to:

libcamerasrc -> glupload -> glcolorconvert -> qmlglsink -> QML

Or did you find another working architecture to reduce CPU load?

Any update would be very useful.

Thanks.
