GPU2D trouble on STM32H7S78-DK

Jack3 · ‎2025-09-21

Hi, I have been using NemaGFX on the STM32U5G9J-DK2 kit with no problems. However, my luck with the STM32H7S78-DK kit has been different.

1. The NemaGFX for this platform only seems to work with 32-bit framebuffers, not 24-bit ones. Or am I doing something wrong?

2. The NemaGFX for this platform uses the beginning of the framebuffer as a sort of workspace. We see random dots at the beginning. I think I'm having trouble allocating buffers correctly.

3. I can draw filled rectangles, but filled circles with 'nema_vg_draw_circle()' only show the edge slightly and produce small images in a sort of workspace at the beginning of the frame buffer.
nema_vg_set_fill_rule(NEMA_VG_FILL_NON_ZERO); does not fill the shape, as it should.

Well, 'nema_vg_draw_rect()' works fine with nema_vg_set_fill_rule(NEMA_VG_FILL_NON_ZERO).

4. Many more graphics functions don't work properly on the STM32H7S78, but they all work fine on the STM32U5G9J. The STM32U5G9J uses 24-bit framebuffers, while the STM32H7S78 can only work with 32-bit framebuffers, it seems. This could be a reason.

nema_vg_set_fill_rule(NEMA_VG_FILL_EVEN_ODD); It does fill the circle, but there still are the four small images in the frame buffer displayed in white.

What am I doing wrong?

So I use the next framebuffer, and I don't have the random pixels at the top, but it doesn't solve drawing a filled circle, rounded filled rectangle, and a lot of other functions don't work properly.

Note the filled rectangle with no rounded corners works, but a rounded filled rectangle only shows the corners somewhat, it has the same problem as the filled circle:

nema_vg_set_fill_rule(NEMA_VG_FILL_NON_ZERO);

nema_vg_set_fill_rule(NEMA_VG_FILL_EVEN_ODD); (this is not the solution!)

What did I miss?
I think I have been careful with caching; the GPU2D works from memory, not from cache.

When I run nema_stencil_init(), a row of white dots appears at the beginning of the framebuffer.
This happens if this function calls one of these:

- nema_vg_init()
- nema_vg_init_stencil_pool(DISPLAY_SIZE_W, DISPLAY_SIZE_H, 1);
- nema_vg_init_stencil_prealloc(fbo.w, fbo.h, stencil_bo);

So something is definitely wrong. Even when nema_vg_init() is executed, this seems to happen, regardless of where my framebuffer is located.

What can I do better?

Thanks for any help, it is much appreciated!

BTW, on a STM32U5G9J-DK2 (m33) all works fine
Please note the versions; the lower versions show rendering issues:
STM32U5x9 v3.5.0 - Works fine
STM32H7RS v1.3.0 - Shows rendering issues
STM32N6 v1.3.0 - Shows rendering issues

JBJOE.1 · ‎2025-10-30

Hi @Jack3

Let's try to take a step back and start from one of the NeoChromSDK examples (I tested with version 1.3.0):
start with this:
\NeoChromSDK\Projects\STM32H7S78-DK\Applications\GPU2D\vector_watchface\
Find the file
\NeoChromSDK\Projects\STM32H7S78-DK\Applications\GPU2D\vector_watchface\Src\svg.c

Replace the app_main() function in that file with this:

void app_main()
{

  nema_vg_init_stencil_pool(RESX, RESY, STENCIL_MEM_POOL_ID);

  nema_cmdlist_t cl = nema_cl_create_sized(8 * 1024);
  nema_cl_bind_sectored_circular(&cl,8);
  
  unsigned char green_channel = 0x00;

  while(run)
  {

    img_obj_t *fbo = get_current_framebuffer();
    nema_bind_dst_tex((uintptr_t)fbo->bo.base_phys, fbo->w, fbo->h, fbo->format, fbo->stride);
    
    nema_set_clip(0, 0, RESX, RESY);
    nema_clear(nema_rgba(0xff, green_channel++, 0, 0xff));

    nema_matrix3x3_t m;
    NEMA_VG_PAINT_HANDLE paint;
    paint = nema_vg_paint_create();

    nema_mat3x3_load_identity(m);
  //  nema_mat3x3_translate(m, 0.0, 0.0); // fun to play with 
  //  nema_mat3x3_rotate(m, 0.0); // fun to play with
  //  nema_mat3x3_scale(m, 1.0, 1.0); // fun to play with 

    nema_vg_paint_set_type(paint, NEMA_VG_PAINT_COLOR);
    nema_vg_paint_set_opacity(paint, 1.0);
    nema_vg_paint_set_paint_color(paint, nema_rgba(0xFF, green_channel, 0xFF, 0xFF));
    nema_vg_stroke_set_width(15.0);

    // Choose a fill rule
    nema_vg_set_fill_rule(NEMA_VG_STROKE);
//    nema_vg_set_fill_rule(NEMA_VG_FILL_EVEN_ODD);
//    nema_vg_set_fill_rule(NEMA_VG_FILL_NON_ZERO);
    
    // Basic blend mode setup for now
    nema_set_blend_fill(NEMA_BL_SRC_OVER);
    nema_vg_set_blend(NEMA_BL_SRC_OVER);

    nema_enable_aa_flags(1);

    nema_vg_set_quality(NEMA_VG_QUALITY_MAXIMUM); 

    nema_vg_draw_rect(80, 140, 200, 200, m, paint);
    nema_vg_draw_circle(400, 240, 100, m, paint);
    nema_vg_draw_rounded_rect(520, 140, 200, 200, 40, 40, m, paint);

    nema_cl_submit(&cl);
    nema_cl_wait(&cl);
    
    swap_buffers(); // wait for sync up with the LTDC controller

    nema_vg_paint_destroy(paint);
  }
  nema_cl_unbind();
  nema_cl_destroy(&cl);
}

This example draws your three vg shapes. All the exotic blending operations from the previous code are reduced to a simple NEMA_BL_SRC_OVER. The rendering is in sync with the LTDC through the call to swap_buffers(). In this configuration you can play with nema_vg_set_quality() and nema_vg_set_fill_rule(). This configuration draws every time on my board.
Can you reproduce this setup?

Regards,
Jakob

Jakob BJOERN
Senior Software Engineer | STM32 Graphics


JBJOE.1 · ‎2025-09-23

Hi @Jack3

Thorough work, and interesting details.
First and foremost, I can confirm that the GPU2D in the STM32H7R/S does not handle 24bpp natively. This applies to both input and output (framebuffer). What I see people do is choose either a 16bpp or a full 32bpp framebuffer. But there is a third option that uses another IP of the STM32H7R/S called the GFXMMU. It can be configured in "packing mode". In this mode the virtual buffers of the GFXMMU can expose a 24bpp framebuffer as if it were a 32bpp framebuffer. So while the GPU2D in the STM32H7R/S does not handle 24bpp, the overall system does. If you want a fast start using the GFXMMU in packing mode, I suggest starting from the TouchGFX Designer. It has a template TBS called "STM32H7S78 DK 24 BPP". Have a look at MX_GFXMMU_Init(); the most important settings here are BlockSize, BufXAddress and Packing:

static void MX_GFXMMU_Init(void)
{
  GFXMMU_PackingTypeDef pPacking = {0};
  hgfxmmu.Instance = GFXMMU;
  hgfxmmu.Init.BlockSize = GFXMMU_12BYTE_BLOCKS;
  hgfxmmu.Init.DefaultValue = 0;
  hgfxmmu.Init.AddressTranslation = DISABLE;
  hgfxmmu.Init.Buffers.Buf0Address = 0x90000000;
  hgfxmmu.Init.Buffers.Buf1Address = 0x90200000;
  hgfxmmu.Init.Buffers.Buf2Address = 0;
  hgfxmmu.Init.Buffers.Buf3Address = 0;
  hgfxmmu.Init.Interrupts.Activation = DISABLE;
  if (HAL_GFXMMU_Init(&hgfxmmu) != HAL_OK)
  {
    Error_Handler();
  }
  pPacking.Buffer0Activation = ENABLE;
  pPacking.Buffer0Mode = GFXMMU_PACKING_MSB_REMOVE;
  pPacking.Buffer1Activation = ENABLE;
  pPacking.Buffer1Mode = GFXMMU_PACKING_MSB_REMOVE;
  pPacking.Buffer2Activation = DISABLE;
  pPacking.Buffer2Mode = GFXMMU_PACKING_MSB_REMOVE;
  pPacking.Buffer3Activation = DISABLE;
  pPacking.Buffer3Mode = GFXMMU_PACKING_MSB_REMOVE;
  pPacking.DefaultAlpha = 0xFF;
  if (HAL_GFXMMU_ConfigPacking(&hgfxmmu, &pPacking) != HAL_OK)
  {
    Error_Handler();
  }
}
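To make the packing idea concrete, here is a small host-side model in C. It is only a sketch of what "MSB remove" with a default alpha of 0xFF could look like in software: four 32-bit pixels (16 bytes) shrink to 12 bytes, which presumably is why the 12-byte block size is selected. The function names and the byte order are assumptions for illustration; the real translation is done by the GFXMMU hardware, not by code like this.

```c
#include <stddef.h>
#include <stdint.h>

/* Software model only: store each 32-bit pixel as 3 bytes by dropping its
 * most significant byte (the alpha channel of an ARGB8888 word).
 * The byte order (least significant byte first) is an assumption. */
void model_pack_msb_remove(const uint32_t *src, uint8_t *dst, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        dst[3 * i + 0] = (uint8_t)(src[i] & 0xFFu);
        dst[3 * i + 1] = (uint8_t)((src[i] >> 8) & 0xFFu);
        dst[3 * i + 2] = (uint8_t)((src[i] >> 16) & 0xFFu);
    }
}

/* Software model only: rebuild 32-bit pixels from the packed bytes and fill
 * the removed byte with a default alpha value. */
void model_unpack_default_alpha(const uint8_t *src, uint32_t *dst,
                                size_t pixels, uint8_t default_alpha)
{
    for (size_t i = 0; i < pixels; i++) {
        dst[i] = ((uint32_t)default_alpha << 24)
               | ((uint32_t)src[3 * i + 2] << 16)
               | ((uint32_t)src[3 * i + 1] << 8)
               |  (uint32_t)src[3 * i + 0];
    }
}
```

In this model the alpha written by the renderer is lost on the way to memory and replaced by the default value on the way back, so the physical buffer only ever holds the three colour bytes per pixel.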

The second thing I can point out is that the STM32H7R/S has a texture cache that is a little different from the one on the STM32U5G9. In the STM32H7R/S the texture cache is implemented with the ICACHE instance. It is correct that the GPU2D works with memory, but the texture cache (ICACHE) boosts performance when reading source images from external flash. This is great, but for drawing vector graphics the GPU2D in some operations uses the stencil buffer for intermediate calculations. Without proper cache maintenance the ICACHE ends up with stale data when reading back from the stencil buffer. All the template TBS projects in TouchGFX Designer have this setup; I suggest having a look at one of these. The highlights of the maintenance are:

In nema_hal.c

void HAL_GPU2D_ErrorCallback(GPU2D_HandleTypeDef *hgpu2d)
{
    uint32_t val = nema_reg_read(GPU2D_SYS_INTERRUPT); /* clear the ER interrupt */
    nema_reg_write(GPU2D_SYS_INTERRUPT, val);
    /* external GPU2D cache maintenance */
    if (val & (1UL << 2))
    {
        HAL_ICACHE_Disable();
        nema_ext_hold_deassert_imm(2);
    }
    if (val & (1UL << 3))
    {
        HAL_ICACHE_Enable();
        HAL_ICACHE_Invalidate();
        nema_ext_hold_deassert_imm(3);
    }
    return;
}
void platform_disable_cache(void)
{
    nema_ext_hold_assert(2, 1);
}
void platform_invalidate_cache(void)
{
    nema_ext_hold_assert(3, 1);
}

platform_disable_cache() and platform_invalidate_cache() are called from inside the nemagfx library.
In the TouchGFXConfiguration.cpp (or any other init section) we need to enable the user configurable interrupts from the GPU2D. Look for nema_ext_hold_xxx :

void touchgfx_components_init()
{
    nema_init();
    nema_reg_write(0xFFC, 0x7E); /* Enable bus error interrupts */
    nema_vg_init_stencil_pool(800, 480, 1);
    nema_vg_handle_large_coords(1, 1);
    nema_ext_hold_enable(2);
    nema_ext_hold_irq_enable(2);
    nema_ext_hold_enable(3);
    nema_ext_hold_irq_enable(3);
}

Regarding memory allocation and framebuffer location, this is similar to the STM32U5G9. In short, the nemagfx library needs a small amount of memory to allocate from. This memory is always drawn from memory pool 0 unless specified otherwise. This means that the implementation of nema_buffer_create_pool() (in nema_hal.c) needs to allow for multiple allocations from pool 0. One way of solving this can be seen in the nema_hal.c file generated by the STM32H7S78 TouchGFX TBS. Pool zero typically does not need to be especially big if the stencil buffer is allocated from a different pool.

Stencil buffer creation is done as part of the call to nema_vg_init(). Instead of this, we can control which pool it is allocated from like this:

nema_vg_init_stencil_pool(800, 480, 1);

Each pool can be placed in a specific location in memory with the help of the linker or with a manually maintained memory map. Notice that performance can be affected if the framebuffers and stencil buffer are not 8-byte aligned; again, this can also be seen in the TBS for the STM32H7S78.
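As a rough aid for sizing and placing such a pool, the helpers below compute a stencil size and check 8-byte alignment on the host. This is a sketch under one stated assumption: the stencil buffer needs one byte per pixel of the bound destination. The helper names are made up for this example and are not part of NemaGFX. Any extra space an allocator may need on top of the raw buffer is not modelled here.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: one stencil byte per destination pixel. */
size_t stencil_bytes(size_t width, size_t height)
{
    return width * height;
}

/* Round a size or address up to the next multiple of `alignment`.
 * `alignment` must be a power of two. */
uintptr_t align_up(uintptr_t value, uintptr_t alignment)
{
    return (value + alignment - 1u) & ~(alignment - 1u);
}

/* Non-zero when `address` is a multiple of `alignment` (power of two). */
int is_aligned(uintptr_t address, uintptr_t alignment)
{
    return (address & (alignment - 1u)) == 0u;
}
```

For an 800x480 destination this gives 384000 bytes, and a check such as is_aligned((uintptr_t)buffer, 8) can go into an assertion next to the pool initialization.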
I hope it helps you further.
Regards,

Jakob BJOERN
Senior Software Engineer | STM32 Graphics

Jack3 · ‎2025-09-23

Hi Jakob,

Thank you! This is very helpful information!
This gives me the opportunity to learn more about GPU2D/NemaGFX on the STM32H7S78.

I was confident I would get it working with your help!
However, I use it baremetal, without RTOS, so there's just that one small glitch in my project that prevents it from working.

It probably has to do with some of those things:

1. Initializing the GPU properly.

2. Initializing "nemagfx_pool_mem", how to do this correctly on a baremetal solution.

3. Initializing "nemagfx_ring_buffer_mem", how to do this correctly on a baremetal solution.

4. Telling NemaGFX to use a small memory space *outside* the framebuffer. It now produces ghost images.

Observations:

1. ICACHE: it does not seem to work yet.

2. I called these, in this particular order:

  nema_init();

  nema_reg_write(0xFFC, 0x7E); /* Enable bus error interrupts */
  nema_vg_init_stencil_pool(800, 480, 1);
  nema_vg_handle_large_coords(1, 1);
  nema_ext_hold_enable(2);
  nema_ext_hold_irq_enable(2);
  nema_ext_hold_enable(3);
  nema_ext_hold_irq_enable(3);

  nema_sys_init();

Note:
"GPU2D_CommandListCpltCallback" in nema_hal.c is called.
"HAL_GPU2D_ErrorCallback" is never called.

3. In the end, I had two buffers with NEMAGFX_MEM_POOL_SIZE: nemagfx_pool_mem and nemagfx_ring_buffer_mem. So I had to modify the linker file to accommodate for both. That doesn't seem right.
What's the correct way to initialize both in a baremetal project?

This is what I changed:

1. I forgot to include the HAL_ICACHE_MODULE (ICACHE_GPU2D) on the STM32H7S78!
So I've added the ICACHE_GPU2D functionality to the project.

2. I updated my nema_hal to use the ICACHE module.

3. I called the functions in "touchgfx_components_init()", enabling the required interrupts.
I noticed "HAL_GPU2D_ErrorCallback" never fires, so the ICACHE functionality probably doesn't work yet.
I also tried to call it from the GPU2D_IRQHandler. Then it triggers, but the graphic results didn't change.

4. I've also added the GFXMMU functionality to the project to support 24-bit framebuffers, and modified "MX_LTDC_Init" to use the GFXMMU base address as the LTDC start address.
Despite this, in 24-bit mode, vertical lines were displayed over the full width with no GPU2D output seen.
I would be able to get this working, but only once the GPU is up and running. First things first, I think.
5. "nema_buffer_create_pool" has been made to work and effectively selects a framebuffer to draw to, which works. It no longer hangs.

I don't quite understand initializing the stencil buffer and the concept of POOL_IDs.

And I don't understand the part: "Regarding memory allocation and framebuffer location, this is similar to the STM32U5G9. In short the nemagfx library needs to have a small amount of memory to allocate from."
I think the frame buffer is rather quite large (800x480x4 bytes), so it resides in the external RAM of the STM32H7S78. The GPU can apparently access it, as it can draw some shapes. So I assume that's configured correctly.
But I think it's a small pool of working space that really needs to be placed outside the framebuffer.
The unwanted ghosting seems to be related to borrowing it from the actual framebuffer for no apparent reason.
It's this part of the initializing that I don't understand. I need to find a way to specify another place for this workspace, not inside the framebuffer. I can't get this done, yet.

So I can render some things (rectangles with sharp corners, for example), but a lot of things aren't working yet, like filled circles or filled rounded rectangles.

With stroked versions of these, they work surprisingly well, like before. I just wanted to note this again.

nema_vg_set_fill_rule(NEMA_VG_STROKE);

nema_vg_set_fill_rule(NEMA_VG_FILL_EVEN_ODD);

Note: Ghost images at the top: NemaGFX uses workspace inside the framebuffer!
The shown ghost images only reflect the rendering of the two right primitives.

nema_vg_set_fill_rule(NEMA_VG_FILL_NON_ZERO);

Note: Ghost images at the top: NemaGFX uses workspace inside the framebuffer!
Remember, the ghost images shown in the picture above reflected the two right primitives.
They no longer appear in the image below, using NEMA_VG_FILL_NON_ZERO.

After this, I start drawing lots of random colored stroked rectangles at random positions 80x80, stroke 10.
STM32H7S7L8 600MHz 32BPP: 7120 rectangles/sec.
STM32U5G9ZJ 170MHz 24BPP: 11200 rectangles/sec.

My reworked nema_hal.

I briefly thought TouchGFX might be useful for initializing and then using NemaGFX to create my own graphical widgets.
So, I created a project with 4.26.0 that displays a button.
I looked at the generated code, but it doesn't contain any NemaGFX functions, so it might not be using GPU2D.

JBJOE.1 · ‎2025-09-25

Hi @Jack3

The TouchGFX application you created uses GPU2D, but the calls to the nema library are part of the touchgfx libraries. But the initialization of GPU2D, GFXMMU, LTDC etc are all part of your generated application.

As I see it there are three topics in this:
1) Setting up memory pools so the nemagfx library can allocate memory dynamically.

2) Setting up cache management when drawing vector graphics.

3) GFXMMU config

For 1) The concept of pool IDs is designed to have a way to direct memory allocation to different memory types depending on what the platform can provide. On the STM32H7 one will often have the framebuffer in external memory but try to have the stencil buffer in internal memory.
Memory wise we often have these categories of memory needs:
Command list and ring buffer memory - This needs to be your fastest memory as seen from the GPU2D. It is easiest if you make this pool ID 0. The size of the pool is typically around 16kB. If you allocate too much, make sure an assertion or similar catches running out of memory (see nema_buffer_create_pool() in the generated application).

Stencil buffer - This pool needs to match the size of your bound output destination; in your case it will be one byte per pixel in your framebuffer. Having this as a separate pool makes it possible to move it around if your application grows. Ideally the stencil buffer should be in internal SRAM, but it will also work in external memory, just with a performance penalty. Speaking of performance, the stencil buffer likes to be 8-byte aligned in memory. This makes the reads/writes more efficient.

Frame buffer - It is optional whether you allocate the framebuffers in the same way as the two other memory categories. The TouchGFX implementation chose to allocate the framebuffer without the use of nema_buffer_create_pool(), but then you need another way of making sure it is 8-byte aligned.

The initialization of the pools happens in nema_sys_init(). You already have this in your nema_hal.c. Try to debug the inputs to the calls to tsi_malloc_init_pool_aligned(). In your current implementation every attempt to allocate from pool zero will return the address of FrameBuffer[0]. This will not work: multiple variables needed by the nemagfx lib will use the same memory. I recommend using the tsi_malloc solution, but it is entirely up to you; it is only a memory allocator.
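To illustrate why every allocation from pool 0 must return its own address, here is a minimal host-side bump allocator in C. It is a sketch only, not the tsi_malloc implementation and not part of NemaGFX; the demo_pool_* names are hypothetical, and the pool memory is assumed to start on an 8-byte boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal pool that hands out consecutive, aligned chunks. */
typedef struct {
    uint8_t *base;  /* start of the pool memory (assumed 8-byte aligned) */
    size_t   size;  /* total pool size in bytes */
    size_t   used;  /* bytes handed out so far */
} demo_pool_t;

void demo_pool_init(demo_pool_t *pool, void *memory, size_t size)
{
    pool->base = (uint8_t *)memory;
    pool->size = size;
    pool->used = 0;
}

/* Returns a distinct chunk per call, or NULL when the pool is exhausted
 * so that the caller can assert on it. */
void *demo_pool_alloc(demo_pool_t *pool, size_t bytes)
{
    size_t start = (pool->used + 7u) & ~(size_t)7u;  /* keep 8-byte alignment */

    if (start > pool->size || bytes > pool->size - start) {
        return NULL;
    }
    pool->used = start + bytes;
    return pool->base + start;
}
```

With a 24320-byte pool, a 1024-byte ring buffer and an 8192-byte command list each get their own region, and a request that does not fit returns NULL instead of silently overlapping earlier allocations. This allocator never frees memory, so it only shows the idea.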

For 2) Setting up the ICACHE (texture cache) is of course a step in the right direction. When the ICACHE is enabled, the cache maintenance is needed. On the other hand, if the ICACHE is disabled, it is not important. So if your initial setup did not configure the texture cache, then you have not seen these problems yet. When the texture cache is enabled and you are drawing with e.g. nema_vg_draw_ring(), you should see calls to:

void platform_disable_cache(void)
{
  nema_ext_hold_assert(2, 1);
}

void platform_invalidate_cache(void)
{
  nema_ext_hold_assert(3, 1);
}

It seems from your code that the nema_ext_hold signals are enabled. Try to set breakpoints in these functions and exercise the nema_vg_draw_xxx calls. If you are still not seeing the breakpoints being hit, then check whether your application is linking the correct nemagfx library. The STM32H7S7 is a Cortex-M7, so the lib you need will be in this folder:
\YOUR_STM32H7S78_TOUCHGFX_APP\Appli\Middlewares\ST\touchgfx_components\gpu2d\NemaGFX\lib\core\cortex_m7\
If you are running another compiler than the TBS, you can find the full set of libs and headers here:
C:\TouchGFX\4.26.0\touchgfx_components\gpu2d\NemaGFX_NemaP_m7_r01\

For 3) The initialization of the GFXMMU I mentioned previously only links some physical memory to be mapped by the GFXMMU virtual buffers. To access the 32bpp version of the framebuffers you should go through the memory mapped addresses, GFXMMU_VIRTUAL_BUFFER0_BASE and GFXMMU_VIRTUAL_BUFFER1_BASE.

So in code you will have something like:

nema_bind_dst_tex(GFXMMU_VIRTUAL_BUFFER0_BASE , 800, 480, NEMA_RGBA8888, -1);

If you are rendering to framebuffer 0.
You do not need to have the LTDC access the framebuffer through the virtual buffers, but of course make sure the LTDC reads the physical memory as 24BPP like you did on STM32U5G9.
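The two views of the same framebuffer have different strides and sizes, which is easy to get wrong when configuring the GPU2D and the LTDC. The sketch below only does the arithmetic on the host. The helper names are invented for this example; the 800x480 resolution and the buffer addresses 0x90000000 and 0x90200000 are the values quoted earlier in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes per line for a given width and bytes per pixel. */
size_t line_stride_bytes(size_t width, size_t bytes_per_pixel)
{
    return width * bytes_per_pixel;
}

/* Total bytes for one frame. */
size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return line_stride_bytes(width, bytes_per_pixel) * height;
}

/* Non-zero when a frame of `bytes` starting at `buf0` ends at or before `buf1`. */
int frame_fits_before(uint32_t buf0, uint32_t buf1, size_t bytes)
{
    return (buf1 > buf0) && ((size_t)(buf1 - buf0) >= bytes);
}
```

For 800x480 the 32bpp view seen by the GPU2D has a stride of 3200 bytes per line, while the packed 24bpp data read by the LTDC takes 2400 bytes per line and 1152000 bytes per frame, which fits in the 2 MiB between the two physical buffer addresses.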

I hope it makes you go further. In any circumstance you have a working template in the TouchGFX application.

Regards,

Jakob BJOERN
Senior Software Engineer | STM32 Graphics

Jack3 · ‎2025-09-25

Hi Jakob, thank you!

I addressed the above issues. However, the rendering problems with the shapes remain the same.

Ruled out:

- The framebuffer depth.
- The version of NemaGFX.
- The initialization of NemaGFX.

What changed:

1) Corrected set up of memory pools for the nemagfx library.
2) Configured cache management when drawing vector graphics.
3) Tested 24-bit framebuffers using GFXMMU (shows no difference).

Note:

- NEMAGFX_STENCIL_POOL_SIZE = 389120, which is 800*480 +5120.
Using 384000 does not work. Why are the extra 5120 bytes required to let the GPU work?

In this example, the last two functions are not rendering correctly. However, there are more functions that expose rendering issues.

nema_vg_set_fill_rule(NEMA_VG_FILL_NON_ZERO);
nema_vg_draw_rect(80, 140, 200, 200, NemaGFX.m, NemaGFX.paint); /* Works */
nema_vg_draw_circle(400, 240, 100, NemaGFX.m, NemaGFX.paint); /* Fails*/
nema_vg_draw_rounded_rect(520, 140, 200, 200, 40, 40, NemaGFX.m, NemaGFX.paint); /* Fails*/

The difference between the STM32H7S7 and the STM32U5G9:

The STM32H7S7 (Cortex-M7) reveals rendering issues in the 2nd and 3rd shape:

The STM32U5G9 (Cortex-M33) works flawlessly:

Due to the failing GPU2D/NemaGFX on the STM32H7RS (Cortex-M7) and the STM32N6 (Cortex-M55), we will instead use the STM32U5G9J (Cortex-M33), which works flawlessly after successful tests.

And for those interested, there's the revised nema_hal.c, linker file, and ioc file.
Feedback is always welcome!
If I find a solution for the STM32H7RS, I'll be sure to post an update on the cause of the problem.
It may be my mistake, though I have no issues with the STM32U5G9J.

I noticed the versions below:
STM32U5x9 v3.5.0 - Works fine
STM32H7RS v1.3.0 - Reveals rendering issues with the vg libraries
STM32N6 v1.3.0 - Reveals rendering issues with the vg libraries

So I assume the NemaGFX v1.3.0 versions for the m7 and m55 aren't finished yet, regarding the vg libraries.
I also noticed the STM32N6 has a bug when using the NEMA_BGR565 format, while NEMA_RGB565 works fine.

JBJOE.1 · ‎2025-10-27

Hi @Jack3

I have recreated your described setup, and I now understand broadly what is going on. And I even found a workaround that makes it possible to render the rounded rect with fill rule set to NEMA_VG_FILL_NON_ZERO.

In fact this bug also shows with nema_vg_draw_ellipse(), but those are also the only draw functions that are affected. One could argue that since this only applies to ellipse and rounded rect, it should not matter which fill rule is used (non-zero or even-odd); the end result is the same.

But should you want to keep the NEMA_VG_FILL_NON_ZERO overall when drawing rounded rectangles you can make this work by setting the VG Quality parameter to max. So you want to do something like this:

    nema_vg_set_quality(NEMA_VG_QUALITY_MAXIMUM);
    nema_vg_set_fill_rule(NEMA_VG_FILL_NON_ZERO);

    nema_mat3x3_load_identity(m);
    nema_vg_paint_set_type(paint, NEMA_VG_PAINT_COLOR);
    nema_vg_paint_set_paint_color(paint, nema_rgba(0xff,0x68,0xb4,0xff));

    nema_vg_draw_circle(180, 240, 100, m, paint); 
    nema_vg_draw_ellipse(400, 240, 100, 50, m, paint); 
    nema_vg_draw_rounded_rect(520, 140, 200, 200, 40, 40, m, paint);

The reason why this works lies in the way the library optimizes. Some drawing operations, even though they are vector graphics, do not need to render using the stencil buffer. This improves performance, but unfortunately in this case there is a bug in this corner of the current library. When setting NEMA_VG_QUALITY_MAXIMUM, all vg drawing operations are forced through the stencil buffer, which you will see works as expected.

Most drawing operations always use the stencil buffer, including vector fonts. And it is also in vector fonts where we typically see the importance of using the right NEMA_VG_FILL_*. But I would say it is generally accepted that if a font renders differently with "non zero" and "even odd", then the font is ill designed.
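The difference between the two fill rules can be written down in a few lines. The sketch below is generic vector-graphics logic, not NemaGFX code, and the function names are made up for this example. It takes the winding number of a point (how many times the outline wraps around it, signed by direction) and decides whether the point is filled.

```c
#include <stdbool.h>

/* Non-zero rule: a point is inside when its winding number is not zero. */
bool inside_non_zero(int winding)
{
    return winding != 0;
}

/* Even-odd rule: a point is inside when its winding number is odd, which is
 * the same as a ray from the point crossing the outline an odd number of times. */
bool inside_even_odd(int winding)
{
    return (winding % 2) != 0;
}
```

For a simple outline that does not cross itself, such as a circle, an ellipse or a rounded rectangle, every point has a winding number of 0, 1 or -1, so both rules agree. They only differ where an outline overlaps itself, for example a winding number of 2 is filled by non-zero but not by even-odd.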

Anyway, the bug has been reported to the backend team and I hope it will be fixed in the next release of the library. Until then I recommend keeping with NEMA_VG_FILL_EVEN_ODD or NEMA_VG_QUALITY_MAXIMUM.

Regards,

Jakob BJOERN
Senior Software Engineer | STM32 Graphics

Jack3 · ‎2025-10-28

Hi Jakob!

Thank you very much for responding.

We always need to set:

    nema_vg_set_quality(NEMA_VG_QUALITY_MAXIMUM);

Otherwise, a 45 degree rotated square is cut from the interior of a filled circle.

So even though I set it as the default, it didn't solve the problem.
Below is my code snippet for testing:

  Display.NemaGFX.cl = nema_cl_create_sized(8192);
  nema_cl_bind_circular(&Display.NemaGFX.cl);

  nema_bind_dst_tex(Display.NemaGFX.fbo.bo.base_phys, Display.NemaGFX.fbo.w, Display.NemaGFX.fbo.h, Display.NemaGFX.fbo.format, Display.NemaGFX.fbo.stride);
  nema_bind_src_tex(Display.NemaGFX.fbo.bo.base_phys, Display.NemaGFX.fbo.w, Display.NemaGFX.fbo.h, Display.NemaGFX.fbo.format, Display.NemaGFX.fbo.stride, NEMA_FILTER_PS);

  nema_set_clip(0, 0, DISPLAY_SIZE_W, DISPLAY_SIZE_H);

  nema_set_blend_fill(NEMA_BL_SRC_OVER);

  nema_blending_mode(NEMA_BF_SRCALPHA, NEMA_BF_INVSRCALPHA, NEMA_BLOP_MODULATE_A);

  Display.NemaGFX.paint = nema_vg_paint_create();
  nema_vg_paint_clear(Display.NemaGFX.paint);

  nema_mat3x3_load_identity(Display.NemaGFX.m);
  nema_mat3x3_translate(Display.NemaGFX.m, 0.0, 0.0);
  nema_mat3x3_rotate(Display.NemaGFX.m, 0.0);
  nema_mat3x3_scale(Display.NemaGFX.m, 1.0, 1.0);

  nema_vg_paint_set_type(Display.NemaGFX.paint, NEMA_VG_PAINT_COLOR);
  nema_vg_paint_set_opacity(Display.NemaGFX.paint, 1.0);
  nema_vg_paint_set_paint_color(Display.NemaGFX.paint, nema_rgba(0xFF, 0x00, 0xFF, 0xFF));
  nema_vg_stroke_set_width(15.0);

  nema_vg_set_fill_rule(NEMA_VG_FILL_NON_ZERO);
  nema_vg_set_blend(NEMA_BL_SRC_OVER| NEMA_BLOP_SRC_PREMULT | NEMA_BLOP_SRC_CKEY);

  nema_enable_aa_flags(1);
  nema_enable_aa(1, 1, 1, 1);

  /**
   * This quality setting is required to get nice rounded corners and avoid
   * displaying a cut-out 45 degrees rotated square inside a filled circle!
   */
  nema_vg_set_quality(NEMA_VG_QUALITY_MAXIMUM);

  nema_vg_draw_rect(80, 140, 200, 200, Display.NemaGFX.m, Display.NemaGFX.paint);
  nema_vg_draw_circle(400, 240, 100, Display.NemaGFX.m, Display.NemaGFX.paint);
  nema_vg_draw_rounded_rect(520, 140, 200, 200, 40, 40, Display.NemaGFX.m, Display.NemaGFX.paint);

  nema_cl_unbind();
  nema_cl_submit(&Display.NemaGFX.cl);
  nema_cl_wait(&Display.NemaGFX.cl);
  nema_cl_destroy(&Display.NemaGFX.cl);

  nema_vg_paint_destroy(Display.NemaGFX.paint);

Am I perhaps missing something else?

Kind regards, Jack.

JBJOE.1 · ‎2025-10-28

Hi @Jack3

Thank you for the code block, it removed some of my guessing. It seems that I have not recreated your setup fully, since I can't see the issue you describe.
One thing that I think will help the debugging is to extend your HAL_GPU2D_ErrorCallback() function. Any bus error detected by the GPU2D is reported via the system interrupts. Besides the texture cache management system, no other interrupts should be coming from the GPU.

I would typically have an assert or a simple freeze in this function if I receive something unexpected. Try to modify your function to something like this:

void HAL_GPU2D_ErrorCallback(GPU2D_HandleTypeDef *hgpu2d)
{
    (void)hgpu2d; // explicitly mark as unused to avoid unused parameter warning
    uint32_t val = nema_reg_read(GPU2D_SYS_INTERRUPT);

    /* clear the ER interrupt */
    nema_reg_write(GPU2D_SYS_INTERRUPT, val);

    if (val & ~0xFU) {
        /* unexpected error */
        for (;;);
    }

    /* external GPU2D cache maintenance */
    if (val & (1UL << 2)) {
        HAL_ICACHE_Disable();
        nema_ext_hold_deassert_imm(2);
    }
    if (val & (1UL << 3)) {
        HAL_ICACHE_Enable();
        HAL_ICACHE_Invalidate();
        nema_ext_hold_deassert_imm(3);
    }
}

Anything that causes us to hang in the infinite for loop is because the GPU2D detected an error. Having this makes sure nothing slips through during development. Your current implementation just ignores these errors.
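The bit tests in that callback can be checked on a PC without any hardware. The sketch below mirrors only the decoding step: bits 2 and 3 are the two cache maintenance requests, and anything outside the low four bits is treated as unexpected, as in the handler above. The demo_* names are invented for this example and the real callback still has to do the register and ICACHE work.

```c
#include <stdint.h>

#define DEMO_IRQ_CACHE_DISABLE     (1UL << 2)  /* hold signal 2 in the handler above */
#define DEMO_IRQ_CACHE_INVALIDATE  (1UL << 3)  /* hold signal 3 in the handler above */
#define DEMO_IRQ_EXPECTED_MASK     0xFUL       /* bits accepted without freezing */

typedef struct {
    int unexpected;        /* a bit outside the expected mask was set */
    int disable_cache;     /* texture cache should be disabled */
    int invalidate_cache;  /* texture cache should be enabled and invalidated */
} demo_irq_actions_t;

/* Decode an interrupt status word into the actions the callback would take. */
demo_irq_actions_t demo_decode_irq(uint32_t val)
{
    demo_irq_actions_t actions;

    actions.unexpected       = (val & ~(uint32_t)DEMO_IRQ_EXPECTED_MASK) != 0u;
    actions.disable_cache    = (val & DEMO_IRQ_CACHE_DISABLE) != 0u;
    actions.invalidate_cache = (val & DEMO_IRQ_CACHE_INVALIDATE) != 0u;
    return actions;
}
```

Keeping the decoding in a small pure function like this makes it possible to log or unit test the status values seen during development before deciding to freeze.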

I took your code snippet with a few modifications. I can't get it to fail when I am using NEMA_VG_QUALITY_MAXIMUM. Can you try this in your setup:

  cl = nema_cl_create_sized(8 * 1024);
  nema_cl_bind_sectored_circular(&cl,8);
  nema_set_clip(0, 0, RESX, RESY);

  img_obj_t *fbo = get_current_framebuffer();
  nema_bind_dst_tex((uintptr_t)fbo->bo.base_phys, RESX, RESY, NEMA_RGB565, -1);
  nema_clear(nema_rgba(0, 0xff, 0, 0xff));

  nema_set_blend_fill(NEMA_BL_SRC_OVER);

//  nema_blending_mode(NEMA_BF_SRCALPHA, NEMA_BF_INVSRCALPHA, NEMA_BLOP_MODULATE_A);  // not needed / does nothing when drawing filled shapes 

  nema_matrix3x3_t m;
  NEMA_VG_PAINT_HANDLE paint;
  paint = nema_vg_paint_create();
//  nema_vg_paint_clear(paint); // not needed / does nothing when drawing filled shapes 

  nema_mat3x3_load_identity(m);
//  nema_mat3x3_translate(m, 0.0, 0.0); // does nothing 
//  nema_mat3x3_rotate(m, 0.0); // does nothing
//  nema_mat3x3_scale(m, 1.0, 1.0); // does nothing

  nema_vg_paint_set_type(paint, NEMA_VG_PAINT_COLOR);
  nema_vg_paint_set_opacity(paint, 1.0);
  nema_vg_paint_set_paint_color(paint, nema_rgba(0xFF, 0x00, 0xFF, 0xFF));
  nema_vg_stroke_set_width(15.0); // not needed / does nothing when drawing filled shapes 

  nema_vg_set_fill_rule(NEMA_VG_FILL_NON_ZERO);
//  nema_vg_set_blend(NEMA_BL_SRC_OVER| NEMA_BLOP_SRC_PREMULT | NEMA_BLOP_SRC_CKEY); // not needed / does nothing when drawing filled shapes

  nema_enable_aa_flags(1);

  nema_vg_set_quality(NEMA_VG_QUALITY_MAXIMUM); // currently important to force Stencil buffer usage.

  nema_vg_draw_rect(80, 140, 200, 200, m, paint);
  nema_vg_draw_circle(400, 240, 100, m, paint);
  nema_vg_draw_rounded_rect(520, 140, 200, 200, 40, 40, m, paint);

  nema_cl_unbind();
  nema_cl_submit(&cl);
  nema_cl_wait(&cl);
  nema_cl_destroy(&cl);
  nema_vg_paint_destroy(paint);

Remember you can call nema_cl_submit() at any point if you want to have the GPU2D execute the commands in the commandlist up to that point. That way you can "single step" the application in sync with the GPU2D.

Regards,

Jakob BJOERN
Senior Software Engineer | STM32 Graphics

Jack3 · ‎2025-10-29

Hi Jakob,

thank you very much!

I modified the HAL_GPU2D_ErrorCallback function so it will hang in a loop.
The result is that no hang occurs (after this test, rectangles with random color and position are printed endlessly).

I noticed that commenting out this line causes the rectangle on the left to no longer display:

// nema_vg_set_blend(NEMA_BL_SRC_OVER| NEMA_BLOP_SRC_PREMULT | NEMA_BLOP_SRC_CKEY); // not needed / does nothing when drawing filled shapes

The other two shapes look the same:

I tried the NemaGFX SDK v1.3.0 and the one provided by TouchGFX studio, and I'm using
libnemagfx-float-abi-hard.a

So, something else must be wrong, I guess.

My nema_hal.c:

/*
 * File            : nema_hal.c
 */

#include <display.h>
#include <stdlib.h>
#include <stm32h7s7xx.h>
#include <string.h>
#include <tsi_malloc.h>
#include "nema_hal.h"
#include "nema_vg.h"
#include "nema_graphics.h"
#include "main.h"
#include "gpu2d.h"
#include "gfxmmu.h"
#include "stm32h7rsxx_hal.h"
#include "stm32h7rsxx_hal_gpu2d.h"

#define RING_SIZE                   1024    /* Ring Buffer Size in byte */
#define NEMAGFX_MEM_POOL_SIZE       24320   /* NemaGFX byte pool size in byte */
#define NEMAGFX_STENCIL_POOL_SIZE   389120  /* NemaGFX stencil buffer pool size in byte 800*480+5120 */

/* RAM_CMD */
static uint8_t nemagfx_pool_mem[NEMAGFX_MEM_POOL_SIZE] __attribute__((section("Nemagfx_Memory_Pool_Buffer"))); /* NemaGFX memory pool */
static uint8_t nemagfx_stencil_buffer_mem[NEMAGFX_STENCIL_POOL_SIZE] __attribute__((section("Nemagfx_Stencil_Buffer"))); /* NemaGFX stencil buffer memory */

typedef struct
{
  uint8_t B;
  uint8_t G;
  uint8_t R;
  uint8_t A;
} Color_RGBA8888_t;

Color_RGBA8888_t FrameBuffer[DISPLAY_NO_OF_FRAMEBUFFERS][DISPLAY_SIZE_H][DISPLAY_SIZE_W] __attribute__((section("Nemagfx_Framebuffer")));

static nema_ringbuffer_t ring_buffer_str = {{0}};

static volatile int last_cl_id = -1;

#if (USE_HAL_GPU2D_REGISTER_CALLBACKS == 1)
static void GPU2D_CommandListCpltCallback(GPU2D_HandleTypeDef* hgpu2d, uint32_t CmdListID)
#else /* USE_HAL_GPU2D_REGISTER_CALLBACKS = 0 */
void HAL_GPU2D_CommandListCpltCallback(GPU2D_HandleTypeDef* hgpu2d, uint32_t CmdListID)
#endif /* USE_HAL_GPU2D_REGISTER_CALLBACKS = 1 */
{
  UNUSED(hgpu2d);

  last_cl_id = CmdListID;
}

void HAL_GPU2D_ErrorCallback(GPU2D_HandleTypeDef *hgpu2d)
{
  /* Explicitly mark as unused to avoid unused parameter warning */
  UNUSED(hgpu2d);

  uint32_t val = nema_reg_read(GPU2D_SYS_INTERRUPT);
  /* Clear the ER interrupt */
  nema_reg_write(GPU2D_SYS_INTERRUPT, val);

  if (val & ~0x0F)
  {
    /* Unexpected error */
    while (1);
  }

  /* external GPU2D cache maintenance */
  if (val & (1UL << 2))
  {
      HAL_ICACHE_Disable();
      nema_ext_hold_deassert_imm(2);
  }
  if (val & (1UL << 3))
  {
      HAL_ICACHE_Enable();
      HAL_ICACHE_Invalidate();
      nema_ext_hold_deassert_imm(3);
  }
}

int32_t nema_sys_init(void)
{
  /*
   * NEMA| GFX includes the following API calls for memory allocation, deallocation and mapping:
   * - nema_buffer_create() - Allocate memory
   * - nema_buffer_create_pool() - Allocate memory from specific memory pool
   * - nema_buffer_map() - Map allocated memory space for CPU access
   * - nema_buffer_unmap() - Unmap previously mapped memory space
   * - nema_buffer_destroy() - Deallocate memory space
   *
   */
  int error_code = 0;

  /* Initialize GPU2D */
  hgpu2d.Instance = GPU2D;
  HAL_GPU2D_Init(&hgpu2d);

#if (USE_HAL_GPU2D_REGISTER_CALLBACKS == 1)
  /* Register Command List Complete Callback */
  HAL_GPU2D_RegisterCommandListCpltCallback(&hgpu2d, GPU2D_CommandListCpltCallback);
#endif /* USE_HAL_GPU2D_REGISTER_CALLBACKS = 0 */

  /* Initialize Mem Space */
  error_code = tsi_malloc_init_pool_aligned(0, (void*)nemagfx_pool_mem, (uintptr_t)nemagfx_pool_mem, NEMAGFX_MEM_POOL_SIZE, 1, 8);
  assert(error_code == 0);
  error_code = tsi_malloc_init_pool_aligned(1, (void*)nemagfx_stencil_buffer_mem, (uintptr_t)nemagfx_stencil_buffer_mem, NEMAGFX_STENCIL_POOL_SIZE, 1, 8);
  assert(error_code == 0);

  /* Allocate ring_buffer memory */
  ring_buffer_str.bo = nema_buffer_create_pool(0, RING_SIZE);
  assert(ring_buffer_str.bo.base_virt);

  /* Initialize Ring Buffer */
  error_code = nema_rb_init(&ring_buffer_str, 1);
  if (error_code < 0)
  {
    return error_code;
  }

  /* Reset last_cl_id counter */
  last_cl_id = 0;

  return error_code;
}

void nema_components_init()
{
  /* Initialize NemaGFX library */
  nema_init();

  nema_reg_write(0xFFC, 0x7E); /* Enable bus error interrupts */
  nema_vg_init_stencil_pool(DISPLAY_SIZE_W, DISPLAY_SIZE_H, 1);
  nema_vg_handle_large_coords(1, 1);
  nema_ext_hold_enable(2);
  nema_ext_hold_irq_enable(2);
  nema_ext_hold_enable(3);
  nema_ext_hold_irq_enable(3);

  nema_sys_init();
}

uint32_t nema_reg_read(uint32_t reg)
{
  return HAL_GPU2D_ReadRegister(&hgpu2d, reg);
}

void nema_reg_write(uint32_t reg, uint32_t value)
{
  HAL_GPU2D_WriteRegister(&hgpu2d, reg, value);
}

int nema_wait_irq(void)
{
  /* Wait indefinitely for a free semaphore - baremetal, not implemented */
  return 0;
}

int nema_wait_irq_cl(int cl_id)
{
  while (last_cl_id < cl_id)
  {
    (void)nema_wait_irq();
  }

  return 0;
}

int nema_wait_irq_brk(int brk_id)
{
  UNUSED(brk_id);
  while (nema_reg_read(GPU2D_BREAKPOINT) == 0U)
  {
    (void)nema_wait_irq();
  }

  return 0;
}

void nema_host_free(void *ptr)
{
  if (ptr)
  {
    tsi_free(ptr);
  }
}

void *nema_host_malloc(unsigned size)
{
  return tsi_malloc(size);
}

nema_buffer_t nema_buffer_create(int size)
{
  nema_buffer_t bo;
  memset(&bo, 0, sizeof(bo));
  bo.base_virt = tsi_malloc(size);
  bo.base_phys = (uint32_t)bo.base_virt;
  bo.size      = size;
  assert(bo.base_virt != 0 && "Unable to allocate memory in nema_buffer_create");

  return bo;
}

nema_buffer_t nema_buffer_create_pool(int pool, int size)
{
  nema_buffer_t bo;
  memset(&bo, 0, sizeof(bo));
  bo.base_virt = tsi_malloc_pool(pool, size);
  bo.base_phys = (uint32_t)bo.base_virt;
  bo.size      = size;
  bo.fd        = 0;
  assert(bo.base_virt != 0 && "Unable to allocate memory in nema_buffer_create_pool");

  return bo;
}

/* Used to select the framebuffer */
void MX_GFXMMU_Select_FB(uint8_t index)
{
  GFXMMU_PackingTypeDef pPacking = {0};

  hgfxmmu.Instance = GFXMMU;
  hgfxmmu.Init.BlockSize = GFXMMU_12BYTE_BLOCKS;
  hgfxmmu.Init.DefaultValue = 0;
  hgfxmmu.Init.AddressTranslation = DISABLE;
  hgfxmmu.Init.Buffers.Buf0Address = (uint32_t)FrameBuffer[index];
  hgfxmmu.Init.Buffers.Buf1Address = (uint32_t)FrameBuffer[index] + 0x00200000; /* 2MB ahead 0x00200000 */
  hgfxmmu.Init.Buffers.Buf2Address = 0;
  hgfxmmu.Init.Buffers.Buf3Address = 0;
  hgfxmmu.Init.Interrupts.Activation = DISABLE;
  if (HAL_GFXMMU_Init(&hgfxmmu) != HAL_OK)
  {
    Error_Handler();
  }

  pPacking.Buffer0Activation = ENABLE;
  pPacking.Buffer0Mode = GFXMMU_PACKING_MSB_REMOVE;
  pPacking.Buffer1Activation = ENABLE;
  pPacking.Buffer1Mode = GFXMMU_PACKING_MSB_REMOVE;
  pPacking.Buffer2Activation = DISABLE;
  pPacking.Buffer2Mode = GFXMMU_PACKING_MSB_REMOVE;
  pPacking.Buffer3Activation = DISABLE;
  pPacking.Buffer3Mode = GFXMMU_PACKING_MSB_REMOVE;
  pPacking.DefaultAlpha = 0xFF;
  if (HAL_GFXMMU_ConfigPacking(&hgfxmmu, &pPacking) != HAL_OK)
  {
    Error_Handler();
  }
}

/* Used to select the framebuffer */
nema_buffer_t nema_select_framebuffer(int index)
{
  nema_buffer_t bo;

  memset(&bo, 0, sizeof(bo));

#if (DISPLAY_BYTES_PER_PIXEL == 3)
  MX_GFXMMU_Select_FB(index);
  bo.base_virt = (void *)GFXMMU_VIRTUAL_BUFFER0_BASE;
#else
  bo.base_virt = FrameBuffer[index];
#endif
  bo.base_phys = (uint32_t)bo.base_virt;
  bo.size      = DISPLAY_FRAMEBUFFER_SIZE;
  bo.fd        = 0; /* Buffer allocated */

  return bo;
}

void *nema_buffer_map(nema_buffer_t *bo)
{
  return bo->base_virt;
}

void nema_buffer_unmap(nema_buffer_t *bo)
{
  UNUSED(bo);
}

void nema_buffer_destroy(nema_buffer_t *bo)
{
  if (bo->fd == -1)
  {
    return; /* Buffer wasn't allocated! */
  }

  tsi_free(bo->base_virt);

  bo->base_virt = (void*)0;
  bo->base_phys = 0;
  bo->size      = 0;
  bo->fd        = -1; /* Buffer not allocated */
}

uintptr_t nema_buffer_phys(nema_buffer_t *bo)
{
  return bo->base_phys;
}

void nema_buffer_flush(nema_buffer_t * bo)
{
  UNUSED(bo);
}

int nema_mutex_lock(int mutex_id)
{
  UNUSED(mutex_id);

  return 0;
}

int nema_mutex_unlock(int mutex_id)
{
  UNUSED(mutex_id);

  return 0;
}

void platform_disable_cache(void)
{
  nema_ext_hold_assert(2, 1);
}

void platform_invalidate_cache(void)
{
  nema_ext_hold_assert(3, 1);
}
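
As a sanity check on the buffer sizes used in this file, the arithmetic can be written out as a small host-side sketch. It assumes an 800x480 display (taken from the "800*480+5120" comment on NEMAGFX_STENCIL_POOL_SIZE) and 4 bytes per pixel for Color_RGBA8888_t. The helper names are hypothetical and not part of NemaGFX.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed display geometry, inferred from the "800*480+5120" comment. */
#define SKETCH_DISPLAY_W  800u
#define SKETCH_DISPLAY_H  480u

/* Hypothetical helper: stencil pool = one byte per pixel plus 5120 extra bytes. */
static inline uint32_t sketch_stencil_pool_bytes(uint32_t w, uint32_t h)
{
  return (w * h) + 5120u;
}

/* Hypothetical helper: size of one framebuffer in bytes. */
static inline uint32_t sketch_framebuffer_bytes(uint32_t w, uint32_t h, uint32_t bytes_per_pixel)
{
  return w * h * bytes_per_pixel;
}
```

With these values the stencil pool comes out at 389120 bytes, matching NEMAGFX_STENCIL_POOL_SIZE, and one 32-bit framebuffer takes 1536000 bytes, which stays below the 0x00200000 offset used for Buf1Address.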

For the framebuffers I use memory-mapped PSRAM.
I experimented with configuration changes here as well, but it made no difference:

void MX_XSPI1_Init(void)
{

  /* USER CODE BEGIN XSPI1_Init 0 */
  /* 0x90000000 */
  /* USER CODE END XSPI1_Init 0 */

  XSPIM_CfgTypeDef sXspiManagerCfg = {0};

  /* USER CODE BEGIN XSPI1_Init 1 */

  /* USER CODE END XSPI1_Init 1 */
  hxspi1.Instance = XSPI1;
  hxspi1.Init.FifoThresholdByte = 2;
  hxspi1.Init.MemoryMode = HAL_XSPI_SINGLE_MEM;
  hxspi1.Init.MemoryType = HAL_XSPI_MEMTYPE_APMEM_16BITS;
  hxspi1.Init.MemorySize = HAL_XSPI_SIZE_256MB;
  hxspi1.Init.ChipSelectHighTimeCycle = 1;
  hxspi1.Init.FreeRunningClock = HAL_XSPI_FREERUNCLK_DISABLE;
  hxspi1.Init.ClockMode = HAL_XSPI_CLOCK_MODE_0;
  hxspi1.Init.WrapSize = HAL_XSPI_WRAP_32_BYTES;
  hxspi1.Init.ClockPrescaler = 0;
  hxspi1.Init.SampleShifting = HAL_XSPI_SAMPLE_SHIFT_NONE;
  hxspi1.Init.DelayHoldQuarterCycle = HAL_XSPI_DHQC_ENABLE;
  hxspi1.Init.ChipSelectBoundary = HAL_XSPI_BONDARYOF_8KB;
  hxspi1.Init.MaxTran = 0;
  hxspi1.Init.Refresh = 0;
  hxspi1.Init.MemorySelect = HAL_XSPI_CSSEL_NCS1;
  if (HAL_XSPI_Init(&hxspi1) != HAL_OK)
  {
    Error_Handler();
  }
  sXspiManagerCfg.nCSOverride = HAL_XSPI_CSSEL_OVR_NCS1;
  sXspiManagerCfg.IOPort = HAL_XSPIM_IOPORT_1;
  if (HAL_XSPIM_Config(&hxspi1, &sXspiManagerCfg, HAL_XSPI_TIMEOUT_DEFAULT_VALUE) != HAL_OK)
  {
    Error_Handler();
  }
  /* USER CODE BEGIN XSPI1_Init 2 */

  /* USER CODE END XSPI1_Init 2 */

}
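
One detail worth cross-checking here is MemorySize. Assuming the HAL encodes the XSPI device size as a field value n meaning 2^(n+1) bytes, and that HAL_XSPI_SIZE_256MB corresponds to n = 24 (that is, 256 Mbit), the setting works out to 32 MB, which agrees with the EXTRAM length in the linker file. The sketch below only reproduces that arithmetic; the function name and the value 24 are assumptions to be verified against stm32h7rsxx_hal_xspi.h.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field value behind HAL_XSPI_SIZE_256MB (256 Mbit); verify in the HAL header. */
#define SKETCH_XSPI_DEVSIZE_256MBIT  24u

/* Hypothetical helper: device size in bytes for a size field value n is 2^(n+1). */
static inline uint32_t sketch_xspi_size_bytes(uint32_t devsize)
{
  return (uint32_t)1u << (devsize + 1u);
}
```

Under these assumptions sketch_xspi_size_bytes(SKETCH_XSPI_DEVSIZE_256MBIT) is 0x02000000, the same 32 MB that the EXTRAM region declares.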

linker file:

/**
 * @file        : LinkerScript.ld
 */

/* Entry Point */
ENTRY(Reset_Handler)

/* Highest address of the user mode stack */
_estack = ORIGIN(DTCM) + LENGTH(DTCM); /* end of "DTCM" Ram type memory */

_Min_Heap_Size  = 0x1000; /* required amount of heap */
_Min_Stack_Size = 0x1000; /* required amount of stack */

/* Memories definition */
MEMORY
{
  RAM       (rw)  : ORIGIN = 0x24000000, LENGTH = 0x0006C000  /* 0x24000000 - 0x2406C000 432kB */
  RAM_CMD   (rw)  : ORIGIN = 0x2406C000, LENGTH = 0x00006000  /* 0x2406C000 - 0x24072000 24kB */

  ITCM      (xrw) : ORIGIN = 0x00000000, LENGTH = 0x00010000
  DTCM      (rw)  : ORIGIN = 0x20000000, LENGTH = 0x00010000
  SRAMAHB   (rw)  : ORIGIN = 0x30000000, LENGTH = 0x00008000
  BKPSRAM   (rw)  : ORIGIN = 0x38800000, LENGTH = 0x00001000

  FLASH     (xr)  : ORIGIN = 0x70000000, LENGTH = 0x01000000  /* XSPI2 0x70000000 - 0x70FFFFFF EXTFLASH 16MB */
  FLASH_GFX (r)   : ORIGIN = 0x71000000, LENGTH = 0x07000000  /* XSPI2 0x71000000 - 0x77FFFFFF EXTFLASH 128-16=112MB */

  EXTFLASH  (xr)  : ORIGIN = 0x70000000, LENGTH = 0x08000000  /* XSPI2 0x70000000 - 0x77FFFFFF EXTFLASH 128MB */
  EXTRAM    (rw)  : ORIGIN = 0x90000000, LENGTH = 0x02000000  /* XSPI1 0x90000000 - 0x92000000 EXTRAM    32MB */
}

/* Sections */
SECTIONS
{
  /* The startup code into "FLASH" FLASH type memory */
  .isr_vector :
  {
    . = ALIGN(4);
    KEEP(*(.isr_vector)) /* Startup code */
    . = ALIGN(4);
  } >FLASH

  /* The program code and other data into "FLASH" FLASH type memory */
  .text :
  {
    . = ALIGN(4);
    *(.text)           /* .text sections (code) */
    *(.text*)          /* .text* sections (code) */
    *(.glue_7)         /* glue arm to thumb code */
    *(.glue_7t)        /* glue thumb to arm code */
    *(.eh_frame)

    KEEP (*(.init))
    KEEP (*(.fini))

    . = ALIGN(4);
    _etext = .;        /* define a global symbols at end of code */
  } >FLASH

  /* Constant data into "FLASH" FLASH type memory */
  .rodata :
  {
    . = ALIGN(4);
    *(.rodata)         /* .rodata sections (constants, strings, etc.) */
    *(.rodata*)        /* .rodata* sections (constants, strings, etc.) */
    . = ALIGN(4);
  } >FLASH

  .ARM.extab (READONLY) : /* The READONLY keyword is only supported in GCC11 and later, remove it if using GCC10 or earlier. */
  {
    . = ALIGN(4);
    *(.ARM.extab* .gnu.linkonce.armextab.*)
    . = ALIGN(4);
  } >FLASH
  .ARM (READONLY) : /* The READONLY keyword is only supported in GCC11 and later, remove it if using GCC10 or earlier. */
  {
    . = ALIGN(4);
    __exidx_start = .;
    *(.ARM.exidx*)
    __exidx_end = .;
    . = ALIGN(4);
  } >FLASH

  .preinit_array (READONLY) : /* The READONLY keyword is only supported in GCC11 and later, remove it if using GCC10 or earlier. */
  {
    . = ALIGN(4);
    PROVIDE_HIDDEN (__preinit_array_start = .);
    KEEP (*(.preinit_array*))
    PROVIDE_HIDDEN (__preinit_array_end = .);
    . = ALIGN(4);
  } >FLASH

  .init_array (READONLY) : /* The READONLY keyword is only supported in GCC11 and later, remove it if using GCC10 or earlier. */
  {
    . = ALIGN(4);
    PROVIDE_HIDDEN (__init_array_start = .);
    KEEP (*(SORT(.init_array.*)))
    KEEP (*(.init_array*))
    PROVIDE_HIDDEN (__init_array_end = .);
    . = ALIGN(4);
  } >FLASH

  .fini_array (READONLY) : /* The READONLY keyword is only supported in GCC11 and later, remove it if using GCC10 or earlier. */
  {
    . = ALIGN(4);
    PROVIDE_HIDDEN (__fini_array_start = .);
    KEEP (*(SORT(.fini_array.*)))
    KEEP (*(.fini_array*))
    PROVIDE_HIDDEN (__fini_array_end = .);
    . = ALIGN(4);
  } >FLASH

  /* Used by the startup to initialize data */
  _sidata = LOADADDR(.data);

  /* Initialized data sections into "RAM" Ram type memory */
  .data :
  {
    . = ALIGN(4);
    _sdata = .;        /* create a global symbol at data start */
    *(.data)           /* .data sections */
    *(.data*)          /* .data* sections */
    *(.RamFunc)        /* .RamFunc sections */
    *(.RamFunc*)       /* .RamFunc* sections */

    . = ALIGN(4);
    _edata = .;        /* define a global symbol at data end */

  } >RAM AT> FLASH

  /* Uninitialized data section into "RAM" Ram type memory */
  . = ALIGN(4);
  .bss :
  {
    /* This is used by the startup in order to initialize the .bss section */
    _sbss = .;         /* define a global symbol at bss start */
    __bss_start__ = _sbss;
    *(.bss)
    *(.bss*)
    *(COMMON)

    . = ALIGN(4);
    _ebss = .;         /* define a global symbol at bss end */
    __bss_end__ = _ebss;
  } >RAM

  /* User_heap_stack section, used to check that there is enough "RAM" Ram  type memory left */
  ._user_heap_stack :
  {
    . = ALIGN(8);
    PROVIDE ( end = . );
    PROVIDE ( _end = . );
    . = . + _Min_Heap_Size;
    . = . + _Min_Stack_Size;
    . = ALIGN(8);
  } >DTCM

  /* Remove information from the compiler libraries */
  /DISCARD/ :
  {
    libc.a ( * )
    libm.a ( * )
    libgcc.a ( * )
  }

  .ARM.attributes 0 : { *(.ARM.attributes) }

  BufferSection (NOLOAD) :
  {
    *(Nemagfx_Framebuffer Nemagfx_Framebuffer.*)
    *(.gnu.linkonce.r.*)
    . = ALIGN(0x8);

    *(Nemagfx_Stencil_Buffer Nemagfx_Stencil_Buffer.*)
    *(.gnu.linkonce.r.*)
    . = ALIGN(0x8);
  } >EXTRAM
  
  UncachedSection (NOLOAD) :
  {
    *(Nemagfx_Memory_Pool_Buffer Nemagfx_Memory_Pool_Buffer.*)
    *(.gnu.linkonce.r.*)
    . = ALIGN(0x8);
  } >RAM_CMD
  
  FontFlashSection :
  {
    *(FontFlashSection FontFlashSection.*)
    *(.gnu.linkonce.r.*)
    . = ALIGN(0x4);
  } >FLASH_GFX

  TextFlashSection :
  {
    *(TextFlashSection TextFlashSection.*)
    *(.gnu.linkonce.r.*)
    . = ALIGN(0x4);
  } >FLASH_GFX

  ExtFlashSection :
  {
    *(ExtFlashSection ExtFlashSection.*)
    *(.gnu.linkonce.r.*)
    . = ALIGN(0x4);
  } >FLASH_GFX
}
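
The address ranges in the MEMORY block can be checked with a few lines of host-side C. This is only a sketch: the struct and helper names are hypothetical, and the origins and lengths are copied from the linker script above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One MEMORY entry from the linker script: origin and length in bytes. */
typedef struct
{
  uint32_t origin;
  uint32_t length;
} sketch_region_t;

/* Values copied from the MEMORY block above. */
static const sketch_region_t SKETCH_RAM       = { 0x24000000u, 0x0006C000u };
static const sketch_region_t SKETCH_RAM_CMD   = { 0x2406C000u, 0x00006000u };
static const sketch_region_t SKETCH_FLASH     = { 0x70000000u, 0x01000000u };
static const sketch_region_t SKETCH_FLASH_GFX = { 0x71000000u, 0x07000000u };

/* First address past the end of a region. */
static inline uint32_t sketch_region_end(sketch_region_t r)
{
  return r.origin + r.length;
}

/* True when two regions share at least one address. */
static inline bool sketch_regions_overlap(sketch_region_t a, sketch_region_t b)
{
  return (a.origin < sketch_region_end(b)) && (b.origin < sketch_region_end(a));
}
```

With these numbers RAM ends exactly where RAM_CMD begins (0x2406C000), RAM_CMD ends at 0x24072000, FLASH covers 0x70000000 to 0x70FFFFFF, and FLASH_GFX covers 0x71000000 to 0x77FFFFFF, with no overlap between the pairs.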

MPU config:

static void MPU_Config(void)
{
  MPU_Region_InitTypeDef MPU_InitStruct = {0};

  /* Disables the MPU */
  HAL_MPU_Disable();

  /* Disables all MPU regions */
  for(uint8_t i=0; i<__MPU_REGIONCOUNT; i++)
  {
    HAL_MPU_DisableRegion(i);
  }

  /** Initializes and configures the Region and the memory to be protected */
  MPU_InitStruct.Enable = MPU_REGION_ENABLE;
  MPU_InitStruct.Number = MPU_REGION_NUMBER0;
  MPU_InitStruct.BaseAddress = 0x00000000;
  MPU_InitStruct.Size = MPU_REGION_SIZE_4GB;
  MPU_InitStruct.SubRegionDisable = 0x87;
  MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL0;
  MPU_InitStruct.AccessPermission = MPU_REGION_NO_ACCESS;
  MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
  MPU_InitStruct.IsShareable = MPU_ACCESS_SHAREABLE;
  MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE;
  MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;

  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  /**
   * Initializes and configures the Region and the memory to be protected
   * XSPI2 Range 0x70000000 - 0x77FFFFFF, Size 0x08000000 = 128MB
   */
  MPU_InitStruct.Number = MPU_REGION_NUMBER1;
  MPU_InitStruct.BaseAddress = 0x70000000;
  MPU_InitStruct.Size = MPU_REGION_SIZE_128MB;
  MPU_InitStruct.SubRegionDisable = 0x0;
  MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL1;
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
  MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
  MPU_InitStruct.IsShareable = MPU_ACCESS_NOT_SHAREABLE;
  MPU_InitStruct.IsCacheable = MPU_ACCESS_CACHEABLE;
  MPU_InitStruct.IsBufferable = MPU_ACCESS_BUFFERABLE;

  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  /**
   * Initializes and configures the Region and the memory to be protected
   * XSPI2 Range 0x70000000 - 0x70200000, Size 0x00200000 = 2MB
   */
  MPU_InitStruct.Number = MPU_REGION_NUMBER2;
  MPU_InitStruct.BaseAddress = 0x70000000;
  MPU_InitStruct.Size = MPU_REGION_SIZE_2MB;
  MPU_InitStruct.SubRegionDisable = 0x0;
  MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL1;
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
  MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_ENABLE;
  MPU_InitStruct.IsShareable = MPU_ACCESS_NOT_SHAREABLE;
  MPU_InitStruct.IsCacheable = MPU_ACCESS_CACHEABLE;
  MPU_InitStruct.IsBufferable = MPU_ACCESS_BUFFERABLE;

  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  /**
   * Initializes and configures the Region and the memory to be protected
   * XSPI1 Range 0x90000000 - 0x91FFFFFF, Size 0x02000000 = 32MB
   */
  MPU_InitStruct.Number = MPU_REGION_NUMBER3;
  MPU_InitStruct.BaseAddress = 0x90000000;
  MPU_InitStruct.Size = MPU_REGION_SIZE_32MB;
  MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL1;
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
  MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
  MPU_InitStruct.IsShareable = MPU_ACCESS_NOT_SHAREABLE;
  MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE;
  MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;

  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  /**
   * Initializes and configures the Region and the memory to be protected
   */
  MPU_InitStruct.Number = MPU_REGION_NUMBER4;
  MPU_InitStruct.BaseAddress = 0x20000000;
  MPU_InitStruct.Size = MPU_REGION_SIZE_64KB;
  MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL1;
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
  MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
  MPU_InitStruct.IsShareable = MPU_ACCESS_NOT_SHAREABLE;
  MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE;
  MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;

  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  /**
   * Initializes and configures the Region and the memory to be protected
   * RAM Range 0x24000000 - 0x24080000, Size 0x00080000 = 512kB
   */
  MPU_InitStruct.Number = MPU_REGION_NUMBER5;
  MPU_InitStruct.BaseAddress = 0x24000000;
  MPU_InitStruct.Size = MPU_REGION_SIZE_512KB;
  MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL1;
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
  MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
  MPU_InitStruct.IsShareable = MPU_ACCESS_SHAREABLE;
  MPU_InitStruct.IsCacheable = MPU_ACCESS_CACHEABLE;
  MPU_InitStruct.IsBufferable = MPU_ACCESS_BUFFERABLE;

  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  /**
   * Initializes and configures the Region and the memory to be protected
   * RAM_CMD for GPU2D - Nemagfx_Memory_Pool_Buffer nemagfx_pool_mem
   * RAM Range 0x2406C000 - 0x24070000, Size 0x00004000 = 16kB
   */
  MPU_InitStruct.Number = MPU_REGION_NUMBER6;
  MPU_InitStruct.BaseAddress = 0x2406C000;
  MPU_InitStruct.Size = MPU_REGION_SIZE_16KB;
  MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL0;
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
  MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
  MPU_InitStruct.IsShareable = MPU_ACCESS_SHAREABLE;
  MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE; /* For GPU2D - no cache! */
  MPU_InitStruct.IsBufferable = MPU_ACCESS_BUFFERABLE;

  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  /* Enables the MPU */
  HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
}
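
The region 6 settings can be cross-checked against the pool size from nema_hal.c. The sketch below is a host-side calculation only; it assumes the linker places nemagfx_pool_mem at the very start of RAM_CMD (0x2406C000), and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the post: MPU region 6 and the NemaGFX pool. */
#define SKETCH_REGION6_BASE   0x2406C000u
#define SKETCH_REGION6_SIZE   (16u * 1024u)   /* MPU_REGION_SIZE_16KB */
#define SKETCH_POOL_SIZE      24320u          /* NEMAGFX_MEM_POOL_SIZE */

/* Assumption: the linker places nemagfx_pool_mem at the start of RAM_CMD. */
#define SKETCH_POOL_BASE      0x2406C000u

/* Hypothetical helper: bytes of [base, base + size) that lie past the end of the region. */
static inline uint32_t sketch_bytes_past_region(uint32_t base, uint32_t size,
                                                uint32_t region_base, uint32_t region_size)
{
  uint32_t buf_end    = base + size;
  uint32_t region_end = region_base + region_size;

  if (buf_end <= region_end)
  {
    return 0u;
  }
  if (base >= region_end)
  {
    return size;
  }
  return buf_end - region_end;
}
```

Under that assumption, 7936 bytes at the top of the pool (from 0x24070000 upward) would fall outside the non-cacheable region 6 and be covered by the cacheable region 5 instead. Whether this matters depends on where tsi_malloc places the command lists, so it is something to verify rather than a confirmed cause.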

I played with the clocks as well.

Is it possible to send me a zip file with the working test project so I can try to fix the problem that way?

Best regards, Jack.

JBJOE.1 · ‎2025-10-30

Hi @Jack3

Let's take a step back and start from one of the NeoChromSDK examples (I tested with version 1.3.0).
Start with this project:
\NeoChromSDK\Projects\STM32H7S78-DK\Applications\GPU2D\vector_watchface\
Find the file
\NeoChromSDK\Projects\STM32H7S78-DK\Applications\GPU2D\vector_watchface\Src\svg.c

Replace the app_main() function in that file with this:

void app_main()
{

  nema_vg_init_stencil_pool(RESX, RESY, STENCIL_MEM_POOL_ID);

  nema_cmdlist_t cl = nema_cl_create_sized(8 * 1024);
  nema_cl_bind_sectored_circular(&cl,8);
  
  unsigned char green_channel = 0x00;

  while(run)
  {

    img_obj_t *fbo = get_current_framebuffer();
    nema_bind_dst_tex((uintptr_t)fbo->bo.base_phys, fbo->w, fbo->h, fbo->format, fbo->stride);
    
    nema_set_clip(0, 0, RESX, RESY);
    nema_clear(nema_rgba(0xff, green_channel++, 0, 0xff));

    nema_matrix3x3_t m;
    NEMA_VG_PAINT_HANDLE paint;
    paint = nema_vg_paint_create();

    nema_mat3x3_load_identity(m);
  //  nema_mat3x3_translate(m, 0.0, 0.0); // fun to play with 
  //  nema_mat3x3_rotate(m, 0.0); // fun to play with
  //  nema_mat3x3_scale(m, 1.0, 1.0); // fun to play with 

    nema_vg_paint_set_type(paint, NEMA_VG_PAINT_COLOR);
    nema_vg_paint_set_opacity(paint, 1.0);
    nema_vg_paint_set_paint_color(paint, nema_rgba(0xFF, green_channel, 0xFF, 0xFF));
    nema_vg_stroke_set_width(15.0);

    // Choose a fill rule
    nema_vg_set_fill_rule(NEMA_VG_STROKE);
//    nema_vg_set_fill_rule(NEMA_VG_FILL_EVEN_ODD);
//    nema_vg_set_fill_rule(NEMA_VG_FILL_NON_ZERO);
    
    // Basic blend mode setup for now
    nema_set_blend_fill(NEMA_BL_SRC_OVER);
    nema_vg_set_blend(NEMA_BL_SRC_OVER);

    nema_enable_aa_flags(1);

    nema_vg_set_quality(NEMA_VG_QUALITY_MAXIMUM); 

    nema_vg_draw_rect(80, 140, 200, 200, m, paint);
    nema_vg_draw_circle(400, 240, 100, m, paint);
    nema_vg_draw_rounded_rect(520, 140, 200, 200, 40, 40, m, paint);

    nema_cl_submit(&cl);
    nema_cl_wait(&cl);
    
    swap_buffers(); // wait for sync up with the LTDC controller

    nema_vg_paint_destroy(paint);
  }
  nema_cl_unbind();
  nema_cl_destroy(&cl);
}

This example draws your three VG shapes. All the exotic blending operations from the previous code are reduced to a simple NEMA_BL_SRC_OVER. The rendering is synchronized with the LTDC by calling swap_buffers(). In this configuration you can experiment with nema_vg_set_quality() and nema_vg_set_fill_rule(). This configuration draws every time on my board.
Can you reproduce this setup?

Regards,
Jakob

Jakob BJOERN
Senior Software Engineer | STM32 Graphics