PSSI reception with GPDMA circular linked list

mkrk
Associate III

I need to do a long (100k+ samples) PSSI reception with GPDMA using a linked list on an STM32H7R3L8H6H. CubeMX version is v6.15.0 and CubeIDE is v1.19.0. Since a single DMA request can't be that long, I am thinking of doing it with two circular linked-list nodes. There are no good examples (at least I didn't find any), so I am trying to do it with my own logic.

The idea is to transfer data from the PSSI peripheral to XSPI1 PSRAM in 8192-byte blocks. When one block finishes and the second one starts, I increase the finished block's destination address by 8192. The requests execute in a circle.

I set up GPDMA channel 12 for the linked list:

[Screenshot: GPDMA channel 12 configuration in CubeMX]

And here is the linked list with two nodes, N1 and N2, configured exactly the same (except for the name):

[Screenshots: linked-list node N1 and N2 configuration in CubeMX]

Generated setup function in linked_list.c

HAL_StatusTypeDef MX_PSSI_DMA_LL_Config(void)
{
  HAL_StatusTypeDef ret = HAL_OK;
  /* DMA node configuration declaration */
  DMA_NodeConfTypeDef pNodeConfig;

  /* Set node configuration ################################################*/
  pNodeConfig.NodeType = DMA_GPDMA_LINEAR_NODE;
  pNodeConfig.Init.Request = GPDMA1_REQUEST_PSSI;
  pNodeConfig.Init.BlkHWRequest = DMA_BREQ_SINGLE_BURST;
  pNodeConfig.Init.Direction = DMA_PERIPH_TO_MEMORY;
  pNodeConfig.Init.SrcInc = DMA_SINC_FIXED;
  pNodeConfig.Init.DestInc = DMA_DINC_INCREMENTED;
  pNodeConfig.Init.SrcDataWidth = DMA_SRC_DATAWIDTH_WORD;
  pNodeConfig.Init.DestDataWidth = DMA_DEST_DATAWIDTH_WORD;
  pNodeConfig.Init.SrcBurstLength = 1;
  pNodeConfig.Init.DestBurstLength = 1;
  pNodeConfig.Init.TransferAllocatedPort = DMA_SRC_ALLOCATED_PORT0|DMA_DEST_ALLOCATED_PORT1;
  pNodeConfig.Init.TransferEventMode = DMA_TCEM_EACH_LL_ITEM_TRANSFER;
  pNodeConfig.Init.Mode = DMA_NORMAL;
  pNodeConfig.TriggerConfig.TriggerPolarity = DMA_TRIG_POLARITY_MASKED;
  pNodeConfig.DataHandlingConfig.DataExchange = DMA_EXCHANGE_NONE;
  pNodeConfig.DataHandlingConfig.DataAlignment = DMA_DATA_RIGHTALIGN_ZEROPADDED;
  pNodeConfig.SrcAddress = (uint32_t)(&hpssi.Instance->DR);
  pNodeConfig.DstAddress = 0;
  pNodeConfig.DataSize = 8192;

  /* Build PSSI_DMA_LL_N1 Node */
  ret |= HAL_DMAEx_List_BuildNode(&pNodeConfig, &PSSI_DMA_LL_N1);

  /* Insert PSSI_DMA_LL_N1 to Queue */
  ret |= HAL_DMAEx_List_InsertNode_Tail(&PSSI_DMA_LL, &PSSI_DMA_LL_N1);

  /* Set node configuration ################################################*/

  /* Build PSSI_DMA_LL_N2 Node */
  ret |= HAL_DMAEx_List_BuildNode(&pNodeConfig, &PSSI_DMA_LL_N2);

  /* Insert PSSI_DMA_LL_N2 to Queue */
  ret |= HAL_DMAEx_List_InsertNode_Tail(&PSSI_DMA_LL, &PSSI_DMA_LL_N2);
  ret |= HAL_DMAEx_List_SetCircularModeConfig(&PSSI_DMA_LL, &PSSI_DMA_LL_N1);

   return ret;
}

I call this function manually and also link the GPDMA channel with the list:

MX_PSSI_DMA_LL_Config();
HAL_DMAEx_List_LinkQ(&handle_GPDMA1_Channel12, &PSSI_DMA_LL);

Callback registrations:

HAL_DMA_RegisterCallback(&handle_GPDMA1_Channel12, HAL_DMA_XFER_CPLT_CB_ID,  DMATransferComplete);
HAL_DMA_RegisterCallback(&handle_GPDMA1_Channel12, HAL_DMA_XFER_ERROR_CB_ID, DMATransferError);
HAL_DMA_RegisterCallback(&handle_GPDMA1_Channel12, HAL_DMA_XFER_ABORT_CB_ID, DMATransferAbort);

In the start-up code I set the destination addresses in the N1 and N2 node link registers. The link registers are placed in AHB SRAM1, which is configured as non-cacheable with the MPU.

Here's the start of DMA:

PSSI_DMA_LL_N1.LinkRegisters[NODE_CDAR_DEFAULT_OFFSET] = (uint32_t)buffer;
PSSI_DMA_LL_N2.LinkRegisters[NODE_CDAR_DEFAULT_OFFSET] = (uint32_t)buffer + 8192;
  
HAL_DMAEx_List_Start_IT(&handle_GPDMA1_Channel12);

If you wonder why I access the registers directly: that's what I plan to do in the callback, and I need to switch buffers many times anyway. Rebuilding the linked list seems like too much. For simplicity I have left the HAL return value checks out of the code posted here.

The PSSI is enabled by this time by the CubeMX-generated code. I can see correct values in the PSSI DR register with the debugger.

In DMATransferComplete my plan is to increase the finished node's destination address with code like this:

  if (node == 0)
  {
    PSSI_DMA_LL_N1.LinkRegisters[NODE_CDAR_DEFAULT_OFFSET] += 8192;
  }
  else
  {
    PSSI_DMA_LL_N2.LinkRegisters[NODE_CDAR_DEFAULT_OFFSET] += 8192;
  }

But I run into a question: how do I know which node just finished? Is there some status register in the GPDMA? Or do I just blindly toggle the active node counter with node = 1 - node? That doesn't feel robust.

Another, and currently even bigger, problem is that the linked list does not circulate. By incrementing a simple integer in the transfer-complete callback and printing it out in the main loop, I see only 1 and 2. Using debug breakpoints seems to break the transfers, which is why I used a non-invasive method. I do not get DMA error callbacks. I have seen data reach RAM, but not always.

Reference manual RM0477, chapter 12.4.3 "GPDMA circular buffering with linked-list programming", raises more questions. It seems to suggest that only the second node should loop and that half-transfer-complete interrupts should be enabled. I don't grasp that idea...

I need some advice on how to get this working.

The whole thing works with a normal single 10k PSSI DMA request, so I know the electronics are okay.

One more question: does the linked list guarantee no data loss when switching between items?

4 REPLIES
mkrk
Associate III

I didn't get circular mode working. But since there is no limit on the number of linked-list items, I generated the items at runtime to perform sequential block transfers. Unfortunately it did not work either, and the reason is the very high data rate. We sample 16 bits at 100 MHz from the PSSI (the maximum it supports) and send that to external Hexa-SPI PSRAM. CPU and system clocks are maxed out. A PSSI DMA transfer works for one block, which probably means one continuous DMA request. But it appears that every other transfer on the AHB or AXI bus disturbs the data flow so much that data is lost. In the case of the linked list, that disturbance comes from reading the next linked-list item's registers.
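For reference, 16 bits at 100 MHz works out to 100 MHz × 2 bytes ≈ 200 MB/s of sustained throughput that has to cross the bus fabric into the external PSRAM, so there is very little headroom for any other bus activity.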

Then I tried a 2D transfer with GPDMA channel 12: basically all the offsets set to zero and the repeat count used to multiply the block size. I should have used it from the start; I don't know why I even bothered messing with the linked list. Anyway, now I did, and it is almost working... but this is where things get interesting.

I use 4K blocks (they can be repeated up to 2048 times). I have enabled only the error and full-transfer (FT) interrupts, and the transfer event is set to block transfer (not repeated block). I use LL, not HAL, to minimize code. Critical functions are in ITCM. Caches are enabled (with the MPU). The CPU sits in WFI; nothing else runs (at least nothing I'm aware of). And somehow I get data loss (values of zero, or simply no data) at the 2K sample boundary. When I change the block size, the symptom is the same: problems at half the block address.
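For reference, the interrupt setup is roughly this (a sketch using the LL helpers; the exact function and IRQn names are assumptions to verify against your LL driver and startup file):

/* Enable only the error and transfer-complete interrupts on GPDMA1 channel 12;
   the half-transfer interrupt is deliberately left disabled. */
LL_DMA_EnableIT_TC(GPDMA1, LL_DMA_CHANNEL_12);    /* transfer complete     */
LL_DMA_EnableIT_DTE(GPDMA1, LL_DMA_CHANNEL_12);   /* data transfer error   */
LL_DMA_EnableIT_ULE(GPDMA1, LL_DMA_CHANNEL_12);   /* update link error     */
LL_DMA_EnableIT_USE(GPDMA1, LL_DMA_CHANNEL_12);   /* user setting error    */

NVIC_SetPriority(GPDMA1_Channel12_IRQn, 0);
NVIC_EnableIRQ(GPDMA1_Channel12_IRQn);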

It sounds like the half-transfer (HT) interrupt is triggering, but it's disabled (masked). The HT flag still gets set, though, and that cannot be turned off. So it appears to me that the DMA controller takes a few cycles to deal with its registers or the NVIC, and at that moment it loses data due to this extra arbitration. Is that possible? The bad part is that neither the PSSI overrun flag nor any DMA error flag tells me there is data loss; I only see it when analyzing the values. During development testing I can feed in a known pattern and detect the loss, but in the real application it wouldn't be that simple.

The only way I have got it working without dropping data is to disable the DMA interrupt completely and just use a good old magic delay on the CPU. But that's not a nice solution.

It also doesn't help that the PSSI is an AHB2 peripheral, so the GPDMA port 0 shortcut to APB peripherals is not usable. And even though the PSSI has a FIFO depth of 8 words, it (or the GPDMA) doesn't support burst reads from the PSSI.

Any ideas?

guigrogue
Associate II

Hello @mkrk

I don't have much experience with CubeMX; I did a linked-list GPDMA with the DCMI (closely related to the PSSI) in circular mode at register level some time ago, so I hope I can be helpful here.
To answer some of your questions:

"How do I know which node just finished?"
- The simplest way is to keep a counter and increment it in the transfer-complete interrupt with block/LLI granularity (GPDMA_CxTR2.TCEM = 0b00 or 0b10; 0b01 also works for 2D) to track the progress of your linked list. See the sketch after these answers.

"Is there some status register of the GPDMA?"
- GPDMA_CxSR is the status register (17.8.8 in the RM0456 you mentioned). Bit 8, TCF, is set to 1 on transfer complete, and you can enable an interrupt on this event with GPDMA_CxCR.TCIE (also bit 8). The flag has to be cleared manually via GPDMA_CxFCR.TCF, otherwise you can get stuck in an interrupt loop.

"Or do I just blindly toggle the active node counter with node = 1 - node?"
- That is roughly how it was done on some other parts with the DMA double-buffer mode. Here, if possible, you should persevere with a linked list whose parameters are all known at compile time, to avoid runtime changes in the nodes.

"Does the linked list guarantee no data loss between item switching?"
- The delay between two nodes is relatively small, but data loss can happen, especially in your case of a continuous fast transfer such as the PSSI. There are overrun/underrun status flags for the data register (the source of your GPDMA transfer); when debugging, pay attention to those in PSSI_RIS (to enable an interrupt, see PSSI_IER) and you will see whether you lose data. Making runtime changes to the nodes in the transfer-complete interrupt can increase the chance of data loss.
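Here is a minimal sketch of what I mean by the counter, assuming your HAL callback registration from above and TCEM set to "each LL item" so that one TC event fires per node (names like tc_count are just for illustration):

/* Assumes the device HAL header is included. */
static volatile uint32_t tc_count = 0;   /* completed linked-list item transfers so far */

/* Registered with HAL_DMA_RegisterCallback(..., HAL_DMA_XFER_CPLT_CB_ID, ...) */
void DMATransferComplete(DMA_HandleTypeDef *hdma)
{
  (void)hdma;

  /* One TC event per LL item means the finished node simply alternates:
     even count -> N1 just finished, odd count -> N2 just finished. */
  uint32_t finished_node = tc_count & 1U;
  tc_count++;

  if (finished_node == 0U)
  {
    /* ...advance N1's destination here (careful: N2 is already running,
       so the new address must not overlap the block N2 is filling)... */
  }
  else
  {
    /* ...advance N2's destination... */
  }
}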

To make the LL circular, there is a "circular mode" option to enable in CubeMX under System Core > GPDMA1.
As in a basic linked list, each node stores the address (the LSBs) of the next one; in circular mode, you just point the last node back to the first node's address. Be sure to enable the ULL bit in the GPDMA_CxLLR register: it indicates to the hardware that there is a next node and enables it to fetch that node's address.
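To illustrate, here is a register-level sketch of the idea (not CubeMX output; the struct, its placement, and the CMSIS bit names such as DMA_CLLR_ULL are assumptions to check against your device header and RM):

/* Two linear linked-list items; a linear node holds the registers the GPDMA
   re-fetches, in fetch order. Place both in non-cacheable RAM (e.g. AHB SRAM1). */
typedef struct {
  uint32_t ctr1, ctr2, cbr1, csar, cdar, cllr;
} dma_lli_t;

static dma_lli_t node1 __attribute__((aligned(4)));
static dma_lli_t node2 __attribute__((aligned(4)));

static void link_nodes_circular(void)
{
  /* ...ctr1/ctr2/cbr1/csar/cdar of both nodes filled elsewhere... */

  /* CLLR holds only the low address bits (LA[15:2]); the upper bits come from
     the channel's CLBAR, so both nodes must live in the same 64 KB region. */
  const uint32_t update_bits = DMA_CLLR_UT1 | DMA_CLLR_UT2 | DMA_CLLR_UB1 |
                               DMA_CLLR_USA | DMA_CLLR_UDA | DMA_CLLR_ULL;

  node1.cllr = ((uint32_t)&node2 & DMA_CLLR_LA) | update_bits;  /* N1 -> N2           */
  node2.cllr = ((uint32_t)&node1 & DMA_CLLR_LA) | update_bits;  /* N2 -> N1: circular */
}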

If you need to update the nodes at runtime, I would advise playing with the half-transfer interrupt and making the changes while the transfer is still running in the background, to minimize the latency at the node transition. The best option is not to change them at runtime at all; that makes the node updates much smoother.

The GPDMA flags have to be cleared by software; the half-transfer flag will be set to 1 anyway once a complete transfer is reached.
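Inside your IRQ handler, something like this (register and bit names taken from the CMSIS device header, to be verified for your part):

GPDMA1_Channel12->CFCR = DMA_CFCR_TCF | DMA_CFCR_HTF;   /* clear TC and HT flags of channel 12 */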

I've worked with input data up to 96 MHz (with the DCMI), and the issues mostly came from an unstable external clock rather than from internal delays. In any case, check the interrupt handler function called inside the IRQ. The HAL one can be too heavy, and it might be better to override it with something lighter that doesn't check every flag and do all the other bookkeeping.

Setting up a compact linked list also limits the number of parameters the hardware has to fetch on each transfer complete. Update only the needed registers (DAR, LLR).

Using PendSV could reduce the latency created by the interrupt: move some of your code out of the DMA IRQ and into the PendSV handler. The DMA interrupt then finishes sooner, your DMA transfer resumes, and your code runs afterwards in the PendSV IRQ (I haven't benchmarked it).
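Roughly like this (a sketch; the IRQ handler name and the flag clearing are assumptions for your setup, and PendSV must be given the lowest priority so it only runs after the DMA IRQ returns):

/* Slimmed-down DMA IRQ: acknowledge the flag, defer the heavy work to PendSV. */
void GPDMA1_Channel12_IRQHandler(void)
{
  GPDMA1_Channel12->CFCR = DMA_CFCR_TCF;     /* acknowledge transfer complete */
  SCB->ICSR = SCB_ICSR_PENDSVSET_Msk;        /* pend PendSV                   */
}

void PendSV_Handler(void)
{
  /* ...buffer bookkeeping, node/address updates, signalling the main loop... */
}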

mkrk
Associate III

Thanks @guigrogue for the ideas. I tried the things you described but didn't have any luck. In the end the 2D transfer seemed much simpler; I've used it before on the H7 with the DCMI. I don't know why I overlooked it at first, probably because I was relying on the fancy new linked-list DMA feature. But sometimes you just need to keep it simple :)

The good thing is that while developing a timeout timer for the DMA I saw some strange ISR timings, and then I realized that I had made a small but impactful configuration mistake: I thought the DMA ISR was executing after all block repetitions were done, but it was executing after every single block. The transfer event must be set to LL_DMA_TCEM_RPT_BLK_TRANSFER; this avoids traffic on the internal buses during the PSSI -> PSRAM transfer.
Simplified DMA setup code is below. Maybe it helps someone else too:

#define DMA_CONTROL       GPDMA1
#define DMA_CHANNEL       LL_DMA_CHANNEL_12
#define DMA_BLOCK_SIZE    4096

uint16_t dstBuffer[0x100000]; // 1M (0x100000) 16-bit samples = 2 MB
uint32_t dstBufferSize = sizeof(dstBuffer);

LL_DMA_InitTypeDef DMA_InitStruct = {0};
LL_DMA_StructInit(&DMA_InitStruct);

DMA_InitStruct.SrcAddress               = (uint32_t)(&hpssi.Instance->DR);
DMA_InitStruct.DestAddress              = (uint32_t)dstBuffer;
DMA_InitStruct.Direction                = LL_DMA_DIRECTION_PERIPH_TO_MEMORY;
DMA_InitStruct.Request                  = LL_GPDMA1_REQUEST_PSSI;
DMA_InitStruct.BlkHWRequest             = LL_DMA_HWREQUEST_SINGLEBURST;
DMA_InitStruct.DataAlignment            = LL_DMA_DATA_ALIGN_ZEROPADD;
DMA_InitStruct.SrcBurstLength           = 1;
DMA_InitStruct.DestBurstLength          = 64; 
DMA_InitStruct.SrcDataWidth             = LL_DMA_SRC_DATAWIDTH_WORD;
DMA_InitStruct.DestDataWidth            = LL_DMA_DEST_DATAWIDTH_WORD;
DMA_InitStruct.SrcIncMode               = LL_DMA_SRC_FIXED;
DMA_InitStruct.DestIncMode              = LL_DMA_DEST_INCREMENT;
DMA_InitStruct.Priority                 = LL_DMA_HIGH_PRIORITY;
DMA_InitStruct.TransferEventMode        = LL_DMA_TCEM_RPT_BLK_TRANSFER; /* <-- Has to be this, not LL_DMA_TCEM_BLK_TRANSFER */
DMA_InitStruct.SrcAllocatedPort         = LL_DMA_SRC_ALLOCATED_PORT0;
DMA_InitStruct.DestAllocatedPort        = LL_DMA_DEST_ALLOCATED_PORT1;
DMA_InitStruct.BlkDataLength            = DMA_BLOCK_SIZE;
DMA_InitStruct.BlkRptCount              = (dstBufferSize / DMA_BLOCK_SIZE) - 1;
DMA_InitStruct.SrcAddrOffset            = 0;
DMA_InitStruct.DestAddrOffset           = 0;
DMA_InitStruct.BlkRptSrcAddrOffset      = 0;
DMA_InitStruct.BlkRptDestAddrOffset     = 0;
DMA_InitStruct.SrcAddrUpdateMode        = LL_DMA_BURST_SRC_ADDR_INCREMENT;
DMA_InitStruct.DestAddrUpdateMode       = LL_DMA_BURST_DEST_ADDR_INCREMENT;
DMA_InitStruct.BlkRptSrcAddrUpdateMode  = LL_DMA_BLKRPT_SRC_ADDR_INCREMENT;
DMA_InitStruct.BlkRptDestAddrUpdateMode = LL_DMA_BLKRPT_DEST_ADDR_INCREMENT;

LL_DMA_Init(DMA_CONTROL, DMA_CHANNEL, &DMA_InitStruct);

Now it captures all the data without loss.