
How to use HASH with DMA in byte transfer mode on STM32H5: why DMA DATA PACK matters

lobna
ST Employee

Introduction

The STM32H5 microcontroller family integrates a hardware HASH peripheral capable of computing MD5, SHA-1, SHA-224, and SHA-256 digests, and this peripheral can be supplied with data efficiently through DMA.

In many real-world applications, input data is naturally handled as a byte stream (for example, network packets, encrypted payloads, or TLV structures). A common approach is to configure the DMA for byte transfers, using a byte-wide source and the HASH_DIN register as a 32-bit word destination. However, if the DMA data width and HASH data packing are not configured correctly, the transfer may still complete successfully, but the resulting digest can be incorrect.

In particular:

  • When the DMA is configured for word-to-word transfers and both source and destination addresses are word-aligned, the HASH results are typically correct.
  • Issues may arise when the source data width differs from the 32-bit HASH input register HASH_DIN, especially if data packing is not configured correctly.

This article explains the root cause of this behavior and provides guidance on how to configure DMA and HASH data packing correctly on STM32H5 to ensure valid digest computation.

1. Context and problem description

1.1 Problem reproduction 

A typical configuration on STM32H5 is:

  • HASH peripheral configured for a given algorithm (for example, SHA‑256).
  • DMA configured to move data from memory to the HASH input register.
  • The application provides a uint8_t * input buffer and a size expressed in bytes.

From the application perspective, this approach is attractive because:

  • It works naturally with byte‑addressed buffers.
  • It appears to remove any constraint on data alignment or word boundaries.

A common (but incorrect) way to use DMA in this context is to rely on the fact that the message is a sequence of bytes, and to assume that the DMA + HASH combination implicitly packs these bytes into 32‑bit words as the HASH engine requires. In practice, the CPU path uses a single 32‑bit write, while the unpacked DMA path generates one 32‑bit write per byte. As a result, the two paths do not present the same input word sequence to the HASH core.

The figure below (CPU vs DMA transfer) illustrates this behavior using a simple 4‑byte input buffer [B3 B2 B1 B0]:

Figure: CPU vs DMA transfer (Picture2.png)

  • When the CPU performs a single 32‑bit write to HASH_DIN, the four bytes are presented to the HASH as one 32‑bit word [B3 B2 B1 B0], and the digest h₁ is computed from this expected layout.
  • When DMA is used without any explicit packing, each byte B0, B1, B2, B3 is effectively sent as a separate 32‑bit word (0x000000B0, 0x000000B1, …), so the HASH core sees a completely different message and computes a different digest h₂.

Even though the original byte values are the same and the DMA transfer completes without error, the mismatch in byte‑to‑word packing leads to h₁ ≠ h₂.

1.2 Symptom  

Under this kind of configuration, the HASH transfer:

  • Starts correctly.
  • Completes apparently without error at DMA level.
  • But the computed digest is not correct, even though:
    • The input buffer content is correct.
    • The total size (in bytes) is correct.
    • The HASH algorithm configuration is correct.

This is particularly confusing because:

  • The message is identical at the byte level.
  • Both the CPU and DMA are writing to the same HASH_DIN register.
  • The DMA does not trigger any fault or AHB error.

However, the HASH engine is not receiving the same sequence of 32‑bit input words in the two cases, hence the different digest values.

2. Root cause: relation between DMA data size and HASH word-based input 

2.1 How the HASH engine processes its input

Internally, the HASH core processes the message as a sequence of 32‑bit words. Conceptually, for a 4‑byte chunk (B3, B2, B1, B0) of the message, the core expects a single 32‑bit word (for example, [B3 B2 B1 B0], depending on endianness and word-swapping configuration).

On STM32H5:

  • The HASH input is exposed as a 32‑bit write‑only register connected to an internal FIFO.
  • The HASH AHB interface is defined for 32‑bit word writes.
  • There is no automatic generic “DMA data packing” that would regroup arbitrary byte writes into correctly packed 32‑bit words.

When data is written by the CPU as a single 32‑bit word, the relationship between the four bytes and the resulting word is explicit and well‑defined. In contrast, DMA may process the message as bytes without a consistent packing strategy. In that case, each byte can end up as a separate 32-bit word written to HASH_DIN.

2.2 Why the digest differs between CPU and DMA paths

The digest difference comes from the fact that the HASH engine sees two different messages:

  • CPU path
    • One 32‑bit write: HASH_DIN = [B3 B2 B1 B0].
    • The logical message seen by HASH is [B3 B2 B1 B0].
  • DMA path without packing
    • Four 32‑bit writes:
      • HASH_DIN = 0x000000B0,
      • HASH_DIN = 0x000000B1,
      • HASH_DIN = 0x000000B2,
      • HASH_DIN = 0x000000B3.
    • The logical message seen by HASH is [00 00 00 B0, 00 00 00 B1, 00 00 00 B2, 00 00 00 B3].

Although the original input buffer [B3 B2 B1 B0] is identical in both cases, the sequence of 32‑bit words processed by the HASH core is not. Since the HASH algorithm operates on these words, not directly on the conceptual byte stream, the resulting digest h₂ is different from h₁.

In other words:

  • The problem is not that bytes are lost or corrupted.
  • The problem is that the bytes are packed into words differently depending on whether the CPU or the DMA performs the writes.
  • Without explicit and consistent byte‑to‑word packing, the CPU and DMA paths cannot be expected to produce the same digest for the same byte buffer.

The next sections of this article show how to configure the data path so that:

  • bytes are packed into 32‑bit words in a controlled way, and
  • the HASH input sequence is identical between CPU and DMA usage, ensuring h₁ = h₂.

3. Solution: Enforcing correct packing in the DMA

The root cause of the issue is that the CPU path and the DMA path do not present the same 32‑bit words to the HASH peripheral:

  • CPU: one 32‑bit write → the HASH sees a single word like [B3 B2 B1 B0].
  • DMA (without programmed data handling): each byte is sent as a separate 32‑bit word (0x000000B0, 0x000000B1, …) → the HASH sees a completely different message, so the digest is different.

To solve this, we must ensure that the GPDMA packs the bytes into 32‑bit words in the same way a CPU 32‑bit write would. On STM32H5, this is done with the Programmed Data Handling feature, controlled by the PAM[1:0], SB, DB, and DH bits of GPDMA_CxTR1 (see RM0481, “Programmed data handling”).

The idea is:

  1. We have a problem: CPU and DMA generate different 32‑bit input words for HASH.
  2. We fix it by using PAM so that DMA performs the desired byte → word packing.
  3. We then choose PAM/SB/DB/DH according to our intended byte order and the system endianness, so that the destination data stream is exactly what we want.

3.1 Step 1: Decide what byte order you want at the HASH input

Before configuring the DMA, you must decide what you want the HASH engine to see, in terms of byte order inside each 32‑bit word.

For a 4‑byte group B3, B2, B1, B0, typical options are:

  • Word W0 = [B3 B2 B1 B0].
  • Word W0 = [B0 B1 B2 B3].

The right choice depends on:

  • Your protocol or application convention.
  • The endianness of your reference implementation.
  • The HASH_DATATYPE (data type / swap) setting.

This convention must be:

  • Consistent with your test vectors (golden hashes).
  • The same for both the CPU and DMA paths (they must feed identical words to HASH_DIN).

Once clarified, consult the Programmed data handling table in the DMA section of your microcontroller’s reference manual. It helps you determine the combination of SDW_LOG2, DDW_LOG2, PAM[1:0], SB, DB, and DH needed to obtain the required destination data stream.

3.2 Step 2: Configure GPDMA to pack bytes into words

For the typical case discussed in this article:

  • Source data width: Byte (SDW_LOG2 = 00).
  • Destination data width: Word (DDW_LOG2 = 10).

The “Programmed data handling” table gives, for each combination of:

  • PAM[1:0] (RA/OP, RA/SE, PACK, etc.)
  • SB (source byte reordering)
  • DB and DH (destination byte / half‑word exchange)

the resulting destination data stream. Depending on the combination, it may appear, for example, as [B7, B6, B5, B4, B3, B2, B1, B0] or as [B3, B2, B1, B0].

In practice:

  1. Look at the rightmost column and locate the pattern that matches the destination byte order you want (including any padding zeros).
  2. Read the corresponding PAM[1:0], SB, DB, DH values.
  3. Program those bits into GPDMA_CxTR1.

Conceptual example (to be aligned with your actual convention): 

  • You want the GPDMA to take four consecutive bytes and pack them into [B3 B2 B1 B0], matching a 32‑bit CPU write.
  • In the SDW = byte, DDW = word part of the table, you select the row where the destination data stream is exactly B3, B2, B1, B0 (or the equivalent for your endian convention).
  • You then program PAM, SB, DB, DH accordingly.

In code, after HAL_DMA_Init() you typically do something like the following (for example in HAL_HASH_MspInit()):

/* Positions and masks from the reference manual / device headers */
#define GPDMA_CTR1_PAM_Pos   11U
#define GPDMA_CTR1_PAM_Msk   (0x3U << GPDMA_CTR1_PAM_Pos)

/* Example encoding – adapt to your device headers */
#define GPDMA_PAM_PACK       (0x2U << GPDMA_CTR1_PAM_Pos)  /* 10: PACK */

void HAL_HASH_MspInit(HASH_HandleTypeDef *hhash)
{
    __HAL_RCC_HASH_CLK_ENABLE();

    /* HASH DMA Init */
    /* GPDMA1_REQUEST_HASH_IN Init */
    handle_GPDMA1_Channel0.Instance = GPDMA1_Channel0;
    handle_GPDMA1_Channel0.Init.Request = GPDMA1_REQUEST_HASH_IN;
    handle_GPDMA1_Channel0.Init.BlkHWRequest = DMA_BREQ_SINGLE_BURST;
    handle_GPDMA1_Channel0.Init.Direction = DMA_MEMORY_TO_PERIPH;
    handle_GPDMA1_Channel0.Init.SrcInc = DMA_SINC_INCREMENTED;
    handle_GPDMA1_Channel0.Init.DestInc = DMA_DINC_FIXED;
    handle_GPDMA1_Channel0.Init.SrcDataWidth = DMA_SRC_DATAWIDTH_BYTE;
    handle_GPDMA1_Channel0.Init.DestDataWidth = DMA_DEST_DATAWIDTH_WORD;
    handle_GPDMA1_Channel0.Init.Priority = DMA_LOW_PRIORITY_LOW_WEIGHT;
    handle_GPDMA1_Channel0.Init.SrcBurstLength = 4;
    handle_GPDMA1_Channel0.Init.DestBurstLength = 4;
    handle_GPDMA1_Channel0.Init.TransferAllocatedPort = DMA_SRC_ALLOCATED_PORT0 | DMA_DEST_ALLOCATED_PORT0;
    handle_GPDMA1_Channel0.Init.TransferEventMode = DMA_TCEM_BLOCK_TRANSFER;
    handle_GPDMA1_Channel0.Init.Mode = DMA_NORMAL;
    if (HAL_DMA_Init(&handle_GPDMA1_Channel0) != HAL_OK)
    {
      Error_Handler();
    }

    /* Configure PAM in GPDMA_CxTR1 (read-modify-write to preserve the other bits) */
    uint32_t reg = GPDMA1_Channel0->CTR1;
    reg &= ~GPDMA_CTR1_PAM_Msk;   /* clear PAM[1:0] */
    reg |= GPDMA_PAM_PACK;        /* PAM = PACK: bytes packed into words */
    GPDMA1_Channel0->CTR1 = reg;

    __HAL_LINKDMA(hhash, hdmain, handle_GPDMA1_Channel0);

    if (HAL_DMA_ConfigChannelAttributes(&handle_GPDMA1_Channel0, DMA_CHANNEL_NPRIV) != HAL_OK)
    {
      Error_Handler();
    }

    /* USER CODE BEGIN HASH_MspInit 1 */

    /* USER CODE END HASH_MspInit 1 */
}

If you prefer using CubeMX, the same configuration is expressed in the Data Handling section of the GPDMA channel (as shown below):

  • Data Handling Configuration: Enable
  • Data Alignment: Packed at destination data width (for src < dest)
  • Exchange Source Byte / Destination Half‑Word / Destination Byte: Disable (or set them according to the byte order you want at destination).

     

Figure: CubeMX GPDMA Data Handling configuration (CubeMX_hash_configuration.png)

CubeMX then generates:

void HAL_HASH_MspInit(HASH_HandleTypeDef *hhash)
{
    DMA_DataHandlingConfTypeDef DataHandlingConfig;

    /* USER CODE BEGIN HASH_MspInit 0 */

    /* USER CODE END HASH_MspInit 0 */
    /* Peripheral clock enable */
    __HAL_RCC_HASH_CLK_ENABLE();

    /* HASH DMA Init */
    /* GPDMA1_REQUEST_HASH_IN Init */
    handle_GPDMA1_Channel0.Instance = GPDMA1_Channel0;
    handle_GPDMA1_Channel0.Init.Request = GPDMA1_REQUEST_HASH_IN;
    handle_GPDMA1_Channel0.Init.BlkHWRequest = DMA_BREQ_SINGLE_BURST;
    handle_GPDMA1_Channel0.Init.Direction = DMA_MEMORY_TO_PERIPH;
    handle_GPDMA1_Channel0.Init.SrcInc = DMA_SINC_INCREMENTED;
    handle_GPDMA1_Channel0.Init.DestInc = DMA_DINC_FIXED;
    handle_GPDMA1_Channel0.Init.SrcDataWidth = DMA_SRC_DATAWIDTH_BYTE;
    handle_GPDMA1_Channel0.Init.DestDataWidth = DMA_DEST_DATAWIDTH_WORD;
    handle_GPDMA1_Channel0.Init.Priority = DMA_LOW_PRIORITY_LOW_WEIGHT;
    handle_GPDMA1_Channel0.Init.SrcBurstLength = 4;
    handle_GPDMA1_Channel0.Init.DestBurstLength = 4;
    handle_GPDMA1_Channel0.Init.TransferAllocatedPort = DMA_SRC_ALLOCATED_PORT0|DMA_DEST_ALLOCATED_PORT0;
    handle_GPDMA1_Channel0.Init.TransferEventMode = DMA_TCEM_BLOCK_TRANSFER;
    handle_GPDMA1_Channel0.Init.Mode = DMA_NORMAL;
    if (HAL_DMA_Init(&handle_GPDMA1_Channel0) != HAL_OK)
    {
      Error_Handler();
    }

    DataHandlingConfig.DataExchange = DMA_EXCHANGE_NONE;
    DataHandlingConfig.DataAlignment = DMA_DATA_PACK;

    if (HAL_DMAEx_ConfigDataHandling(&handle_GPDMA1_Channel0, &DataHandlingConfig) != HAL_OK)
    {
      Error_Handler();
    }

    __HAL_LINKDMA(hhash, hdmain, handle_GPDMA1_Channel0);

    if (HAL_DMA_ConfigChannelAttributes(&handle_GPDMA1_Channel0, DMA_CHANNEL_NPRIV) != HAL_OK)
    {
      Error_Handler();
    }

    /* USER CODE BEGIN HASH_MspInit 1 */

    /* USER CODE END HASH_MspInit 1 */

}

Key point: with PACK correctly configured, the DMA no longer sends one full 32‑bit word per byte, but packs several bytes into each destination word exactly as described by the table.

3.3 Step 3: Align the HASH configuration with the chosen packing (endianness)

Once the GPDMA is packing bytes correctly, the HASH must interpret the 32‑bit words in the same byte order you decided in step 1.

This is controlled by the HASH data format configuration (for example HASH_DATATYPE and any word/byte swap options).

From an instructional perspective:

  1. Choose your global convention, for example:
    “The logical message bytes are in the order B0, B1, B2, B3, … as seen by the HASH algorithm.”
  2. Use the PAM/SB/DB/DH combination that produces this byte order at the destination (see the “Destination data stream” column in the programmed data handling table).
  3. Set HASH_DATATYPE / swap so that the HASH core interprets the bytes inside each 32‑bit word consistently with that convention (no swap, byte swap, etc.).

The goal is that, for a given message:

  • The logical byte stream seen by the HASH core is identical.
  • This applies whether the data is written by the CPU or transferred via GPDMA.

3.4 What you must respect

To summarize what must be respected to fix the issue:

  • Packing responsibility
    • The GPDMA must be explicitly configured (via PAM/SB/DB/DH or CubeMX Data Handling) to pack bytes into words according to your desired ordering.
    • Do not rely on default behavior; default “byte→word” can produce one 32‑bit word per byte.
  • Endianness and intention
    • Decide clearly what 32‑bit word layout represents your message (for example [B3 B2 B1 B0]).
    • Use the programmed data handling table to pick the PAM/SB/DB/DH combination that yields this layout at the destination.
    • Configure the HASH (HASH_DATATYPE, etc.) so that the interpretation of each word matches this layout and your reference vectors.
  • Consistency CPU vs DMA
    • CPU 32‑bit writes and DMA transfers must produce the same sequence of 32‑bit words at HASH_DIN.
    • When this is true, the digest computed with DMA is guaranteed to match the digest computed with pure CPU writes.

By explicitly using GPDMA programmed data handling, you can control packing, not just alignment.
This ensures that the HASH peripheral receives the correct input word stream and resolves the digest mismatch problem.

Last update: 2026-05-13