STM32H7 JPEG encoder MCU blocks - corrupted JPEG image

rob-bits · ‎2025-06-16

I am having trouble understanding the MCU block layout for the JPEG encoder. I have an STM32H743 connected to a video decoder over the DCMI interface. In RAM I have the captured 8-bit ITU-R BT.656 YCrCb 4:2:2 output. After saving the captured data from RAM to a debug file, I can see that the image is captured properly. The byte stream looks like: Cb0-1, Y0, Cr0-1, Y1, Cb2-3, Y2, Cr2-3, Y3, ... Each line holds 720 Y, 360 Cb and 360 Cr samples, so 1440 bytes plus 8 bytes of blanking. I have 240 lines, so the resolution I am capturing is 720x240.

As described in AN4996, each 4:2:2 MCU contains 256 bytes, organized as two 8×8 Y blocks plus one 8×8 Cb block plus one 8×8 Cr block. One such MCU covers 16×8 pixels, so a 720×240 image needs 45 horizontal MCUs × 30 vertical MCUs = 1,350 MCUs in total. My confusion is about the two Y blocks. I implemented them as:
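For reference, the MCU arithmetic above can be written out as a few lines of C. This is only a sketch of the calculation: the names McuGrid and mcu_grid_422 are made up for illustration and do not come from AN4996 or the HAL.

```c
#include <assert.h>
#include <stdint.h>

#define MCU_W_422     16u   /* pixels covered horizontally by one 4:2:2 MCU */
#define MCU_H_422      8u   /* pixels covered vertically by one 4:2:2 MCU   */
#define MCU_BYTES_422 256u  /* 2 Y blocks + 1 Cb block + 1 Cr block, 64 B each */

/* Hypothetical helper type: MCU grid for a given image size. */
typedef struct {
    uint32_t cols;   /* MCUs per image row        */
    uint32_t rows;   /* MCU rows per image        */
    uint32_t total;  /* cols * rows               */
    uint32_t bytes;  /* total MCU data, in bytes  */
} McuGrid;

static McuGrid mcu_grid_422(uint32_t width, uint32_t height)
{
    /* This sketch only handles sizes that are whole multiples of the MCU. */
    assert(width % MCU_W_422 == 0u && height % MCU_H_422 == 0u);

    McuGrid g;
    g.cols  = width / MCU_W_422;
    g.rows  = height / MCU_H_422;
    g.total = g.cols * g.rows;
    g.bytes = g.total * MCU_BYTES_422;
    return g;
}
```

For 720×240 this gives 45 × 30 = 1,350 MCUs and 345,600 bytes of MCU data, which is the same as 720 × 240 × 2 bytes of packed 4:2:2 input.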

Y1 first row:
y0, y1 ... y7
Y1 second row:
y16, y17...y23
Y2 first row:
y8, y9, ... y15
Y2 second row:
y24, y25 .. y31
Is this correct?

So Y1 contains the first 8 pixel columns of the MCU, and Y2 contains the second 8 pixel columns?
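One way to check this layout is to write it down as index arithmetic. The sketch below assumes the buffer layout used in this thread (Y block 1 at offset 0, Y block 2 at 64, Cb at 128, Cr at 192, each block stored row by row); the function names are hypothetical. Here x and y are pixel coordinates inside one 16×8 MCU, and cx is the chroma column (0..7).

```c
#include <assert.h>
#include <stdint.h>

/* Offset of luma pixel (x, y) inside a 256-byte 4:2:2 MCU buffer.
 * Columns 0..7 land in Y block 1, columns 8..15 in Y block 2. */
static uint32_t mcu_luma_offset(uint32_t x, uint32_t y)
{
    assert(x < 16u && y < 8u);
    uint32_t block = x / 8u;               /* 0 = Y block 1, 1 = Y block 2 */
    return block * 64u + y * 8u + (x % 8u);
}

/* Offset of Cb sample (cx, y); one Cb sample per two luma pixels. */
static uint32_t mcu_cb_offset(uint32_t cx, uint32_t y)
{
    assert(cx < 8u && y < 8u);
    return 128u + y * 8u + cx;
}

/* Offset of Cr sample (cx, y); same layout as Cb, one block later. */
static uint32_t mcu_cr_offset(uint32_t cx, uint32_t y)
{
    assert(cx < 8u && y < 8u);
    return 192u + y * 8u + cx;
}
```

With the numbering from the question (y0..y15 on the first MCU line, y16..y31 on the second), y16 is pixel (0, 1) and maps to offset 8 in Y block 1, while y8 is pixel (8, 0) and maps to offset 64, the start of Y block 2.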

Here is my code that processes the ycrcb byte stream:

#include <stdint.h>

#define MCU_BLOCK_ROWS 8  // rows in one 8x8 block
#define MCU_BLOCK_COLS 8  // columns in one 8x8 block
#define MCU_BLOCK_SIZE (MCU_BLOCK_ROWS * MCU_BLOCK_COLS)  // 8x8 block = 64 bytes
#define MCU_SIZE 256  // 2 Y blocks + 1 Cb block + 1 Cr block
__attribute__((section(".axiram"))) static uint8_t mcuBuffers[2][MCU_SIZE];

static void ExtractMcuToBuffer(uint8_t* src, uint8_t* dest, uint32_t block_cnt) {
    uint8_t* y1 = dest;        // Y block 1
    uint8_t* y2 = dest + 64;   // Y block 2
    uint8_t* cb = dest + 128;  // Cb block
    uint8_t* cr = dest + 192;  // Cr block
    // 720x240 -> 720/16 MCU columns, 240/8 MCU rows -> 45x30 = 1350 MCUs
    uint32_t mcuRowIdx = block_cnt / 45;
    uint32_t mcuColIdx = block_cnt - (mcuRowIdx * 45);
    // Bytes per captured line as delivered by DCMI, including blanking:
    // 1440 bytes of samples + 8 bytes of blanking (determined by trial and error)
    uint32_t lineOffset = 1448;
    uint32_t rowIdx = 0;  // line inside the current MCU (0..7)
    uint32_t colIdx = 0;  // 32-bit word inside the current MCU line (0..7)
    // Start of this MCU in the source buffer:
    //   skip MCU_BLOCK_ROWS lines for every MCU row above this one,
    //   then skip 32 bytes for every MCU to the left.
    // One MCU line is 32 bytes (16 Y + 8 Cr + 8 Cb); 45 * 32 = 1440 bytes per line.
    uint32_t byteOffset = (lineOffset * MCU_BLOCK_ROWS * mcuRowIdx)
            + mcuColIdx * 32;
    uint32_t* crycby = (uint32_t*)&src[byteOffset];
    for(int i = 0; i < MCU_BLOCK_SIZE;i++) {
      // Each 32-bit word carries two pixels. As read here (little-endian):
      // byte 0 = Y, byte 1 = Cr, byte 2 = Y, byte 3 = Cb
      uint8_t Y1 = (uint8_t) ((*crycby & 0x000000FF) >> 0);
      uint8_t Cr = (uint8_t) ((*crycby & 0x0000FF00) >> 8);
      uint8_t Y2 = (uint8_t) ((*crycby & 0x00FF0000) >> 16);
      uint8_t Cb = (uint8_t) ((*crycby & 0xFF000000) >> 24);
      crycby++; // advance four bytes (two pixels)
      *cr++ = Cr;
      *cb++ = Cb;

      // The first four words of an MCU line (8 pixels) go to Y block 1
      if(colIdx < MCU_BLOCK_COLS/2) {
          *y1++ = Y1;    // y(n)
          *y1++ = Y2;    // y(n+1)
      } else { // the last four words (8 pixels) go to Y block 2
          *y2++ = Y1;    // y(n)
          *y2++ = Y2;    // y(n+1)
      }
      if(colIdx < MCU_BLOCK_COLS - 1) {
        colIdx++;
      } else { // end of this MCU line: move to the same MCU column on the next source line
        colIdx = 0;
        rowIdx++;
        byteOffset = (lineOffset * MCU_BLOCK_ROWS * mcuRowIdx ) +  rowIdx * lineOffset + mcuColIdx * 32;
        crycby = (uint32_t*)&src[byteOffset];
      }
    }
}

The block_cnt goes from 0 to 1349.

With this code, I got this JPEG image:

The file header and resolution look okay, but the content is corrupted, and I am not sure why.

Any idea?

rob-bits · ‎2025-06-18

I found a solution. Basically, my implementation was correct. The example code adds too much complexity and is not easy to integrate. Anyway, the issue I was facing was a cache coherency problem with DMA: I had to clean the D-cache each time I created an MCU block. Something like this:

	ExtractMcuToBuffer(inputPtr, mcuBuffers[bufferIndex], currentMcu);
	// Critical: Clean cache after buffer generation
	SCB_CleanDCache_by_Addr((uint32_t*)mcuBuffers[bufferIndex], MCU_SIZE);
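A side note on the clean call: the Cortex-M7 data cache works on whole 32-byte lines, so it can help to think of the cleaned range in those units. The sketch below is not part of the original fix; cache_span and CacheSpan are hypothetical names, and it only computes the line-aligned range that covers a buffer. Some CMSIS versions may already do this rounding inside SCB_CleanDCache_by_Addr, so treat it as an illustration rather than a required step.

```c
#include <assert.h>
#include <stdint.h>

#define DCACHE_LINE 32u  /* Cortex-M7 D-cache line size in bytes */

/* Hypothetical helper type: a cache-line-aligned address range. */
typedef struct {
    uintptr_t start;  /* rounded down to a line boundary          */
    uint32_t  size;   /* multiple of DCACHE_LINE covering the buffer */
} CacheSpan;

static CacheSpan cache_span(uintptr_t addr, uint32_t size)
{
    assert(size > 0u);

    const uintptr_t mask = ~(uintptr_t)(DCACHE_LINE - 1u);
    CacheSpan s;
    s.start = addr & mask;                                   /* round start down */
    uintptr_t end = (addr + size + DCACHE_LINE - 1u) & mask; /* round end up     */
    s.size = (uint32_t)(end - s.start);
    return s;
}

/* On the target, the result would be passed on roughly like this
 * (assumption: CMSIS-Core cache functions are available):
 *
 *   CacheSpan s = cache_span((uintptr_t)mcuBuffers[bufferIndex], MCU_SIZE);
 *   SCB_CleanDCache_by_Addr((uint32_t *)s.start, (int32_t)s.size);
 */
```

Since each MCU buffer here is 256 bytes, a multiple of 32, the range stays exactly 256 bytes whenever the array itself starts on a 32-byte boundary, for example by adding __attribute__((aligned(32))) to the declaration.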


Saket_Om · ‎2025-06-17

Hello @rob-bits

Please refer to the JPEG example below:

STM32CubeN6/Projects/STM32N6570-DK/Examples/JPEG/JPEG_EncodingFromOSPI_DMA at main · STMicroelectronics/STM32CubeN6 · GitHub


rob-bits · ‎2025-06-18

Hello @Saket_Om

Thanks, I have already tried to adapt the example code to my case. However, in the JPEG_Encode_DMA() function, the MCU blocks are created with pRGBToYCbCr_Convert_Function(), which might call the JPEG_ARGB_MCU_YCbCr422_ConvertBlocks() function. But I have YCrCb data, not RGB. I do not want to do any conversion, just encode it into JPEG. As I understand it, YCrCb is the format that JPEG needs. So please guide me on how to resolve this issue. Do you have an example or tutorial for a YCrCb 4:2:2 input?

Here is the code that you suggested:

uint32_t JPEG_Encode_DMA(JPEG_HandleTypeDef *hjpeg, uint32_t RGBImageBufferAddress, uint32_t RGBImageSize_Bytes, uint32_t *jpgBufferAddress )
{
  pJpegBuffer = jpgBufferAddress;
  uint32_t DataBufferSize = 0;

  /* Reset all Global variables */
  MCU_TotalNb                = 0;
  MCU_BlockIndex             = 0;
  Jpeg_HWEncodingEnd         = 0;
  Output_Is_Paused           = 0;
  Input_Is_Paused            = 0;

  /* Get RGB Info */
  RGB_GetInfo(&Conf);
  JPEG_GetEncodeColorConvertFunc(&Conf, &pRGBToYCbCr_Convert_Function, &MCU_TotalNb);

  /* Clear Output Buffer */
  Jpeg_OUT_BufferTab.DataBufferSize = 0;
  Jpeg_OUT_BufferTab.State = JPEG_BUFFER_EMPTY;

  /* Fill input Buffers */
  RGB_InputImageIndex = 0;
  RGB_InputImageAddress = RGBImageBufferAddress;
  RGB_InputImageSize_Bytes = RGBImageSize_Bytes;
  DataBufferSize= Conf.ImageWidth * MAX_INPUT_LINES * BYTES_PER_PIXEL;

  if(RGB_InputImageIndex < RGB_InputImageSize_Bytes)
  {
    /* Pre-Processing */
    MCU_BlockIndex += pRGBToYCbCr_Convert_Function((uint8_t *)(RGB_InputImageAddress + RGB_InputImageIndex), Jpeg_IN_BufferTab.DataBuffer, 0, DataBufferSize,(uint32_t*)(&Jpeg_IN_BufferTab.DataBufferSize));
    Jpeg_IN_BufferTab.State = JPEG_BUFFER_FULL;

    RGB_InputImageIndex += DataBufferSize;
  }
...

As you can see, it is for RGB images.

Thanks!

Rob
