The output of a relatively large model differs even when given the exact same input

Einstein_rookie_version
Associate III

Hello,

I'm working on my bachelor thesis and found a very interesting phenomenon. I deployed a YOLOv8n from the model zoo on the N6 Nucleo board's NPU (by modifying the template project from the DK board; the input comes over UART with a CRC check to make sure the picture is transferred correctly). The inference itself seems to work. However, when I feed in the same data (the same picture, with the same CRC), the output can be different! Is this expected, or did I break something when adapting the project?

Even when I run the same YOLOv8n on the CPU alone, without the NPU, the output still differs for the same input (the inference itself also seems to work correctly). Is this expected too? Is there any explanation or documentation for this?

 

(The mAP can differ by about 20% when feeding the same 10 pictures.)

 

The same phenomenon also occurs with a MobileNetV2, on both CPU and NPU: the top-5 classes are the same, but the confidences can differ by up to about 20%.

 

Please give me some ideas on whether this is expected or not. If it's a bug, what should I do to fix it? If it's expected, can you point me to some documentation about it?

Thanks a lot in advance.


SlothGrill
ST Employee

Hello @Einstein_rookie_version,

Though very interesting, this behaviour is not expected, as you suspected :)

 

Seemingly random behaviour like this is often caused by issues with the caches present in the system, i.e. bad cache maintenance.

Could you check the behaviour of your code when you disable the MCU data cache? If it works better, you may want to dig into that and ensure that the MCU cache is properly "flushed" (cleaned) before handing the inference over to the NPU. Otherwise the image may sit in the MCU data cache without ever being written to memory, and the NPU does not see the MCU data cache, only the physical memory.

 

This, however, does not fully explain things if you see the same issue with MCU-only inference... How exactly did you run the inference using the MCU only?

Sorry if this sounds silly, but could you double-check that your program does not overwrite the network weights between inferences (e.g. a buffer overflow during the UART transfers), and that the memory used for storing activations is not touched during the inference?

 

Should those two leads be unfruitful, could you please share "minimal" working examples (with the buggy inference) of your two projects (CPU only / with NPU)?

 

Thanks !

Thanks a lot for the reply. I kept exploring over the last few days and can now narrow the cause of this phenomenon down to the D-cache.

For the CPU-only inference, the cause is lines similar to

// SCB_InvalidateDCache_by_Addr(...);
// SCB_CleanDCache_by_Addr(...);

placed before and after the inference. After I deleted these, the randomness was gone.

For the NPU project it is similar: after I stopped using the D-cache, the randomness was gone. But with

// SCB_InvalidateDCache_by_Addr(...);
// SCB_CleanDCache_by_Addr(...);

and the related code (as recommended in the template project for the DK board), it never produces a stable outcome. Deleting these lines also gives random outcomes. Only disabling the D-cache removes the randomness, but it is much slower.
 
Here is the original letter I sent to my professor; hopefully it helps.
 
An update regarding the previous phenomenon. After running more tests, I think I found the main reason. Here are my observations.
 
Related code:

/* Enable ICACHE */
MEMSYSCTL->MSCR |= MEMSYSCTL_MSCR_ICACTIVE_Msk;

/* DCACHE management (disabled) */
// MEMSYSCTL->MSCR |= MEMSYSCTL_MSCR_DCACTIVE_Msk;
// SCB_EnableDCache();

// SCB_InvalidateDCache_by_Addr(...);
// SCB_CleanDCache_by_Addr(...);
 
 
1. YOLOv8n on NPU
(screenshot: Einstein_rookie_version_0-1750165690733.png)
Note: the NPU's own cache appears unrelated; it always gives stable results. The issue seems to stem from the CPU-side data cache (DCACHE).
 
Conclusion: enabling the DCACHE leads to inconsistent NPU output, regardless of whether cache invalidation or cleaning is applied.
 
Personally, I suspect the inconsistency arises because the software epoch controller may rely on the DCACHE for acceleration, while the NPU operates faster than the cache system, potentially leading to race conditions or stale data being used.
 
Without DCACHE: inference is slower but stable.
With DCACHE: faster but results are unpredictable.
 
Interestingly, enabling the D-cache is actually recommended in ST's own template project for the DK board. So I suspect that even ST might not be aware that this can introduce randomness in the inference results.
 
2. YOLOv8n on CPU
Only the D-cache management (cache invalidation and cleaning before and after the inference) affects the output. Specifically:
 
SCB_InvalidateDCache_by_Addr(in_data, ...);
SCB_CleanDCache_by_Addr(in_data, ...);
SCB_InvalidateDCache_by_Addr(out_data, ...);
 
(screenshot: Einstein_rookie_version_1-1750165690735.png)
Note: a HAL_Delay was added after each SCB cache operation to test whether the inconsistency was caused by the clean or invalidate operations not having completed yet. However, that is not the reason.
 
Conclusion: applying the SCB D-cache maintenance calls seems to disrupt the cache contents, interfering with the consistency of the model output. Without these operations, the cache appears to work well and produces consistent results.
3. Other Models
A similar issue was observed with MobileNet, though that model is much smaller; even without the DCACHE, inference time stays low (~2 ms), so the inconsistency matters less there.
 
SlothGrill
ST Employee

Hello,
Thanks for the update.

Could you tell us a bit more about how, when, and what you do for cache maintenance?

Most likely, for an inference using the NPU, the sequence you need is something like:

  1. Retrieve your image from UART and perform your sanity checks (be careful: at this point the image, or part of it, will most likely still be sitting in the cache, not yet written to memory).
  2. CLEAN the cache (i.e. flush it out to memory), but do not invalidate it before cleaning, as your snippets currently do!
  3. Start the inference.

 

"software epoch controller may rely on DCACHE for acceleration"

This is not true: the epoch controller has no access to the CPU D-cache. Data is fetched by DMAs (the NPU's stream engines) that are independent of the CPU.

 

Could you try clean & invalidate (clean before the inference, then invalidate after it), rather than invalidate & clean, and tell us if this fixes your issue?

Thanks a lot for the help, this is really helpful. By only cleaning before the inference and then invalidating after it, the randomness is gone. Thanks again for the excellent solution and explanation!