Memory is corrupted from hUsbDeviceHS.ep_out[1].rem_length (0x2400162c) onwards.

Claegtun · ‎2024-02-08

Hello,

We are using the USB HAL library on our custom board with an STM32H753VIT6. We have an external PHY chip (USB3300) on USB_OTG_HS. At the moment, we are using the USB to receive custom protobuf packets at a bandwidth of around 480 kbps. We have also written some additional overhead for our custom packets.

Everything works mostly fine. However, after a second of running, some of our variables get corrupted. By looking at the build-analyzer in CubeIDE, I found that the RAM, namely the .bss section, was being set to garbage from 0x2400162c onwards. This was hUsbDeviceHS.ep_out[1].rem_length. I am not sure whether this variable is relevant to the problem or it is just a coincidence.

I know that this could come from a memory bug somewhere in our own code, but I am wondering whether this has been seen before. I tried searching online but found nothing. Otherwise, are there any tips specific to STM32 for tracking down the cause?

Thanks,

TDK · ‎2024-02-08

Probably an out of bounds write in your code. You know what gets corrupted, so it should be relatively quick to find the offending code.

Debug your code and set a hardware watchpoint on memory that gets corrupted. The debugger will stop when it's modified. You can look to see where the code is at that point and determine/fix the faulty code.

To set a hardware watchpoint, add the expression -> right click expression -> add watchpoint...

If you feel a post has answered your question, please click "Accept as Solution".

Claegtun · ‎2024-02-09

Thanks TDK.

I had no idea that one could add a watchpoint. I have learnt something new.

After I set an appropriate watchpoint, I found both the line of source and the line of assembly that seem to overwrite the memory at 0x2400162c. However, I am still just as confused, since there seems to be no reason for them to do so.

I understand that this is less a question than a request for guidance, since this is the first time that I have had to track down a memory corruption. Any help is much appreciated.

When the watchpoint is hit, the line of source is:

memcpy(&rx_buffer.data[rx_buffer.back], &Buf[0], *Len);

in the CDC_Receive_HS callback, where rx_buffer.data is a 2048-byte array, rx_buffer.back is 1135, and Buf and Len (*Len is 113) are arguments of the callback. Everything seems to be within bounds. Even the contents of Buf seem to be legitimate bytes of our custom packet.
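The bounds reasoning above (1135 + 113 = 1248, which is below 2048) can be turned into an explicit check in front of the copy. This is only a sketch: the real struct definition is in an attachment that is not reproduced here, so rx_buffer_t, its fields and rx_buffer_push are assumed names, not the actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_BUFFER_SIZE 2048u   /* size of rx_buffer.data, as stated in the post */

/* Hypothetical layout; the real definition is only in an attachment. */
typedef struct {
    uint8_t data[RX_BUFFER_SIZE];
    size_t  back;              /* index of the next free byte */
} rx_buffer_t;

/* Append len bytes at the back of the buffer.
 * Refuses any write that would run past the end of data[] and
 * returns false in that case, leaving the buffer untouched. */
bool rx_buffer_push(rx_buffer_t *rb, const uint8_t *src, size_t len)
{
    assert(rb != NULL && src != NULL);

    /* Written as a subtraction so that back + len cannot wrap around. */
    if (rb->back > RX_BUFFER_SIZE || len > RX_BUFFER_SIZE - rb->back) {
        return false;
    }
    memcpy(&rb->data[rb->back], src, len);
    rb->back += len;
    return true;
}
```

With the values from the watchpoint hit (back at 1135, a 113-byte packet) the check passes, which matches the observation that this particular copy stays inside the array.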

On the other hand, the line of assembly is:

08013662 strb.w r4, [r3, #1]!

where r4 is 0xef and r3 is 0x24002c06; the latter points into our rx_buffer, presumably .data, although it is hard to tell because the IDE's build-analyser does not show the members of structs. This seems to be the storing side of the memcpy, so it also seems correct and within bounds. See the attached screenshot of the locations.

Furthermore, I am fairly sure that this break at the watchpoint is the corruption itself: when I continue the debugger (F8), it keeps hitting the watchpoint, and .rem_length (which is a uint32_t) accumulates bytes in little-endian order (the corruption seems to be byte-wise) until all four bytes are overwritten. It then hits the hard-fault breakpoint, by which point the corruption has continued past .rem_length into everything after it. I even put watchpoints on variables a bit further ahead to confirm this behaviour. So it does not seem to be a legitimate write to the .rem_length member.
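To make the byte-wise, little-endian accumulation concrete, here is a small host-side sketch of what a debugger would show for a uint32_t as stray byte stores reach it one at a time. The function names and the byte values used with it are illustrative assumptions, not data captured from the board.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interpret four bytes the way a little-endian core reads a uint32_t:
 * the byte at the lowest address is the least significant one. */
uint32_t read_u32_le(const uint8_t mem[4])
{
    return (uint32_t)mem[0]
         | ((uint32_t)mem[1] << 8)
         | ((uint32_t)mem[2] << 16)
         | ((uint32_t)mem[3] << 24);
}

/* Simulate an overrun that writes one stray byte per step into a
 * variable that starts out as 0. seen[i] is the value of the variable
 * after the (i+1)-th byte store, i.e. at each watchpoint hit. */
void simulate_overrun(const uint8_t stray[4], uint32_t seen[4])
{
    assert(stray != NULL && seen != NULL);

    uint8_t mem[4] = {0, 0, 0, 0};
    for (int i = 0; i < 4; ++i) {
        mem[i] = stray[i];          /* one strb-style store */
        seen[i] = read_u32_le(mem);
    }
}
```

Feeding it a 0x00 byte followed by 0xa1 gives a value ending in a100 after the second store, which is the kind of pattern described below.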

Interestingly, although it may not be relevant, the first byte of corruption at .rem_length is either 0xa1 or 0x5e, following a variable number of 0x00 bytes. So the variable sometimes gets filled as 0x...a100, 0x...a10000, 0x...a1, etc. I have not played around enough to tell how consistent this is. Furthermore, I am fairly sure that the corrupting bytes are not those of our custom packet, which has a lot of repetition, 0x00s, patterns, etc. These bytes seem to be random (of course not truly) and arbitrary.

As a further aside on the IDE, I noticed that when I scroll upwards in the Disassembly view, the blue arrow disappears and the address next to the strb.w line changes to 08013663. See the other attachment. Does this mean that the culprit is something other than the strb.w line? Also, if anyone is wondering, r1 is 0x24001799, which is in UserRxBufferHS. I wonder whether the push two lines above is the culprit, but the stack pointer is 0x2407fb60, in ._user_heap_stack and well away from anything above.

Any further tips to find the bug?

Thanks,

TDK · ‎2024-02-09

> memcpy(&rx_buffer.data[rx_buffer.back], &Buf[0], *Len);

If the watchpoint gets hit on that line, that's probably the issue.

I would look at the memory location of rx_buffer.data, value of rx_buffer.back, value of *Len. Put those 3 things in the expression window and screenshot it. Also put the address of the location that gets corrupted, hUsbDeviceHS.ep_out[1].rem_length, and perhaps first few values of Buf.

Definition of rx_buffer.data would also be helpful.

Claegtun · ‎2024-02-09

I already have said those things.

> ... rx_buffer.back is 1135, and Buf and Len (*Len is 113) ...

The locations are also already in one of the screenshots, (one ending in _50). The location of memory that first gets corrupted, i.e. hUsbDeviceHS.ep_out[1].rem_length, is 0x2400162c.

The definitions of both the struct and rx_buffer are attached.

TDK · ‎2024-02-10

Hmm, looked at this again but can't figure it out. Feels like we're missing something though. If you're looking at register values, I can't see why a variable in one spot is being modified when the register doesn't point to it.

Please post if you solve it.

Claegtun · ‎2024-02-10

Thank you. That's alright. It is quite perplexing.

I might try a new thread, this time asking more about memcpy since we now know that seems to be the line of trouble. I may ask elsewhere as well. I will definitely let you know when I solve it. Otherwise, I might try alternatives to memcpy.

Thanks,

Claegtun · ‎2024-02-10

I think that I found the issue ... but am still perplexed.

I decided to go ahead and just replace the memcpy with a dumb for-loop. It fixed the initial bug, but the watchpoint triggered again, this time on a completely different memcpy. This one was in a different file that dealt with the outgoing data of the rx_buffer, i.e. the pop() method. At least when the watchpoint was hit, I saw that there was an obvious out-of-bounds bug that caused the memcpy to overflow. At first, I thought that it was a new completely separate bug. (You know how one finds a new unrelated bug after quashing the one before.) So, I fixed it, and everything seemed to work fine.
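For anyone hitting the same kind of overflow, a pop() that clamps the copy length might look like the sketch below. The actual pop() method is not shown in this thread, so the struct layout (a front and a back index into a linear array), the function name and its parameters are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_BUFFER_SIZE 2048u

/* Hypothetical layout; the real struct is only in an attachment. */
typedef struct {
    uint8_t data[RX_BUFFER_SIZE];
    size_t  front;   /* index of the first unread byte */
    size_t  back;    /* index of the next free byte */
} rx_buffer_t;

/* Copy up to `want` unread bytes into dst and return how many were
 * copied. The length is clamped to both the bytes available in the
 * buffer and the capacity of dst, so the memcpy can neither read past
 * the stored data nor write past the end of dst. */
size_t rx_buffer_pop(rx_buffer_t *rb, uint8_t *dst, size_t dst_size, size_t want)
{
    assert(rb != NULL && dst != NULL);

    /* Reject inconsistent indices instead of trusting them. */
    if (rb->back > RX_BUFFER_SIZE || rb->front > rb->back) {
        return 0;
    }

    size_t n = rb->back - rb->front;   /* bytes available */
    if (n > want)     n = want;
    if (n > dst_size) n = dst_size;

    memcpy(dst, &rb->data[rb->front], n);
    rb->front += n;
    return n;
}
```

The design choice here is to return a short count rather than trust the requested length, so a bad length from the caller shows up as a short read instead of as silent corruption of a neighbouring variable.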

However, when I reverted the for-loop back to the first memcpy just to see the original bug again, the bug did not reappear, with exactly the same parameters and everything else unchanged. Reverting the fix for the new bug brought the corruption back, which confirmed that it was the true culprit.

All's well that ends well ... but why on Earth did the watchpoint break at the wrong memcpy? As you can see above, the memcpy that was originally hit by the watchpoint had nothing wrong with it; everything was within bounds. But when I replaced it with a dumb for-loop, the watchpoint suddenly broke at the truly faulty memcpy.

At this point, I guess this is more of an issue with the IDE/debugger unless I am completely missing something.

Either way, thank you so much for your help @TDK . I wouldn't have found it without knowing to add a watchpoint.