STM32F405 random hard, usage, bus or memory fault

s239955_stm1 · ‎2013-12-05

Posted on December 05, 2013 at 12:17

Hey there,

Our problem is that we get random faults (memory, bus or usage)!

The fault handlers are called sporadically, at intervals of anywhere from 10 s to several hours! But we also have PCBs without any issues!

We are using a custom environment with a STM32F405ZGT and an external PSRAM!

On the software side we use the Keil OS without MicroLIB (because we use a C++ library).

We make heavy use of dynamic memory, but only in external RAM! All other RAM accesses go to the internal RAM.

Another interface we use, at a high priority, is SPI2 with DMA!

If we stop in an error handler, we see invalid addresses in the PC (like: 0xE0C001E0; 0x804AED2; 0xFFFFFFFE; 0x46604630 or 0x00000000)

Does anybody have an idea?

#stm32f4 #stm32 #fault

Tesla DeLorean · ‎2013-12-05

Posted on December 05, 2013 at 13:20

Where is your stack situated and how large is it? And how much is getting used?

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

s239955_stm1 · ‎2013-12-05

Posted on December 05, 2013 at 13:52

There are 5 cyclic tasks (5, 10, 10, 20 and 20 ms). Two of them have user-defined stacks of 8 KB and 16 KB; the others have 2000 bytes. In addition there is one task which just waits for events (some events are fired about 5 times a second).

At most 10% of these stacks is used.

We are currently using the stack-check feature of the Keil OS, and no violation was detected.

We already tried to enlarge those stacks, but with no luck.

Do you have any other suggestions?

Tesla DeLorean · ‎2013-12-05

Posted on December 05, 2013 at 18:19

Where, as in which memory space, does the stack reside?

Check the timings of the external memory.

Check the PLL, regulator and flash settings for the speed you are running the processor.

Check the speed the CPU is actually running.

Do you clear your dynamic allocations?

Does the call tree indicate any commonality in the failure?

Does the debugging/telemetry output indicate any commonality?

Do you have a hardware trace pod?

What other postmortem analysis have you done in these situations?

What do the other user registers and processor/core registers indicate at failure?


s239955_stm1 · ‎2013-12-06

Posted on December 06, 2013 at 10:09

>Where, as in which memory space, does the stack reside?

The stacks are located in the internal RAM, as are the DMA buffers.

>Check the timings of the external memory.

>Check the PLL, regulator and flash settings for the speed you are running the processor.

We already made some tests with the wait states of the internal Flash memory. We run the controller at 3.3 V and 168 MHz, so we need at least 5 wait states. We increased this, with no better luck. Enabling and disabling the prefetch and the caches also made no difference. Or do you have a combination which we should try? The current values are: FLASH->ACR=0x00000705 and PWR->CR=0x00001C00.

>Check the speed the CPU is actually running.

We checked the clock and it is stable. We also tried other external oscillators, and we clocked the controller from the internal clock source. Neither had any effect on the failure.

>Do you clear your dynamic allocations?

I don't know what you mean by this point. The heap in the external PSRAM is used exclusively by a static C++ library.

>Does the call tree indicate any commonality in the failure?

>Does the debugging/telemetry output indicate any commonality?

>Do you have a hardware trace pod?

>What other postmortem analysis have you done in these situations?

>What do the other user registers and processor/core registers indicate at failure?

We have been looking for the source of the failure for about two months now, and in that time we have seen every fault the controller provides. Instruction faults, memory faults and many others were reported in the SCB registers. We use the published diagnostic code for the fault interrupts to get the PC and other registers at the time of failure. In most cases the PC contains a different address each time, and there was almost no way to trace back to the faulting address. But there were commonalities, too.

We are currently working on a device (HW and SW) where we can trace. But some of the trace lines are also address lines we use for the external PSRAM. I expect there will be a first result of the post-mortem analysis on Monday.

Regarding trace and post-mortem analysis: do you have any suggestions for what we should look for?

Do you have experience with STM32 systems that operate at nearly full capacity?

I have attached some pictures taken at the time of failure.
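To make the fault bits in the SCB registers easier to compare between runs, the CFSR value can be turned into flag names. The following is a rough sketch, assuming the standard ARMv7-M CFSR bit assignments; `cfsr_describe` and the table are hypothetical names, and on the target the input would be read from `SCB->CFSR`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t mask;
    const char *name;
} cfsr_flag;

/* ARMv7-M Configurable Fault Status Register bits, in ascending order. */
static const cfsr_flag cfsr_flags[] = {
    /* MemManage fault status, bits 0..7 */
    { UINT32_C(1) << 0,  "IACCVIOL" },
    { UINT32_C(1) << 1,  "DACCVIOL" },
    { UINT32_C(1) << 3,  "MUNSTKERR" },
    { UINT32_C(1) << 4,  "MSTKERR" },
    { UINT32_C(1) << 5,  "MLSPERR" },
    { UINT32_C(1) << 7,  "MMARVALID" },
    /* BusFault status, bits 8..15 */
    { UINT32_C(1) << 8,  "IBUSERR" },
    { UINT32_C(1) << 9,  "PRECISERR" },
    { UINT32_C(1) << 10, "IMPRECISERR" },
    { UINT32_C(1) << 11, "UNSTKERR" },
    { UINT32_C(1) << 12, "STKERR" },
    { UINT32_C(1) << 13, "LSPERR" },
    { UINT32_C(1) << 15, "BFARVALID" },
    /* UsageFault status, bits 16..31 */
    { UINT32_C(1) << 16, "UNDEFINSTR" },
    { UINT32_C(1) << 17, "INVSTATE" },
    { UINT32_C(1) << 18, "INVPC" },
    { UINT32_C(1) << 19, "NOCP" },
    { UINT32_C(1) << 24, "UNALIGNED" },
    { UINT32_C(1) << 25, "DIVBYZERO" },
};

/* Writes the names of all set flags into out, separated by spaces.
 * Returns the number of flag names written in full. */
static int cfsr_describe(uint32_t cfsr, char *out, size_t out_len)
{
    int count = 0;
    size_t used = 0;

    assert(out != NULL && out_len > 0);
    out[0] = '\0';
    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0]; i++) {
        if ((cfsr & cfsr_flags[i].mask) == 0)
            continue;
        int n = snprintf(out + used, out_len - used, "%s%s",
                         count ? " " : "", cfsr_flags[i].name);
        if (n < 0 || (size_t)n >= out_len - used)
            break;                      /* buffer full, stop here */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Logging this string together with the PC for every fault makes it easier to see whether the "random" faults actually fall into a few repeating classes.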

________________

Attachments :

Reg.pdf : https://st--c.eu10.content.force.com/sfc/dist/version/download/?oid=00Db0000000YtG6&ids=0680X000006I0ln&d=%2Fa%2F0X0000000bdS%2F5knjtDsfMxrIO0Xy0naIwsquIW.bPc3GHfIUatHy3zI&asPdf=false

frankmeyer9 · ‎2013-12-06

Posted on December 06, 2013 at 10:42

Have you tried to run the system at lower clock frequencies?

Your description doesn't exactly point to a 'systematic' reason.

s239955_stm1 · ‎2013-12-06

Posted on December 06, 2013 at 11:05

We will try to run the system at 130 MHz. Results will follow ASAP.

Do you have more suggestions for how to get to the core of the failure? Maybe some hardware tests we have not yet thought about?

frankmeyer9 · ‎2013-12-06

Posted on December 06, 2013 at 11:25

Actually, I haven't stressed a (STM32) system to the limit yet.

Other thoughts : since the current consumption usually rises with core frequency, you could have issues with the power supply, too.

But, to be honest, I don't have such detailed knowledge of the STM32 core as clive, for instance. However, I would try to lower the core clock below 25 MHz. This is, AFAIK, the 'magic' frequency limit above which Flash wait states and the ART prefetch mechanism are required.

It might not turn out helpful, but the symptoms of your issue don't suggest a clear cause.

Not that it's helpful at this stage, but I would have avoided designing such a heavily loaded system. My experiences with commercial "go with the cheapest core" projects are mostly bad, involving enormous stress in the final project stage, delivery delays, and maintenance issues (say: 'late bug' field returns).

Perhaps a Cortex R or Cortex A5 would have served you better ...

Tesla DeLorean · ‎2013-12-06

Posted on December 06, 2013 at 13:42

Let's just say I have several decades of experience with micro-processors in all types of systems, and specialize in static and dynamic analysis of code behaviour.

As you've been poking at this for months, I'm going to assume you have reviewed the STM32 errata?

I don't know if your problem is hardware or software. I'd be concerned about the external bus, and would have a logic analyser watching it and triggering on the Hard Fault.
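Alongside the logic analyser, a simple software pattern test of the external memory can help separate bus problems from code problems. This is a minimal sketch, not a full memory test: the function name is hypothetical, the test overwrites the region it checks, and the base address of the PSRAM window is board-specific and not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Destructive pattern test over a word-addressed region.
 * Returns the number of words that read back wrong. */
static size_t memtest_words(volatile uint32_t *base, size_t words)
{
    size_t errors = 0;

    assert(base != NULL);

    /* Pass 1: each word holds its own index. Shorted or swapped
     * address lines make two indices alias to the same cell. */
    for (size_t i = 0; i < words; i++)
        base[i] = (uint32_t)i;
    for (size_t i = 0; i < words; i++)
        if (base[i] != (uint32_t)i)
            errors++;

    /* Pass 2: inverted pattern, so every bit of every word has
     * been written and read back in both states. */
    for (size_t i = 0; i < words; i++)
        base[i] = ~(uint32_t)i;
    for (size_t i = 0; i < words; i++)
        if (base[i] != ~(uint32_t)i)
            errors++;

    return errors;
}
```

Running this in a loop for hours on a failing board and on a good board, ideally while SPI2 DMA traffic is active, would show whether the PSRAM interface itself is marginal.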

I'd be very concerned about the stacks and DMA: DMA can trash memory you depend on, and do it silently. I'd certainly look to rule out interactions there, and I might do so by putting stacks and other critical data in the CCM memory (0x10000000 64KB). I would enable stack checking, and I'd instrument malloc/free to catch leaks and double-releases.
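One way the malloc/free instrumentation could look is a thin wrapper that tracks live pointers in a fixed table. This is a sketch under assumptions: `dbg_malloc`, `dbg_free` and the table size are hypothetical, the wrapper is not thread-safe (under an RTOS the table would need a mutex), and a C++ library would have to be routed through it, for example via its allocator hooks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DBG_MAX_LIVE 64   /* hypothetical table size, tune for your system */

static void *dbg_live[DBG_MAX_LIVE];   /* pointers currently allocated */
static unsigned dbg_live_count;        /* non-zero at shutdown means a leak */
static unsigned dbg_bad_free_count;    /* double or invalid releases seen */

/* malloc() that records the returned pointer in the live table. */
static void *dbg_malloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL)
        return NULL;
    for (size_t i = 0; i < DBG_MAX_LIVE; i++) {
        if (dbg_live[i] == NULL) {
            dbg_live[i] = p;
            dbg_live_count++;
            return p;
        }
    }
    free(p);          /* table full: refuse rather than lose track */
    return NULL;
}

/* free() that only releases pointers it knows about.
 * Returns 0 on success, -1 for a pointer that is not live. */
static int dbg_free(void *p)
{
    if (p == NULL)
        return 0;
    for (size_t i = 0; i < DBG_MAX_LIVE; i++) {
        if (dbg_live[i] == p) {
            assert(dbg_live_count > 0);
            dbg_live[i] = NULL;
            dbg_live_count--;
            free(p);
            return 0;
        }
    }
    dbg_bad_free_count++;   /* double release or pointer from elsewhere */
    return -1;
}
```

A release of a pointer that is no longer in the table is counted instead of being passed to free(), so a double release shows up in `dbg_bad_free_count` rather than corrupting the heap.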

I'd want to know if the failures correlate with specific interrupt or transfer conditions.

On clearing allocations: malloc() does not clear memory, and systems with high utilization of dynamically allocated memory may get very dirty memory. Two tests here are to a) zero all allocations (calloc?), and b) fill with very obtuse data that you can recognize in memory/register dumps and that will cause fatal pointer issues where subsequent code fails to initialize completely.
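Test (b) can be sketched as a wrapper that fills every new allocation with a recognizable byte. The names and the 0xA5 fill value are arbitrary choices for illustration; pick a pattern that is not a valid address or a plausible data value in your own memory map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define POISON_BYTE 0xA5u   /* arbitrary, chosen to stand out in dumps */

/* malloc() that fills the new block with POISON_BYTE, so code that
 * reads before initializing sees 0xA5A5A5A5 instead of stale data. */
static void *poison_malloc(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        memset(p, (int)POISON_BYTE, n);
    return p;
}

/* Returns 1 if all n bytes at p still hold the poison value. */
static int is_poisoned(const void *p, size_t n)
{
    const unsigned char *bytes = p;

    assert(p != NULL);
    for (size_t i = 0; i < n; i++)
        if (bytes[i] != POISON_BYTE)
            return 0;
    return 1;
}
```

If 0xA5A5A5A5 (or whatever pattern is chosen) then appears in the PC, LR or a pointer register at fault time, the failure points at uninitialized heap data rather than at the hardware.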

I'll review your notes now...


Tesla DeLorean · ‎2013-12-06

Posted on December 06, 2013 at 14:05

I'd be looking at the code pointed to by LR; there appear to be some commonalities there. You NEED to understand where the faults are coming from, and the route to get there.

In gui:3 I'm concerned by the R1->PC relationship, you need to understand that.

Understand where the 0x46604631 comes from; it could be ASCII, but it is not a memory address.

This is all output from the debugger. Do you have your code generating telemetry from the SWV or USART to understand the flow prior to failure? Do you have fault handler code that outputs statistics about the failure to a terminal, and that can walk or navigate the stack?

Unless you can contain the failure, and understand what your code is doing and how it arrived there, you will be chasing ghosts. HW trace might help if you can afford it, and this is a poster case for its use, but adequate telemetry output is a lot cheaper and you can have multiple systems/people testing it at once.
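The core of such a fault handler is reading the eight words the Cortex-M core pushes on exception entry and printing them. Below is a host-testable sketch of that part only. The names are hypothetical, the flash range assumes a 1 MB part mapped at 0x08000000, and on the target the stack pointer passed in has to be picked from MSP or PSP in a short assembly stub (not shown).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Registers stacked on exception entry, in stack order. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} fault_frame;

/* Assumed flash window for a 1 MB device, adjust for your part. */
#define FLASH_BASE_ADDR UINT32_C(0x08000000)
#define FLASH_SIZE      UINT32_C(0x00100000)

/* Copy the stacked registers out of the exception frame at sp. */
static fault_frame read_fault_frame(const uint32_t *sp)
{
    fault_frame f;

    assert(sp != NULL);
    f.r0 = sp[0];  f.r1 = sp[1];  f.r2 = sp[2];  f.r3 = sp[3];
    f.r12 = sp[4]; f.lr = sp[5];  f.pc = sp[6];  f.xpsr = sp[7];
    return f;
}

/* Cheap plausibility check: does the PC point into flash at all? */
static int pc_in_flash(uint32_t pc)
{
    return pc >= FLASH_BASE_ADDR && (pc - FLASH_BASE_ADDR) < FLASH_SIZE;
}

/* One telemetry line per fault, suitable for a USART or SWV print. */
static int format_fault_frame(const fault_frame *f, char *out, size_t n)
{
    assert(f != NULL && out != NULL);
    return snprintf(out, n, "PC=%08lX LR=%08lX xPSR=%08lX pc_in_flash=%d",
                    (unsigned long)f->pc, (unsigned long)f->lr,
                    (unsigned long)f->xpsr, pc_in_flash(f->pc));
}
```

Of the PC values reported earlier in the thread, only 0x804AED2 passes this check; values such as 0x46604630 or 0xE0C001E0 fail it, which suggests the stacked frame or a return address was already corrupted before the fault was taken.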
