STM32MP157C sporadic page faults / RAM issues

NicoH
Associate

Hi,

we use the STM32MP157C with almost the same RAM layout as the STM32MP157F-EV1, except that the SDRAM chips are from Micron (MT41K256M16). The PMIC is the same as the one on the evaluation board, so its settings are identical. TF-A, OP-TEE, U-Boot and the kernel are all in use.

On almost all boards, we see sporadic page faults ("Bad page state" reports) from the kernel, in about one of 400 warm starts, distributed across the RAM (each entry below is from a different start):

[    0.000000] BUG: Bad page state in process swapper  pfn:d0cfc
[    0.000000] page:(ptrval) refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0xd0cfc
--
[    8.840693] BUG: Bad page state in process tar  pfn:f77ae
[    8.840730] page:769addeb refcount:1 mapcount:0 mapping:00000000 index:0x4 pfn:0xf77ae
--
[    9.664655] BUG: Bad page state in process rc  pfn:f70c8
[    9.664691] page:c59745a8 refcount:1 mapcount:0 mapping:00000000 index:0x4 pfn:0xf70c8
--
[    0.000000] BUG: Bad page state in process swapper  pfn:eace6
[    0.000000] page:(ptrval) refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0xeace6
--
[    7.866187] BUG: Bad page state in process cksum  pfn:f2b28
[    7.866226] page:dc752aeb refcount:1 mapcount:0 mapping:00000000 index:0x1 pfn:0xf2b28
--
[    0.000000] BUG: Bad page state in process swapper  pfn:d23be
[    0.000000] page:(ptrval) refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0xd23be
--
[    0.000000] BUG: Bad page state in process swapper  pfn:c7a44
[    0.000000] page:(ptrval) refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0xc7a44
--
[    6.143085] BUG: Bad page state in process rc  pfn:faf20
[    6.143121] page:8923cfe9 refcount:1073741824 mapcount:0 mapping:00000000 index:0x4 pfn:0xfaf20
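
The PFNs in these reports can be turned into physical addresses by shifting them left by the page shift. The following is a minimal sketch, assuming 4 KiB pages and a DDR base address of 0xC0000000 (both are assumptions, not taken from the logs); the function name `decode_bad_page` is made up for illustration. It also lists which bits are set in the reported refcount, on the understanding that a nonzero refcount on a page being freed or allocated is one of the conditions the kernel reports this way.

```python
import re

PAGE_SHIFT = 12          # assumption: 4 KiB pages
DDR_BASE = 0xC0000000    # assumption: start of DDR in the physical address map

# Matches the "page:... refcount:N ... pfn:0x..." detail line of a report.
DETAIL_RE = re.compile(r"refcount:(\d+).*pfn:0x([0-9a-f]+)")


def decode_bad_page(line):
    """Return (physical address, offset into DDR, set refcount bits) or None."""
    match = DETAIL_RE.search(line)
    if match is None:
        return None
    refcount = int(match.group(1))
    phys = int(match.group(2), 16) << PAGE_SHIFT
    set_bits = [bit for bit in range(32) if (refcount >> bit) & 1]
    return phys, phys - DDR_BASE, set_bits


# Two detail lines copied from the reports above.
SAMPLES = [
    "[    8.840730] page:769addeb refcount:1 mapcount:0 mapping:00000000 index:0x4 pfn:0xf77ae",
    "[    6.143121] page:8923cfe9 refcount:1073741824 mapcount:0 mapping:00000000 index:0x4 pfn:0xfaf20",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        phys, offset, bits = decode_bad_page(sample)
        print(f"phys=0x{phys:08x} ddr_offset=0x{offset:08x} refcount bits set={bits}")
```

In both samples the refcount has exactly one bit set (bit 0 and bit 30), which is at least compatible with single-bit corruption of the page metadata; it does not prove it.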


On two boards, we had CRC errors when loading the FIT image into RAM; after a restart, the same FIT image loaded and booted the system without errors.

One board stands out more than the others: when page faults occur there, many of them are reported in a single startup, distributed across the RAM (the log below is from one startup). Even on this board, they occur in only about one of 400 starts:

[    0.000000] BUG: Bad page state in process swapper  pfn:c27a1
[    0.000000] BUG: Bad page state in process swapper  pfn:c27b1
[    0.000000] BUG: Bad page state in process swapper  pfn:c2d91
[    0.000000] BUG: Bad page state in process swapper  pfn:c3978
[    0.000000] BUG: Bad page state in process swapper  pfn:c5311
[    0.000000] BUG: Bad page state in process swapper  pfn:c5321
[    0.000000] BUG: Bad page state in process swapper  pfn:c9f91
[    0.000000] BUG: Bad page state in process swapper  pfn:d0401
[    0.000000] BUG: Bad page state in process swapper  pfn:d08a1
[    0.000000] BUG: Bad page state in process swapper  pfn:d0de1
[    0.000000] BUG: Bad page state in process swapper  pfn:d1261
[    0.000000] BUG: Bad page state in process swapper  pfn:d1fc1
[    0.000000] BUG: Bad page state in process swapper  pfn:d2248
[    0.000000] BUG: Bad page state in process swapper  pfn:d2251
[    0.000000] BUG: Bad page state in process swapper  pfn:d3601
[    0.000000] BUG: Bad page state in process swapper  pfn:d38a1
[    0.000000] BUG: Bad page state in process swapper  pfn:d4238
[    0.000000] BUG: Bad page state in process swapper  pfn:d4db1
[    0.000000] BUG: Bad page state in process swapper  pfn:d9038
[    0.000000] BUG: Bad page state in process swapper  pfn:df691
[    0.000000] BUG: Bad page state in process swapper  pfn:dfcc1
[    0.000000] BUG: Bad page state in process swapper  pfn:e2351
[    0.000000] BUG: Bad page state in process swapper  pfn:e2511
[    0.076264] BUG: Bad page state in process swapper/0  pfn:f5901
[    0.077817] BUG: Bad page state in process swapper/0  pfn:f8921
[    0.079658] BUG: Bad page state in process swapper/0  pfn:fc241
[    0.080282] BUG: Bad page state in process swapper/0  pfn:fc2b1
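
One way to check whether the PFNs from this single startup are spread evenly or cluster on particular low-order values is to tally their lowest hex digit. This is only a sketch of that tally; the helper name `low_nibble_histogram` is hypothetical, and whether any pattern it shows has a physical meaning is not established here.

```python
from collections import Counter

# PFNs copied from the single-startup log above.
PFNS = """
c27a1 c27b1 c2d91 c3978 c5311 c5321 c9f91 d0401 d08a1 d0de1 d1261 d1fc1
d2248 d2251 d3601 d38a1 d4238 d4db1 d9038 df691 dfcc1 e2351 e2511 f5901
f8921 fc241 fc2b1
""".split()


def low_nibble_histogram(pfns):
    """Count how often each value of the lowest 4 PFN bits occurs."""
    return Counter(int(pfn, 16) & 0xF for pfn in pfns)


if __name__ == "__main__":
    for nibble, count in sorted(low_nibble_histogram(PFNS).items()):
        print(f"low nibble 0x{nibble:x}: {count} of {len(PFNS)} PFNs")
```

A strongly uneven tally would suggest looking at address-dependent effects rather than purely random bit errors.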



We checked the timing settings for the Micron chips in the advanced settings of CubeMX, and they match those of the eval board. The resulting device tree is used by TF-A.
We ran DDRUTIL without noticeable issues, even on the more noticeable board with intensive tests around RAM address 0xc27a1000. Google's 'stressapptest' only shows errors if the kernel has already reported page faults. After restarting the hardware, the errors are gone.
VREF_DDR and DDR_CORE look good, even in the event of an error, at least when measured after the error was detected.

Should we check the register settings for the timings independently of CubeMX? Or is what you set in CubeMX okay?
Does anyone else have any tips or ideas for this problem?
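
Because the fault shows up in only about one of 400 starts, a candidate fix needs many clean starts before it can be trusted. A rough estimate, assuming the starts are independent and the failure rate without a fix really is 1/400 (both are assumptions), is sketched below; `clean_starts_needed` is a name chosen for this example.

```python
import math


def clean_starts_needed(failure_rate, confidence):
    """Consecutive clean starts needed so that, if the failure rate were
    unchanged, seeing no failure at all would have probability of at most
    1 - confidence. Assumes independent starts."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    rate = 1 / 400
    for confidence in (0.90, 0.95, 0.99):
        starts = clean_starts_needed(rate, confidence)
        print(f"{confidence:.0%} confidence: {starts} clean starts")
```

For 95% confidence this comes out at roughly 1200 clean starts, i.e. about three times the observed interval between failures.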

Forgot to mention:
Kernel-Version 6.1.28-rt10
openstlinux-6.1-yocto-mickledore-mp1-v23.06.21

Thanks,

Nico

3 REPLIES
NLE B.1
ST Employee

Hello Nico,

Indeed, your description may be linked to an SDRAM robustness issue. Even if you're close to the STM32MP157F-EV1 layout, Micron and Nanya devices have different internal layouts, which could explain the behavior.

To verify this assumption, it would be useful to evaluate how robust the DDR sub-system on your board is.

We propose running memtester from the Linux console as shown below:

root@stm32mp15-eval-42-45-b0:~# memtester 200M 10
memtester version 4.6.0 (32-bit)
Copyright (C) 2001-2020 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).
pagesize is 4096
pagesizemask is 0xfffff000
want 200MB (209715200 bytes)
got  200MB (209715200 bytes), trying mlock ...locked.
Loop 1/10:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok
 ...

In this example, all the standard test algorithms are executed within an allocated area of 200 MB over 10 loops.

This puts a high level of stress on the DDR, which we can't reach with the STM32DDRFW-UTIL tool.

Depending on the results, we'll have more information about the robustness itself, and also about the conditions under which the errors occur. This could help to tune the DDR settings.

We hope this will help you.

Please keep us informed of the results.

Best Regards

Nicolas LB

NicoH
Associate

Hi Nicolas,

thank you for your answer. Just to keep you informed: we are still investigating the issue. Ideas and thoughts are still welcome :)

We have already created several hardware revisions due to minor design errors, and it appears that the first version of the hardware is not prone to RAM problems. On it, memtester runs 10 loops without error.
But on the newer revisions there are problems:

root@xmaster5:/mnt/data# ./memtester 200M 10
memtester version 4.6.0 (32-bit)
Copyright (C) 2001-2020 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffff000
want 200MB (209715200 bytes)
got  200MB (209715200 bytes), trying mlock ...locked.
Loop 1/10:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : testing   0FAILURE: 0x00000000 != 0x00000001 at offset 0x020b7344.
FAILURE: 0x00000000 != 0x00000001 at offset 0x020b734c.
FAILURE: 0x00000000 != 0x00000001 at offset 0x020b7354.
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : testing  32FAILURE: 0x00000011 != 0x00000010 at offset 0x01ebb04c.
  Walking Ones        : ok
  Walking Zeroes      : testing  23

In addition, we have encountered problems with the PMIC in the newer revisions. On some boards the PMIC supplies only 0.9V on VDDCORE at startup, although according to its registers it should supply 1.2V. ALTERNATE mode should be off; we are still investigating this.

However, there have been no changes in hardware design on these paths, neither to the RAM nor to the PMIC.

Whether PMIC and RAM are related to each other is also unclear to us at the moment.

And last but not least, we are in contact with the manufacturer of the PCB and have now heard that it changed its supplier between the first and newer revisions of the board.

Will keep you informed.

Best regards,

Nico

Hello Nico,

Reading your last message makes us think that the first thing you have to investigate is the PMIC sub-system. Observing 0.9V on VDDCORE is not normal, as you can imagine. This could be a layout issue, but it could also result from too high a voltage accidentally applied to the PMIC supply (the maximum without damage is 6V).

Once this first part is fully understood, it will be time to focus on the DDR. VDDCORE is a pillar of the DDR sub-system; it has to be at the expected value for the DDR sub-system (and its PLL) to work correctly.

Your memtester logs show a weakness on data bit 0: in every reported failure, the two compared values differ only in that bit.
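
The bit-0 observation can be reproduced from the memtester output by XOR-ing the two values in each FAILURE line. The sketch below does this for the four failure lines posted earlier in the thread; the regular expression and the helper name `differing_bits` are illustrative assumptions, not part of memtester.

```python
import re

# Matches memtester lines of the form
# "FAILURE: 0x... != 0x... at offset 0x..."
FAIL_RE = re.compile(
    r"FAILURE: (0x[0-9a-f]+) != (0x[0-9a-f]+) at offset (0x[0-9a-f]+)"
)

# Failure lines copied from the memtester log posted in this thread.
LOG = """\
  Solid Bits          : testing   0FAILURE: 0x00000000 != 0x00000001 at offset 0x020b7344.
FAILURE: 0x00000000 != 0x00000001 at offset 0x020b734c.
FAILURE: 0x00000000 != 0x00000001 at offset 0x020b7354.
  Bit Flip            : testing  32FAILURE: 0x00000011 != 0x00000010 at offset 0x01ebb04c.
"""


def differing_bits(log):
    """Return a list of (offset, [bit positions that differ]) per failure."""
    results = []
    for first, second, offset in FAIL_RE.findall(log):
        diff = int(first, 16) ^ int(second, 16)
        bits = [bit for bit in range(32) if (diff >> bit) & 1]
        results.append((int(offset, 16), bits))
    return results


if __name__ == "__main__":
    for offset, bits in differing_bits(LOG):
        print(f"offset 0x{offset:08x}: differing bits {bits}")
```

Running the same parsing over longer memtester logs would show whether other data bits are affected as well.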

It could be worthwhile to contact the PCB supplier and confirm that the single-ended impedance of the data lines matches the specification, i.e. 55 ohms (+/- 10%).
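
For reference, 55 ohms +/- 10% corresponds to a window of 49.5 to 60.5 ohms. A small sketch for checking values reported by the PCB supplier against that window follows; the function names and the example values are hypothetical.

```python
def impedance_window(nominal_ohm=55.0, tolerance=0.10):
    """Return the (low, high) limits for a nominal impedance and tolerance."""
    delta = nominal_ohm * tolerance
    return nominal_ohm - delta, nominal_ohm + delta


def within_spec(measured_ohm, nominal_ohm=55.0, tolerance=0.10):
    """True if the measured impedance lies inside the tolerance window."""
    low, high = impedance_window(nominal_ohm, tolerance)
    return low <= measured_ohm <= high


if __name__ == "__main__":
    # Hypothetical example values, not measurements from this thread.
    for value in (50.0, 58.0, 63.0):
        verdict = "ok" if within_spec(value) else "out of spec"
        print(f"{value:.1f} ohm: {verdict}")
```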

These are some leads to follow up, with possible further steps, but the path forward also depends on the progress of the different investigations you are currently carrying out.

Yes, please keep us informed.
Best regards

Nicolas