Broken Kernel

  • Dear Support-Team,


    We're using the PicoMOD7A with kernel 3.3.7 and U-Boot 2012.07.


    On three of about 20 devices so far, the kernel became corrupted after a long period (2-3 months) of trouble-free operation. The devices crash and no longer boot. When I connect to the debug port (COM3), I see:

    Code
    NAND read: device 0 offset 0x500000, size 0x300000
    3145728 bytes read: OK
    Starting Kernel...


    and nothing more happens. It seems the kernel is copied into RAM and executed, but contains invalid code, so nothing happens. After I reload the kernel and FFS from SD card, everything works correctly again.


    In the bootargs the kernel partition is marked read-only: "3m(Kernel)ro,-TargetFS". So I would expect that an application running in the buildroot FFS cannot overwrite parts of the kernel.


    Can you give us any advice why this is happening? Where can we start isolating the problem?


    Thanks in advance

  • On which release are your versions based? There were several major improvements in the NAND handling between release V1.1 and V2.0. So if you're still based on V1.1, try updating to V2.0 first. See also this topic. The topic is about armStoneA8, but this is also a board using the MultiPlatform Linux. After the release of V2.0, there were no additional problems reported.


    If you are already on V2.0, then this is a new issue and we have to investigate it further.


    Your F&S Support Team

    F&S Elektronik Systeme GmbH
    As this is an international forum, please try to post in English.
    Da dies ein internationales Forum ist, bitten wir darum, Beiträge möglichst in Englisch zu verfassen.

    U-Boot was compiled from the original sources "u-boot-2012.07-f+s-V2.0", so I think it is already V2.0.


    Unlike the problem on the armStoneA8, I do not get the line "Uncompressing Linux... done, booting the kernel."


    The kernel file size is 2,841,168 bytes (below 3 MB).
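
The size claim can be checked against the partition size that U-Boot prints in the boot log. The following is only a small arithmetic sketch using the two numbers quoted in this thread; the constant and function names are made up for illustration.

```python
# Values quoted in this thread.
KERNEL_PARTITION_SIZE = 0x300000  # "size 0x300000" from the NAND read log
KERNEL_IMAGE_SIZE = 2_841_168     # reported kernel file size in bytes


def fits_in_partition(image_size: int, partition_size: int) -> bool:
    """Return True if the image fits completely into the partition."""
    return image_size <= partition_size


# Unused bytes at the end of the kernel partition.
headroom = KERNEL_PARTITION_SIZE - KERNEL_IMAGE_SIZE

print(fits_in_partition(KERNEL_IMAGE_SIZE, KERNEL_PARTITION_SIZE), headroom)
```

So the image fits, and a truncated kernel is unlikely to be the cause.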


    We also considered ESD problems when touching the housing in which the PicoMOD is mounted. But we double-checked that there is no electrical connection between the housing and the PicoMOD.

    This is strange, because if the image could be loaded from NAND without error, it should be correct. Otherwise the NAND loader should have reported ECC errors. So how can it be that the kernel cannot start? It seems that the decompressor in the zImage fails.


    Do you still have a board where the kernel does not start? Then please try the following:


    Load the stored (malfunctioning) kernel from NAND to some RAM address.

    Code
    nand read <addr1> Kernel


    Then load the original kernel (via TFTP or from USB-Stick/SD-Card) to a different RAM address.

    Code
    tftp <addr2> <orig-kernel>


    Make sure that they don't overlap. Then compare the two with the cmp.b command.

    Code
    cmp.b <addr1> <addr2> $(filesize)


    Does it report any differences?
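
If both images can be made available as files on a host PC, the same comparison can be sketched offline. This is a minimal sketch, not an F&S tool: how the NAND copy gets exported from the board is not covered here, and the function names are hypothetical. Like the $(filesize) argument above, it only compares the length of the shorter input, so padding at the end of the NAND partition is ignored.

```python
from pathlib import Path
from typing import Iterator, Tuple


def diff_images(stored: bytes, original: bytes) -> Iterator[Tuple[int, int, int]]:
    """Yield (offset, stored_byte, original_byte) for every differing byte.

    Only the common length is compared, because zip() stops at the shorter input.
    """
    for offset, (a, b) in enumerate(zip(stored, original)):
        if a != b:
            yield offset, a, b


def diff_files(stored_path: str, original_path: str) -> list:
    """Compare two image files on disk (both paths are supplied by the caller)."""
    return list(diff_images(Path(stored_path).read_bytes(),
                            Path(original_path).read_bytes()))


# In-memory demo with one corrupted byte and trailing padding in the stored copy.
demo_original = bytes([0x00, 0xCA, 0x6C])
demo_stored = bytes([0x00, 0xDA, 0x6C, 0xFF, 0xFF])
for offset, got, want in diff_images(demo_stored, demo_original):
    print(f"offset 0x{offset:06x}: read 0x{got:02X}, expected 0x{want:02X}")
```

Unlike a plain comparison that stops at the first mismatch, this lists every differing byte, which helps when looking for a pattern in the errors.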

    F&S Elektronik Systeme GmbH

  • URGENT!


    Once again we have a device where the kernel does not start (see above).


    I have now checked the kernel, with the following result:


    [IMG:http://www.passtec.de/tmdownload/kernel.png]


    The content of the NAND seems to be the same as the original kernel file.


    If I load the kernel to address 40800000 via TFTP and run bootz, the kernel starts. If I load it from NAND, it does not start. I would not believe it if I had not seen it myself.

    Code
    tftp 40800000 tlkernel
    bootz


    This works.

    Code
    nand read 40800000 Kernel
    bootz


    Here the kernel does not start, although the content seems to be the same.


    What can I do? Urgent help needed! We are getting one device after another back from our customers!


    Thanks in advance

  • Dear support-team,


    We have more and more PicoMOD7 devices with a broken kernel. I have found out a little more about what is wrong with them. They all show 1 to 3 bit errors when the kernel is read back into RAM. These bit errors are distributed randomly over the whole 2.8 MB and differ from device to device. What all of them have in common is that a bit that should be 0 is read as 1. For example, bytes changed from 0xCA to 0xDA, 0x6C to 0x6D, 0x32 to 0x33, 0x73 to 0x77, 0x89 to 0x8B, and so on. The devices work for weeks before this happens. If I reload the kernel, or even just "repair" the broken bytes, the devices run for weeks again. The kernel partition is marked read-only in the U-Boot environment (see above). Is it possible that there is a problem in the NAND ECC or bad block handling? All devices report 0 bad blocks with "nand bad". I'm using "u-boot-2012.07-f+s-V2.0".
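
The pattern in the byte pairs above can be verified with a small sketch that XORs the expected and the observed byte and reports which bit changed and in which direction. The helper name classify_flips is made up for illustration; the byte pairs are the ones quoted in this post.

```python
def classify_flips(expected: int, actual: int) -> list:
    """Return (bit_number, direction) for every bit that differs in one byte."""
    changed = expected ^ actual  # 1 bits mark the positions that differ
    result = []
    for bit in range(8):
        if (changed >> bit) & 1:
            direction = "0->1" if (actual >> bit) & 1 else "1->0"
            result.append((bit, direction))
    return result


# (expected, observed) byte pairs reported in this thread.
OBSERVED = [(0xCA, 0xDA), (0x6C, 0x6D), (0x32, 0x33), (0x73, 0x77), (0x89, 0x8B)]

for expected, actual in OBSERVED:
    print(f"0x{expected:02X} -> 0x{actual:02X}: {classify_flips(expected, actual)}")
```

Each pair differs in exactly one bit, and in every case the bit went from 0 to 1.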


    This is a serious problem for us.


    Thanks for any help.

  • We have finally found the problem. There was a bug in the ECC correction code: it used the wrong variable. Instead of comparing the newly computed ECC with the ECC read from the NAND spare area, it compared the newly computed ECC with itself. This comparison always matched, of course, so errors were never detected. This is fixed now, and bitflips are detected and corrected. The ECC is a 1-bit ECC per 512 bytes of data, so up to 4 bitflips can be corrected within one 2K page, provided that each falls into a different 512-byte section.
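
To illustrate why comparing the computed ECC with itself can never find an error, here is a toy model. It is an assumption-laden sketch and not the actual code from the patch or the ECC algorithm of the NAND controller: toy_ecc is a deliberately simple single-bit-locating code (XOR of the 1-based positions of all set bits), and the compare_to_self flag merely imitates the described mix-up of variables.

```python
CHUNK_SIZE = 512   # bytes covered by one ECC value, as stated above
PAGE_SIZE = 2048   # one 2K page, i.e. PAGE_SIZE // CHUNK_SIZE = 4 ECC chunks


def toy_ecc(chunk: bytes) -> int:
    """Toy ECC: XOR of (bit position + 1) over all set bits of the chunk."""
    ecc = 0
    for byte_index, value in enumerate(chunk):
        for bit in range(8):
            if (value >> bit) & 1:
                ecc ^= byte_index * 8 + bit + 1
    return ecc


def correct_chunk(chunk: bytearray, stored_ecc: int, compare_to_self: bool = False) -> int:
    """Correct at most one flipped bit in place; return the number of fixed bits.

    compare_to_self=True imitates the bug: the reference is the value that was
    just computed, so the syndrome is always zero.
    """
    computed = toy_ecc(chunk)
    reference = computed if compare_to_self else stored_ecc
    syndrome = computed ^ reference
    if syndrome == 0:
        return 0  # chunk looks clean
    position = syndrome - 1
    if position >= len(chunk) * 8:
        raise ValueError("uncorrectable error in this toy model")
    chunk[position // 8] ^= 1 << (position % 8)
    return 1


good = bytes(range(256)) * 2      # 512 bytes of test data
stored = toy_ecc(good)            # what would sit in the spare area
damaged = bytearray(good)
damaged[100] ^= 0x10              # one bit flips from 0 to 1

print(correct_chunk(damaged, stored, compare_to_self=True), bytes(damaged) == good)
print(correct_chunk(damaged, stored), bytes(damaged) == good)
```

With the self-comparison the flipped bit stays in the data unnoticed; with the stored value as reference the same bit is located and flipped back.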


    While looking for this error, we also found another small bug: U-Boot did not use the last two blocks of the NAND flash. This is not very serious, but we fixed it nonetheless.


    These bugs affect PicoMOD7A, NetDCU14 and armStoneA8. The patches are attached to this posting. Download them, then go to the top directory of the U-Boot source tree and apply them with the following commands:


    Code
    patch -p1 < 0001-Fix-NAND-ECC-correction-on-fss5pv210.patch
    patch -p1 < 0002-Do-not-reduce-NAND-size-by-skip-size-on-fss5pv210.patch


    Then rebuild U-Boot with


    Code
    make fss5pv210_config
    make


    The result is the file uboot.nb0 that can be installed to the board in NBoot or U-Boot.


    Your F&S Support Team