Problem with NAND-Flash?

  • Hello all


    We may have an issue with the NAND on the armStoneA8. I flashed a kernel some days ago, and the system booted from NAND without any problems. After some days the system no longer booted with this kernel. The system hangs here:


    Code
    NAND read: device 0 offset 0x500000, size 0x300000
    3145728 bytes read: OK
    ## Booting kernel from zImage at 41000000
    XIP Kernel Image ... OK
    Starting kernel ...
    Uncompressing Linux... done, booting the kernel.


    I saved the kernel from NAND to MMC via the serial console and did a binary compare. One byte differs (214 != 14). Is this a faulty NAND flash?


    Regards,


    Denis

  • Well, I cannot believe this. If there is a single-bit error in the image, the ECC mechanism should correct it. If the error cannot be corrected by ECC anymore (more than one 1-bit error), this should show up as an ECC error when loading the image from NAND. 14 (0x0E) and 214 (0xD6) differ in more than one bit, so this should show up as an ECC error. However, loading your image works fine ("3145728 bytes read: OK").


    Probably your kernel image exceeds the 3 MB limit of the Kernel partition. We had such a situation before. When you download the image, store it to flash and run it immediately, everything is OK, because you still have the full image in RAM. But after a fresh reboot, only the first 3 MB are loaded from NAND to RAM and the remaining part of the image is missing.


    EDIT: Of course these error numbers are meant per NAND page. So there may be up to one 1-bit error per 2048 bytes.
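
    To make the per-page rule concrete, here is a minimal Python sketch that compares two copies of an image on a PC and counts the differing bits per 2048-byte page. The names `bit_errors_per_page` and `classify` are made up for illustration, and the "one bit per page is correctable" model is only the rule stated in the EDIT above, not something read out of a real ECC driver.

```python
PAGE_SIZE = 2048  # NAND page size in bytes, as stated in the EDIT above


def bit_errors_per_page(good: bytes, bad: bytes, page_size: int = PAGE_SIZE) -> dict:
    """Return {page_index: differing_bit_count} for every page that differs.

    Only the common length of both inputs is compared (zip stops at the shorter one).
    """
    errors = {}
    for offset, (a, b) in enumerate(zip(good, bad)):
        diff = a ^ b
        if diff:
            page = offset // page_size
            errors[page] = errors.get(page, 0) + bin(diff).count("1")
    return errors


def classify(bit_errors: int) -> str:
    """Assumed ECC model: one flipped bit per page is correctable, more is not."""
    if bit_errors == 0:
        return "clean"
    return "correctable" if bit_errors == 1 else "uncorrectable"


if __name__ == "__main__":
    good = bytes(3 * PAGE_SIZE)          # synthetic test data, all zeros
    bad = bytearray(good)
    bad[10] ^= 0x80                      # one flipped bit in page 0
    bad[PAGE_SIZE + 5] ^= 0x03           # two flipped bits in page 1
    for page, count in sorted(bit_errors_per_page(good, bytes(bad)).items()):
        print(page, count, classify(count))
```

    With real images, the two byte strings would be the original zImage and the copy read back from NAND.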

    F&S Elektronik Systeme GmbH
    As this is an international forum, please try to post in English.
    Da dies ein internationales Forum ist, bitten wir darum, Beiträge möglichst in Englisch zu verfassen.

  • Good evening


    No error is shown when loading from NAND (see below). The size of the image is 2913728 bytes. I flashed the kernel some days ago and rebooted several times (power cycle, RESET and restart) without any problems. I did not see this issue on the boards delivered some months ago, but I saw it on both boards of the latest delivery. One of them showed this behaviour in its vanilla state. The U-Boot is unmodified; I only flashed the kernel. In the next few days I would like to check the larger part of the NAND for errors and will also check the NAND boot image again.
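
    As a quick sanity check of the size question, a small Python sketch using only the numbers from this thread: the read size from the boot log (0x300000) and the image size given above. The function name `fits_partition` is just for illustration.

```python
KERNEL_READ_SIZE = 0x300000  # bytes U-Boot reads from NAND in the boot log (3145728)
IMAGE_SIZE = 2913728         # zImage size in bytes, as reported in this post


def fits_partition(image_size: int, partition_size: int = KERNEL_READ_SIZE) -> bool:
    """True if the whole image is covered by the NAND read."""
    return image_size <= partition_size


if __name__ == "__main__":
    print(fits_partition(IMAGE_SIZE))        # True
    print(KERNEL_READ_SIZE - IMAGE_SIZE)     # 232000 spare bytes
```

    So the image is completely covered by the 3 MB read, with 232000 bytes to spare.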


    In the past I was too busy to take a closer look at the issue, but I do not remember anything like an ECC error appearing.


    Do you have any suggestions to take into account, or can you see any mistake on my side?


    The binary diff of the images:

    Code
    $ cmp -l -b /tftpboot/zImage /tftpboot/zImage-flash-kaputt
    2799562 14 ^L 214 M-^L


    The complete output on the serial/u-boot:


    ..hangs here..

  • At least there is no apparent error on your side. However, it is really strange. As already said, U-Boot does not show an ECC error, so from U-Boot's point of view everything seems to be fine. The message "Uncompressing Linux... done, booting the kernel." already comes from the decompressor in the zImage. If the corrupted byte were part of the compressed data, uncompressing the image would also fail. But this is not the case either, so the zImage also seems to be OK. Therefore I would say everything is fine. But why does the kernel not start?


    Can you show me your bootcmd? Do you really load to 41000000? Or do you use $(loadaddr)? Loading to 41000000 with 512MB of RAM will result in having only 256MB RAM visible in Linux (see also here). So it's better to load to 31000000. Maybe you've accidentally changed loadaddr and saved the environment after that. Try the following two commands to unset loadaddr:

    Code
    setenv loadaddr
    saveenv


    You can also compare the image within U-Boot. Just load the Kernel from NAND to one RAM position and the original Kernel to some other RAM position and then compare with cmp.b.

    Code
    tftp 30000000 zImage-original
    nand read 31000000 Kernel $(filesize)
    cmp.b 30000000 31000000 $(filesize)


    To verify if this one byte is actually the problem, you could also try to erase just the one NAND block of data with the byte in question in it and rewrite just this block. NAND blocks are 128KB in size on the armStoneA8 (=0x20000). So if this byte is at offset 2799562 in the file (=0x2ab7ca), then this is in the block from offset 0x2a0000 to 0x2bffff. The kernel partition starts at 0x500000. So try this:

    Code
    tftp 30000000 zImage-original
    nand erase 7a0000 20000
    nand write 302a0000 7a0000 20000


    Check again (by reading back the NAND content to 31000000 and cmp with 30000000) if the kernel is now stored correctly. Now reboot again. Is the kernel now starting?
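
    The block arithmetic can be double-checked with a few lines of Python. This is only a sketch under the assumptions stated in this post: 128 KB blocks, a kernel partition starting at NAND offset 0x500000 (as in the boot log), and the original image loaded to RAM at 0x30000000. The function name `block_addresses` is made up for illustration.

```python
BLOCK_SIZE = 0x20000          # 128 KB NAND erase block on the armStoneA8 (from this post)
KERNEL_PART_START = 0x500000  # NAND offset of the kernel partition (from the boot log)
RAM_LOAD_ADDR = 0x30000000    # RAM address the original image was loaded to via tftp


def block_addresses(file_offset: int):
    """Return (block start in file, NAND address, RAM address) for a file offset."""
    block_start = (file_offset // BLOCK_SIZE) * BLOCK_SIZE
    return block_start, KERNEL_PART_START + block_start, RAM_LOAD_ADDR + block_start


if __name__ == "__main__":
    for value in block_addresses(2799562):
        print(hex(value))
    # 0x2a0000
    # 0x7a0000
    # 0x302a0000
```

    Under these assumptions, the block to erase and rewrite sits at NAND address 0x7a0000, and its source data in RAM sits at 0x302a0000.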


  • Thanks for your feedback.


    I was wrong about the Linux cmp tool. Its output is octal, so the two values differ in just one bit. I did a compare with the U-Boot cmp.b (your proposal), and the output shows:


    Code
    cmp.b 30800000 31800000 $(filesize)
    byte at 0x30aab7c9 (0x0c) != byte at 0x31aab7c9 (0x8c)
    Total of 2799561 bytes were the same


    I think this tool stops at the first difference, but I believe only one bit is wrong. This is the same result as some days ago.
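
    The octal mix-up is easy to reproduce in Python. The sketch below reads the two values printed by `cmp -l` once as octal (what cmp actually prints) and once as decimal (how they were first read in this thread), and counts the differing bits in each case. The helper name `differing_bits` is just for illustration.

```python
def differing_bits(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# cmp -l prints the differing byte values in octal
good, bad = int("14", 8), int("214", 8)
print(hex(good), hex(bad), differing_bits(good, bad))  # 0xc 0x8c 1

# the same digits misread as decimal numbers
print(hex(14), hex(214), differing_bits(14, 214))      # 0xe 0xd6 4
```

    Read as octal, the values match the 0x0c and 0x8c reported by cmp.b, and only bit 7 differs.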


    After that I did an erase/write of the broken block and the system starts as expected:



    I want to do a full check of the NAND content next. So far a simple 'nand read ..' shows no errors for the whole NAND (124 MByte read).


    Some questions remain:


    -Why did U-Boot not show any ECC error when loading the broken kernel image?
    -How do we deal with this kind of issue? We cannot accept system-does-not-boot/NAND issues like this in our products.
    -What happens if the U-Boot/NBoot loader itself is hit by a NAND issue like this?


    I am thinking about implementing a boot fallback in U-Boot. This may fix a lot of these issues, but it is not enough.
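
    To illustrate the idea, here is a rough Python sketch of the selection logic such a fallback could use: keep two kernel copies plus a stored CRC32 for each, and boot the first copy whose checksum still matches. Everything here is hypothetical (the function name `pick_bootable`, the two-copy layout, the use of CRC32); it is not an existing U-Boot feature and says nothing about how it would be implemented in the boot loader.

```python
import zlib


def pick_bootable(images, expected_crcs):
    """Return the index of the first image whose CRC32 matches, or None.

    images: list of bytes objects (hypothetical primary and fallback copies)
    expected_crcs: CRC32 values recorded when the images were flashed
    """
    for index, (data, crc) in enumerate(zip(images, expected_crcs)):
        if zlib.crc32(data) & 0xFFFFFFFF == crc:
            return index
    return None


if __name__ == "__main__":
    original = b"kernel image placeholder"   # stand-in data, not a real zImage
    crc = zlib.crc32(original) & 0xFFFFFFFF
    corrupted = bytearray(original)
    corrupted[3] ^= 0x80                      # single flipped bit, as in this thread
    print(pick_bootable([bytes(corrupted), original], [crc, crc]))
```

    Unlike the ECC, such a check runs over the whole image, so it would catch a flipped bit even when the NAND read itself reports no error.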

  • We will release a V2.0 rather soon. In that release U-Boot has received quite a lot of changes, also in the ECC handling. We will do some more tests, and we will explicitly provoke some bit errors and check the ECC behavior again. I agree, it is not acceptable that U-Boot loads something from flash that contains an error but does not report it.


  • OK, quite some time has gone by, but we actually have found a problem with the ECC computation now. This will affect armStoneA8, NetDCU14 and PicoMOD7A. Please see this posting in the PicoMOD7A section with the patches and how they are applied.


    Your F&S Support Team
