Problem with NAND-Flash?

nostromo · Oct 22nd 2012

Hello all

We may have an issue with the NAND on the armStoneA8. I flashed a kernel a few days ago, and the system booted from NAND without any problems. After some days the system no longer booted with this kernel. It hangs here:

Code

NAND read: device 0 offset 0x500000, size 0x300000
3145728 bytes read: OK
## Booting kernel from zImage at 41000000
XIP Kernel Image ... OK
Starting kernel ...
Uncompressing Linux... done, booting the kernel.

I saved the kernel from NAND to MMC via the serial console and did a binary compare. One byte differs (214 != 14). Is this a faulty NAND flash?

Regards,

Denis

fs-support_HK · Oct 22nd 2012

Well, I find this hard to believe. If there is a single bit error in the image, the ECC mechanism should correct it. If the error cannot be corrected by ECC anymore (more than one bit error), this should show up as an ECC error when trying to load the image from NAND. 14 (0x0E) and 214 (0xD6) differ in more than one bit, so this should show up as an ECC error. However, loading your image works fine ("3145728 bytes read: OK").

Probably your kernel image exceeds the 3MB limit of the Kernel partition. We had such a situation before. When you download the image, store it to flash and run it immediately, everything is OK, because you still have the full image in RAM. But when you start after a fresh reboot, only the first 3MB are loaded from NAND to RAM and the remaining part of the image is missing.

EDIT: Of course these error numbers are meant per NAND page. So there may be up to one 1-bit error per 2048 bytes.
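
The partition-size hypothesis above is easy to rule in or out on the host before flashing. The following is only a minimal sketch: the 3MB limit is taken from the "size 0x300000" in the U-Boot log and the 2048-byte page size from the EDIT above, while the function names are made up for illustration.

```python
KERNEL_PARTITION_SIZE = 0x300000  # 3 MiB, from "size 0x300000" in the U-Boot log
NAND_PAGE_SIZE = 2048             # bytes per NAND page, as stated in the EDIT above


def fits_in_partition(image_size: int, partition_size: int = KERNEL_PARTITION_SIZE) -> bool:
    """Return True if an image of image_size bytes fits into the partition."""
    return image_size <= partition_size


def pages_spanned(image_size: int, page_size: int = NAND_PAGE_SIZE) -> int:
    """Number of NAND pages the image occupies (rounded up)."""
    return -(-image_size // page_size)


if __name__ == "__main__":
    size = 3_500_000  # hypothetical image size in bytes, for illustration only
    print(fits_in_partition(size))  # False: larger than 0x300000
    print(pages_spanned(size))
```

With a real image, the size would come from something like os.path.getsize() on the zImage file. Since ECC works per page, each of the spanned pages can tolerate at most one bit error.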

nostromo · Oct 23rd 2012

Good evening

There is no error shown when loading from NAND (see below). The size of the image is 2913728 bytes. I flashed the kernel some days earlier and rebooted several times (power cycle, RESET and restart) without any problems. I did not see this issue on the boards delivered some months ago. I saw it on both boards of the latest delivery. One of them showed this behaviour in its vanilla state. The U-Boot is unmodified; I just flashed the kernel. In the next days I would like to check the larger part of the NAND for errors and will also check the NAND boot image again.

In the past I was too busy to take a closer look at the issue, but I do not remember anything like an ECC error appearing.

Do you have any suggestions, or can you see any mistake on my side?

The binary diff of the images:

Code

$ cmp -l -b /tftpboot/zImage /tftpboot/zImage-flash-kaputt
2799562 14 ^L 214 M-^L

The complete output on the serial/u-boot:

Code

U-Boot 2011.12 (May 22 2012 - 20:46:29) for F&S
CPU: S5PV210@1000MHz
Board: armStoneA8 Rev 1.20 (4x DRAM, 2x LAN, 1000 MHz)
DRAM: 512 MiB
WARNING: Caches not enabled
NAND: 128 MiB
MMC: SAMSUNG SD/MMC: 0
In: serial
Out: serial
Err: serial
Net: AX88796-0, AX88796-1
Hit any key to stop autoboot: 0
---- Trying autoload from mmc0:1 ----
reading update.scr
Failed!
---- Trying autoload from usb0:1 ----
USB: Register 1111 NbrPorts 1
USB EHCI 1.00
scanning bus for devices... 1 USB Device(s) found
scanning bus for storage devices... 0 Storage Device(s) found
Failed!
---- No autoload script found ----
NAND read: device 0 offset 0x500000, size 0x300000
3145728 bytes read: OK
## Booting kernel from zImage at 41000000
XIP Kernel Image ... OK
Starting kernel ...
Uncompressing Linux... done, booting the kernel.

..hangs here..

fs-support_HK · Oct 24th 2012

At least there is no apparent error on your side. However, it is really strange. As already said, U-Boot does not show an ECC error, so from U-Boot's point of view everything seems to be fine. The message "Uncompressing Linux... done, booting the kernel." is already printed by the decompressor in the zImage. If the corrupted byte were part of the compressed data, uncompressing the image would also fail. But this is not the case either, so the zImage also seems to be OK. Therefore I would say everything is fine. But why does the kernel not start?

Can you show me your bootcmd? Do you really load to 41000000? Or do you use $(loadaddr)? Loading to 41000000 with 512MB of RAM will result in having only 256MB RAM visible in Linux (see also here). So it's better to load to 31000000. Maybe you've accidentally changed loadaddr and saved the environment after that. Try the following two commands to unset loadaddr:

Code

setenv loadaddr
saveenv

You can also compare the image within U-Boot. Just load the Kernel from NAND to one RAM position and the original Kernel to some other RAM position and then compare with cmp.b.

Code

tftp 30000000 zImage-original
nand read 31000000 Kernel $(filesize)
cmp.b 30000000 31000000 $(filesize)

To verify if this one byte is actually the problem, you could also try to erase just the one NAND block of data with the byte in question in it and rewrite just this block. NAND blocks are 128KB in size on the armStoneA8 (=0x20000). So if this byte is at offset 2799562 in the file (=0x2ab7ca), then this is in the block from offset 0x2a0000 to 0x2bffff. The kernel partition starts at 0x500000. So try this:

Code

tftp 30000000 zImage-original
nand erase 7a0000 20000
nand write 302a0000 7a0000 20000

Check again (by reading back the NAND content to 31000000 and cmp with 30000000) if the kernel is now stored correctly. Now reboot again. Is the kernel now starting?
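
The address arithmetic behind these commands can be double-checked on the host. This is just a sketch under the assumptions stated in the post: 128KB blocks, the kernel partition starting at 0x500000, and the original image loaded to 30000000 in RAM. The absolute NAND address is the partition start plus the block offset within the image. The function name is made up for illustration.

```python
BLOCK_SIZE = 0x20000          # 128 KB NAND erase block on the armStoneA8
KERNEL_PART_START = 0x500000  # NAND offset where the kernel partition starts
RAM_BASE = 0x30000000         # RAM address the original image was loaded to via tftp


def block_addresses(file_offset: int) -> tuple[int, int, int]:
    """Return (block offset in image, NAND address, RAM address) for the
    erase block that contains file_offset."""
    # BLOCK_SIZE is a power of two, so masking rounds down to the block start.
    block_off = file_offset & ~(BLOCK_SIZE - 1)
    return block_off, KERNEL_PART_START + block_off, RAM_BASE + block_off


if __name__ == "__main__":
    block_off, nand_addr, ram_addr = block_addresses(2799562)
    print(f"block in image: 0x{block_off:x}")                       # 0x2a0000
    print(f"nand erase {nand_addr:x} {BLOCK_SIZE:x}")               # nand erase 7a0000 20000
    print(f"nand write {ram_addr:x} {nand_addr:x} {BLOCK_SIZE:x}")  # nand write 302a0000 7a0000 20000
```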

nostromo · Oct 31st 2012

Thanks for your feedback.

I was wrong about the output of the Linux cmp tool. The output is octal, so the two values differ in just one bit. I did a compare with the U-Boot cmp.b (your proposal), and the output shows:

Code

cmp.b 30800000 31800000 $(filesize)
byte at 0x30aab7c9 (0x0c) != byte at 0x31aab7c9 (0x8c)
Total of 2799561 bytes were the same

I think this tool stops at the first difference, but I believe there is just one bit wrong. This is the same as some days before.
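
The octal mix-up and the one-bit difference can be reproduced on the host with a few lines of Python. This is only an illustrative sketch; the helper name is made up. It also shows that the byte number printed by "cmp -l" is counted from 1, so it matches the 0-based offset reported by cmp.b in U-Boot.

```python
def differing_bits(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # "cmp -l" prints byte values in octal: 14 and 214.
    good = int("14", 8)
    bad = int("214", 8)
    print(hex(good), hex(bad))        # 0xc 0x8c, as shown by cmp.b in U-Boot
    print(differing_bits(good, bad))  # 1: a single flipped bit (0x80)

    # Misread as decimal, the same numbers would differ in several bits.
    print(differing_bits(14, 214))    # 4

    # "cmp -l" counts bytes from 1, U-Boot addresses are 0-based.
    print(hex(0x30800000 + 2799562 - 1))  # 0x30aab7c9
```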

After that I did an erase/write of the broken block, and the system started as expected:

Code

armStoneA8 # nand erase 7a0000 20000
NAND erase: device 0 offset 0x7a0000, size 0x20000
Erasing at 0x7a0000 -- 100% complete.
OK
armStoneA8 # nand write 302a0000 7a0000 20000
NAND write: device 0 offset 0x7a0000, size 0x20000
131072 bytes written: OK
armStoneA8 # nand read 32800000 500000 $(filesize)
NAND read: device 0 offset 0x500000, size 0x2c75c0
2913728 bytes read: OK
armStoneA8 # cmp.b 30000000 32800000 $(filesize)
Total of 2913728 bytes were the same

I want to do a full check of the NAND content next time. So far a simple 'nand read ..' shows no errors for the whole NAND (124 MByte read).

Some questions remain:

-Why did U-Boot not show any ECC error when loading the broken kernel image?
-How should we deal with this kind of issue? We can't accept systems that do not boot because of NAND issues like that in our products.
-What happens if the U-Boot/nBoot loader itself gets a NAND issue like this?

I am thinking about implementing a boot fallback in U-Boot. This may fix a lot of these issues, but that's not enough.

fs-support_HK · Oct 31st 2012

We will release a V2.0 rather soon. In that release U-Boot got quite a lot of changes, also in the ECC handling. We will do some more tests, explicitly provoke some bit errors and check the ECC behavior again. I agree, it is not acceptable that U-Boot loads something from flash that contains an error but does not show an error.

fs-support_HK · Jul 18th 2014

OK, quite some time has gone by, but we actually have found a problem with the ECC computation now. This will affect armStoneA8, NetDCU14 and PicoMOD7A. Please see this posting in the PicoMOD7A section with the patches and how they are applied.

Your F&S Support Team
