UBIFS "Corrupt empty space": Handling of 0-bits in empty NAND pages

  • New findings have shown that the ECC engine of NXP SoCs does not fix 0-bits in empty pages. In rare cases, this can lead to problems with UBIFS on boards based on i.MX6 CPUs (architectures fsimx6, fsimx6ul, fsimx6sx).


    Background


    When data is read from a NAND flash page, error correction is performed by the ECC engine of the SoC, called BCH. When it has finished, the BCH engine returns a status code. If there were bit errors, they are corrected and the status is the number of corrected errors; if there were none, the status is zero. The resulting payload data is always correct unless there were more errors than the ECC can handle, in which case the status is "uncorrectable". A third status, "erased", is returned if the payload data and the ECC consist of 1-bits only. This is the case when a page has been erased and is completely empty.
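
    The status handling described above can be sketched roughly as follows. This is only an illustration: the names classify_bch_status and chunk_state are hypothetical, and the numeric values for "uncorrectable" and "erased" are an assumption modeled on the constants commonly seen in the mainline gpmi-nand driver, so please check them against the reference manual of your SoC.

```c
#include <stdint.h>

/* ASSUMPTION: status encoding modeled on the mainline gpmi-nand driver.
 * Verify against the BCH chapter of your SoC reference manual. */
#define BCH_STATUS_GOOD          0x00u
#define BCH_STATUS_UNCORRECTABLE 0xfeu
#define BCH_STATUS_ERASED        0xffu

/* Hypothetical classification of one ECC chunk. */
enum chunk_state {
    CHUNK_CLEAN,         /* no bit errors were found */
    CHUNK_CORRECTED,     /* bit errors were found and corrected */
    CHUNK_UNCORRECTABLE, /* more errors than the ECC can handle */
    CHUNK_ERASED         /* payload and ECC contain 1-bits only */
};

/* Hypothetical helper: map a raw status byte to a chunk state.
 * *bitflips receives the number of corrected errors (0 otherwise). */
enum chunk_state classify_bch_status(uint8_t status, unsigned *bitflips)
{
    *bitflips = 0;

    switch (status) {
    case BCH_STATUS_GOOD:
        return CHUNK_CLEAN;
    case BCH_STATUS_UNCORRECTABLE:
        return CHUNK_UNCORRECTABLE;
    case BCH_STATUS_ERASED:
        return CHUNK_ERASED;
    default:
        /* Any other value is taken as the count of corrected errors. */
        *bitflips = status;
        return CHUNK_CORRECTED;
    }
}
```

    Note that only the states "clean" and "corrected" guarantee that the payload buffer holds the intended data.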


    In rare cases, bits in an empty page may flip to zero. Such a page would be read as "uncorrectable", even though it is typically no problem to write data to this page nonetheless. For that, the BCH engine has the option to allow for a small number of 0-bits in an otherwise empty page and still return status "erased".
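
    The threshold idea can be pictured in software as counting the 0-bits of a page and comparing the count with a small limit. The following is a minimal sketch of that idea only; the real check is done in hardware by the BCH engine, and the function names and the threshold parameter are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Count the 0-bits in a buffer (hypothetical helper). */
size_t count_zero_bits(const uint8_t *buf, size_t len)
{
    size_t zeros = 0;

    for (size_t i = 0; i < len; i++) {
        /* Invert the byte so that every 0-bit becomes a 1-bit. */
        uint8_t inverted = (uint8_t)~buf[i];

        while (inverted) {
            /* Clear the lowest set bit and count it. */
            inverted &= (uint8_t)(inverted - 1);
            zeros++;
        }
    }
    return zeros;
}

/* A page still counts as "erased" if it has at most `threshold` 0-bits.
 * ASSUMPTION: `threshold` stands in for the limit configured in the
 * BCH engine; its actual value is not given here. */
bool counts_as_erased(const uint8_t *buf, size_t len, size_t threshold)
{
    return count_zero_bits(buf, len) <= threshold;
}
```

    With a threshold of 2, for example, a page with two flipped bits is still reported as erased, while a page with three flipped bits is not.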

    The gpmi-nand-fus.c driver in Linux (and the mxs_nand_fus.c driver in U-Boot) makes use of this option. However, the driver assumed that the BCH engine also corrects the payload data in this case, i.e. that all bytes read are 0xff if the status is "erased", even if the page itself contained 0-bits. Recent analysis has shown that this assumption was wrong: the resulting payload data still contains the 0-bits, if there are any. No correction is done.


    For regular pages, this is no problem at all, because typical write operations do not read a page before writing to it. But UBIFS uses some pages for administration purposes and checks whether the region it wants to use is really empty. If UBIFS sees such 0-bits, it complains with "Corrupt empty space" and mounts the affected volumes read-only. Fixing this issue requires manual corrections.
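
    The kind of check that trips over a flipped bit can be illustrated as follows. This is not the UBIFS code itself, just a minimal sketch of the principle: every byte of the supposedly empty region must be 0xff. The function name find_corrupt_empty_space is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scan the region [start, len) of a buffer that is
 * expected to be empty. Returns the offset of the first byte that is
 * not 0xff, or -1 if the whole region is empty. */
long find_corrupt_empty_space(const uint8_t *buf, size_t start, size_t len)
{
    for (size_t i = start; i < len; i++) {
        if (buf[i] != 0xff)
            return (long)i;
    }
    return -1;
}
```

    A single flipped bit, e.g. a byte reading 0xfb instead of 0xff, is enough for such a check to fail.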


    When does it happen?


    - An erased page was untouched for a very long time (the cell charge is reduced)

    - A write happens in a nearby page

    - The page now has at least one bit that has flipped from 1 to 0

    - The page is one of two pages that are used by UBIFS for administrative purposes

    - The flipped bit is in the region UBIFS wants to write to now

    - UBIFS checks for 0-bits in empty pages


    Solution


    The patches below fix the driver in Linux (and U-Boot) so that it behaves as if the ECC engine had corrected the 0-bits to 1, i.e. the driver now returns 0xff for all bytes of a page that has status "erased". You only need to fix U-Boot as well if you are loading a boot file (e.g. kernel image, device tree) from a UBIFS volume (e.g. the rootfs). Select the appropriate patch for your Linux or U-Boot version and apply it by going to the top directory of the Linux or U-Boot source tree and calling


        patch -p1 < patch-name


    where you replace "patch-name" with the name of your patch.
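
    For illustration, the effect of the fix described in this section can be pictured as a small fixup step after each read. This is a hedged sketch and not the code of the actual patches; the function name fixup_erased_chunk is hypothetical, and the status value 0xff for "erased" is an assumption that should be checked against the reference manual of your SoC.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: status value reported by the BCH engine for "erased". */
#define FIXUP_STATUS_ERASED 0xffu

/* Hypothetical post-read fixup: if the chunk was reported as "erased",
 * overwrite the payload with 0xff so that stray 0-bits are hidden from
 * upper layers. Returns 1 if the payload was rewritten, 0 otherwise. */
int fixup_erased_chunk(uint8_t status, uint8_t *payload, size_t len)
{
    if (status != FIXUP_STATUS_ERASED)
        return 0;

    memset(payload, 0xff, len);
    return 1;
}
```

    Chunks with any other status are left untouched, because their payload has either been corrected by the ECC engine already or cannot be recovered at all.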