Memory management problem after upgrading from BuildRoot V1.2 to BuildRoot V2.0

  • Hello to all of you, especially support guys.


    We are using QBlissA9 modules in our project. Because it’s a long-term project, we started with QBliss module HW revision 1.3, equipped with BuildRoot V1.2. So I set up BuildRoot and compiled our project applications against this version. No problem here.
    After a while, module revision 1.31 arrived, equipped with BuildRoot V2.0. So I set up BuildRoot again and recompiled all applications. No big problem here.
    But now we need to unify all SW versions, because it’s not really possible to support both versions of the system. Especially since we had a big issue with the


    UBIFS error (pid 1): ubifs_recover_master_node: failed to recover master node


    error during boot when we use BuildRoot V1.2. Though this problem still persists, it’s not the point of my request.


    To unify the SW version, I took an older module, originally with BuildRoot V1.2, upgraded NBoot and UBoot to the V2.0 versions according to the instructions, and uploaded the V2.0 kernel and rootfs. The device boots up, but after a while messages like


    Pagealloc : memory_corruption


    are reported, and more and more of them follow. After that it’s usually a matter of minutes before the device freezes and the watchdog triggers a reboot. I saw this behavior on many modules originally equipped with BuildRoot V1.2.


    So here’s my point:


    Is there any difference between the older modules ( revision 1.3 ), which were equipped with BuildRoot ( and kernel ) V1.2, and modules of revision 1.31, equipped with BuildRoot ( and kernel ) V2.0, which could cause memory management problems when V1.2 is upgraded to V2.0? After the upgrade to V2.0 the kernel and rootfs are identical, but the system is stable on modules which were originally equipped with V2.0 ( usually revision 1.31 ) and unstable on modules which were equipped with V1.2 ( usually revision 1.3 ) and upgraded manually to V2.0.


    I found the application ( I guess ) whose execution starts the memory mess ( four instances of mplayer ). It worked on the original BuildRoot V1.2 and it works on modules shipped with BuildRoot V2.0, but modules originally equipped with V1.2 and then upgraded to the same BuildRoot V2.0 are the ones that trouble me. I recompiled mplayer and the other applications properly, of course, many times.


    Now I’m out of ideas what to do or in which direction to look. Maybe I’m just missing something.


    So, any ideas ?


    I know it’s quite long, but …


    Thanks and have a nice day.


  • There seems to be a problem with the RAM settings. These are set in NBoot. So my first suggestion would be to use the newest NBoot, which is VN27. Log in to our website and go to QBlissA9 -> NBoot and download "nbootimx6_27.bin". Then replace NBoot on the board.


    Your F&S Support Team

    F&S Elektronik Systeme GmbH
    As this is an international forum, please try to post in English.
    Da dies ein internationales Forum ist, bitten wir darum, Beiträge möglichst in Englisch zu verfassen.

  • Thanks guys for the update. I have to admit that I missed the release of the newer NBoot; I was using NBoot VN25 from BuildRoot V2.0 and V2.1. So I upgraded NBoot to VN27, cleared the flash, uploaded UBoot and then uploaded the kernel and rootfs via USB. But the behavior seems to be the same :(.
    Any other ideas what to check or try? I’m slowly running out of ideas.


    Thanks anyway, appreciate any response.

  • Dear Mobius,
    changes between Rev 1.30 and Rev 1.31 are
    - fix a solder shape for a non mounted external RTC (we need this for one customer)
    - fix the USB device (was powering the CPU board through VBUS)
    - changing the production panel size


    None of these changes has any influence on SW.
    We already shipped >800 boards with this new revision.
    Do you have this problem with just a single board or with multiple boards?
    If it is just a single board, please send it back through our RMA process for further investigation.


    regards
    fs-support_KW


  • Hello guys,


    Sorry for the delay in my response.


    By now I have over 15 boards, tested with the upgrade and the same software, all with the same result. Last week I brought in two devices with the older SW version, upgraded them and let them run. The one with revision 1.31 seemed to be stable; the second one, with revision 1.3, worked for only a few minutes before memory corruption messages and a watchdog reboot.


    A colleague in the workshop is testing another 15 boards, so I’m not sure yet how many of them show the same malfunction.


    At this moment four boards should be on their way to you for tests or checks.


    After yesterday’s result I had to stop the upgrade preparation and look for another way to solve this. So if you have any advice on what to try, test or do, any ideas are welcome.


    Thanks and have a nice day.

  • In the past days we were busy trying to search for possible causes for the problem. Maybe we have found something. Can you try the attached patch (on top of fsimx6-V2.0 or fsimx6-V2.1)? Go to the top directory of the Linux source and apply the modification as follows:


    patch -p1 < 0001-Use-F-S-DDR3-settings.patch


    Then recompile the kernel and download it to the board. Does this remove the memory corruption errors?
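
    A more cautious way to apply the patch can be sketched as follows. This is only an illustration, assuming GNU patch on the build host; the helper name apply_checked is hypothetical. It first does a dry run with -N, so an already applied or non-matching patch is rejected before any file is touched.

```shell
#!/bin/sh
# apply_checked PATCHFILE: hypothetical helper, run from the top directory
# of the Linux source. It only applies the patch if a dry run succeeds.
apply_checked() {
    # -N (--forward) refuses patches that look reversed or already applied,
    # --dry-run checks every hunk without modifying any file.
    if patch -p1 -N --dry-run < "$1" >/dev/null; then
        patch -p1 -N < "$1"
    else
        echo "patch does not apply cleanly: $1" >&2
        return 1
    fi
}
```

    Usage would then be: apply_checked 0001-Use-F-S-DDR3-settings.patch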


    Your F&S Support Team


  • Hi guys,
    thanks for the update. I’m afraid applying the patch to the FS2.0 kernel does not solve the problem. But it does make sense, so I’ll keep it in.


    The patch testing configuration was NBoot VN27, UBoot FS2.0 / UBoot FS2.1 with kernel FS2.0, and the original binary of kernel FS2.1. Module revision 1.3.


    In the meantime we did some simple tests. Whatever causes these troubles, it seems to be independent of the NAND. Running the rootfs from an mmc card ( and the kernel too last week, but now I wasn’t able to repeat this, huh ) showed the very same behavior.


    By disabling all our applications we were able to confirm that the memory corruption can be triggered only by a running mplayer instance which plays a live video stream from an IP camera. The original BuildRoot MPlayer behaves the same way as our own build of MPlayer.


    Simply streaming into a file ( using ffmpeg ) does not cause any memory corruption or instability, and neither does playing a video file with MPlayer. Based on this I would dig into mplayer ( and Live555, which it uses ), but on some modules it takes hours until the system crashes completely, on some only minutes ( like the one currently tested ), and on some there is absolutely no problem ( with the same images ).


    Looking for some other stress tool ( or a memory tester other than the one in NBoot ) I found this,
    quoting:
    All the really nasty errors are dynamical errors, which usually happen not with simple read or write cycles on the bus, but when performing back-to-back data transfers in burst mode. Such accesses usually happen only for certain DMA operations, or for heavy cache use (instruction fetching, cache flushing).


    The appearance of "BUG kmemleak_object: Poison overwritten" messages indicating corruption of slab pages, like
    Object 0xba8dfa80: 6b 6b 6b 6b 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkk..kkkkkkkkkk
    while receiving and playing the video stream, makes me believe that something can go wrong under system stress. So we will try to run some stress tests and let you know if we find anything suspicious.
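
    To see at a glance which bytes were overwritten in such a dump, a small filter can help. The following is only a sketch, assuming the "Object 0x...:" line format shown above and that freed slab memory is poisoned with 0x6b; the function name find_overwritten is made up for this example.

```shell
#!/bin/sh
# find_overwritten: hypothetical helper (not part of any kernel tool).
# Reads kernel log text on stdin and prints every byte in "Object 0x...:"
# dump lines that differs from 0x6b, the slab free-poison value.
find_overwritten() {
    awk '
    {
        # Locate the "Object" keyword; log lines may carry a timestamp prefix.
        for (k = 1; k < NF; k++)
            if ($k == "Object" && $(k + 1) ~ /^0x/) break
        if (k >= NF) next

        addr = $(k + 1)
        sub(/:$/, "", addr)

        # Hex bytes follow the address; stop at the ASCII column.
        for (i = k + 2; i <= NF; i++) {
            if ($i !~ /^[0-9a-fA-F][0-9a-fA-F]$/) break
            if (tolower($i) != "6b")
                printf "%s+%d: %s\n", addr, i - (k + 2), $i
        }
    }'
}
```

    Fed with the line quoted above ( for example via dmesg | find_overwritten ), it reports offsets +4 and +5 as 00. Note that the last byte of a poisoned object is normally 0xa5 instead of 0x6b, so that byte would be listed as well.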


    If you have any advice on where to look or what to check, any other unreleased patches, or if we could send you something that would help, let us know :) .


    Thanks a lot anyway and have a nice day.

  • I'm not sure whether mplayer uses the Video Processing Unit (VPU) hardware acceleration of the i.MX6. If yes, then switching to fsimx6-V2.1 may also be a good option. In this release, we included a completely different (newer) version of the GPU/VPU binary code that is compatible with the version usually used together with i.MX6 kernel 3.10, so if there was an error in the older version, chances are good that it was found and fixed in the version that we use in fsimx6-V2.1.


    If mplayer is not using hardware acceleration, I would strongly recommend switching to a different player. For example when using gstreamer, you can play Full-HD videos with 30 fps, scaled down to your display size and you only have a CPU load of about 10 to 20% because everything works in hardware (VPU).


    By the way, behind the scenes we are also already working hard on porting the regular i.MX6 boards to a newer kernel (at least 3.14.52) and there again an even newer version of the GPU/VPU libraries will be used.


    Your F&S Support Team


  • Yes, you’re right.
    MPlayer itself doesn’t use HW acceleration; its video canvas is rendered purely in software. It’s not able to play high resolutions in high quality, but it’s easy to manage an mplayer instance as a process.


    The board has to render four independent streams from IP cameras ( each with a resolution of around 400x200 ) and switch the layout according to input events. So it was easy to maintain one MPlayer instance for each visible camera.


    I did experiments with gstreamer too, but found an issue when four views were combined on one screen: the display refreshed only at the rate of the first view, so the other views were jerky. But that was quite a while ago, too.


    Right now gstreamer doesn’t cause instability; I saw memory corruption only once, and the stress tool does not trigger any bad allocation either.


    OK, I’ll try to find out what I can do before a full migration to the newer version FS2.1.
    But if you find anything – I’ll be here :) .

  • Hi guys,


    So after a while I’m back. Following your recommendation, I migrated to the new version FSV2.1.


    After recompiling and testing I found a few things.


    On the newer QBliss boards it works very well. It runs smoother, boots faster and looks stable. A few devices ran for two days without crashing. That’s on boards which came as board revision 1.31, equipped with imx6 revision 1.5.
    On the other hand, the older ones equipped with imx6 revision 1.2 seem to crash once in a while with the same software ( and a proper upgrade to NBoot VN27 and UBoot 2.1 ). The crash messages usually start with


    Unable to handle kernel NULL pointer dereference at virtual address 00000000 ( or some other very low address ),


    but the stack trace looks different almost every time. Sometimes one core just freezes, without any debug message.


    I did some digging around and tried some of the errata workarounds available in the kernel, but without noticeable improvement. The behavior is the same with the defconfig versions of the configuration.


    Everything I saw led me to suspect an issue with the mapping of physical pages to virtual memory ( in conjunction with the MMU ), with the calculation of the address space, or with the floating point unit on early revisions of the imx6. We previously used a different toolchain with softfloat, if I remember correctly.


    So, are you guys aware of any changes between revisions which could explain why everything works well on the fairly new revisions ( of board and imx6 ) while all of the older ones crash? Or of anything else to try? I’m out of ideas what to reasonably try or where to look. On my own I found some errata descriptions, but they don’t help me much.


    We have to decide what to do during the upcoming week, because the situation looks unsolvable.


    Thanks, appreciate any advice

  • Here is an update on this issue.


    We have received your modules, have done several tests, and can confirm your issue. In the last days we have done many, many more tests and finally it seems we have found the source of the problem. When you used the fsimx6-V1.x version, you still used the old NBoot (< VN20). Now with fsimx6-V2.x, you needed to switch to NBoot VN25 or newer because of the new NAND flash data layout. These two NBoots use different settings for the drive strength on the DDR RAM signals. In the new version the drive strength is lower than in the old one. Of course we are trying to reduce the drive strength of most of the signals as much as possible, as this has positive effects on the EMC (electromagnetic compatibility). But it seems that now the drive strength is slightly too low for some batches of RAM, and this causes sporadic read errors, especially when RAM load is very high, i.e. when there are many RAM accesses.


    We will prepare a new NBoot where the drive strength is increased again. Of course we have to do some tests to verify that this does not cause new problems on our other i.MX6 boards and modules, but I am rather confident that we can release this NBoot soon. Then you can update NBoot and everything should be fine again.


    Your F&S Support Team


  • Hi to all of you guys,


    Thanks very much for the response.


    As we did some tests ( mostly just functional tests with our application ), we were able to check which boards will probably work and which won’t.


    Comparing board releases and i.MX chip series led us to the conclusion that there is no direct relation between the crashes and one “type” of chip, nor with the board revision or the memory chip type. We can only note that older boards ( originally with FS1.2 SW ) have a higher probability of trouble ( after the upgrade to FSV2.1 ).
    We also found a few new boards ( with FSV2.0 at origin, according to serial and imx type numbers ) which freeze applications or crash after a few hours of testing. But the older ones usually crash after minutes.


    The second remarkable thing is a simple RAM test with the built-in memtester. Even though it can only test the free part of RAM available to user space, it shows a difference in module behavior. Some modules show OK status during all tests, cycles and reboots. These run in tests for 1 – 2 days without even a sign of a problem.
    Other ones usually show many failures when comparing RAM values. Every cycle or test on the same device shows a different offset address where the failure was detected. On these modules the system crashes with memory allocation errors, NULL pointer issues, invalid instructions, segmentation faults and so on.
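
    The observation about changing offsets can be checked across saved logs with a small script. This is a rough sketch, assuming memtester reports failures as lines like "FAILURE: 0x... != 0x... at offset 0x..." and that its output ( including stderr ) was saved to files; the function name summarize_failures and the log file names are made up.

```shell
#!/bin/sh
# summarize_failures LOGFILE...: hypothetical helper.
# Prints each failing offset found in the given memtester logs together
# with the number of times it occurred, sorted by offset.
summarize_failures() {
    grep -h 'FAILURE:' "$@" |
        sed -n 's/.*at offset \(0x[0-9a-fA-F]*\).*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

    For example: summarize_failures run1.log run2.log. If the same offset kept coming back, a defective memory cell would be the more likely suspect; offsets scattered all over the place, as we see, point more towards a signal or timing problem.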


    So if you say that there could be a RAM powering issue, it does make sense to me. While digging through this, my guess was something with frequency or the MMU, but I didn’t find any patch or anything else directly mentioning it.


    I don’t believe that there could be so many boards with corrupted memory chips, and if you are able to fix this with a modified NBoot, that’s very good news. That’s the bright side.


    I don’t want to sound ungrateful, but after next week we have to upgrade all devices ( over four hundred ) according to a customer order, which means I have literally one week to prepare and fix all issues.


    So, is there any chance of getting a fix during next week? It doesn’t have to be an official version, it can even be untested ( we can test on our boards ). I’m aware that it’s not easy to do this.


    I’ll take anything, anything, and if it’s true about the powering, my slightly desperate situation could be expressed in a classic quote


    Help me Obi-Wan, you’re my only hope :shock:


    I appreciate your effort, thanks so much, and I hope everything goes well.