[galcore]: Stop driver to keep scene.


  • Board: ArmStoneA9r2 rev1.20

    CPU: IMX6-DualLite

    Build: fsimx6B2020.04

    Hi,

    I'm running a Qt application with default platform settings (EGLFS). After a few hours, the application UI freezes, but the application itself is still running.
    The output of dmesg contains the error [galcore]: Stop driver to keep scene and a memory dump. At that point, the free memory reported by the free command has dropped to 4 MB.

    Help me solve this issue.


    Unfortunately, we cannot decode this error message either. The display driver, including the GPU part, is only partly available as source code in the Linux kernel; a big part of it is closed source and only available as binary libraries in the root filesystem.


    But the hint about the free memory sounds interesting. Do you still have the output of the free command from this case? Is it really only 4 MB of free RAM? Sometimes the output is misleading, as most free memory is used for the buffer cache to speed up file accesses. But this cache can still be counted as free memory: when memory is needed, the least recently used pages from the buffer cache are simply thrown away and can be used for the new memory request.
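    For reference, the relevant figure is not the plain "free" column but the "available" value (or the "-/+ buffers/cache" line on older procps versions), because the buffer cache is reclaimed on demand. A quick way to check, for example:

    Code
    free -m
    grep -E 'MemFree|MemAvailable|Buffers|^Cached' /proc/meminfo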


    Your F&S Support Team

    F&S Elektronik Systeme GmbH
    As this is an international forum, please try to post in English.

    As I already assumed, 65 MB of memory is used by the buffer cache, and 62 MB of it is still available as free memory. So I would say the system was not out of memory when this happened, at least not out of regular memory. There are other cases where memory may still have been short. For example, there is the CMA region (Contiguous Memory Allocator). Normal memory is allocated using the paging mechanism: an arbitrary page in memory can be used and is given a virtual address, so that it can be part of a larger contiguous memory region, allocated by malloc() for example. But it is only contiguous in the virtual address space. In the physical address space it can be arbitrarily fragmented; theoretically each page can be located somewhere else in physical memory and still the virtual addresses form one contiguous region.


    The CMA region is different. Here the physical address region also has to be contiguous. This is needed when the region is accessed by DMA, because the DMA engine only sees physical addresses. So for example a (frame) buffer that is shown on the display needs to be a contiguous memory region in the physical address space, too, so that the display controller can send the data to the display via DMA. If the CMA region is exhausted or very fragmented, the display driver may not be able to allocate any large enough buffers anymore and will get into such a non-standard state. But this is just a guess.
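    A quick first check is possible without debugfs: if CMA is enabled in the kernel, /proc/meminfo reports the total and currently free size of the CMA region (fields CmaTotal and CmaFree, in kB):

    Code
    grep -i cma /proc/meminfo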


    Maybe the CMA debugfs can be useful to prove or disprove this.


    https://www.kernel.org/doc/htm…guide/mm/cma_debugfs.html


    If allocating (and freeing) a large block via debugfs works on a freshly started system, but fails on a system that has been running for many hours, then this could be an indication that the CMA is running out of contiguous memory.
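    A minimal sketch of such a test (assuming CONFIG_CMA_DEBUGFS is enabled and debugfs is mounted; the area directory name, cma-0 here, may differ, and the alloc/free files take a number of pages):

    Code
    mount -t debugfs none /sys/kernel/debug 2>/dev/null
    cat /sys/kernel/debug/cma/cma-0/count             # total pages in this CMA area
    cat /sys/kernel/debug/cma/cma-0/used              # pages currently allocated
    echo 8192 > /sys/kernel/debug/cma/cma-0/alloc     # try to allocate 8192 pages (32 MB)
    echo 8192 > /sys/kernel/debug/cma/cma-0/free      # free them again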


    If this is actually the case, it may be your application or Qt that is responsible. Probably there is a memory leak, i.e. some buffer that is allocated from time to time but never freed, causing the CMA pool to become more and more fragmented, so that at some point it is no longer possible to allocate the rather large contiguous memory block required by the display driver.
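    To catch such a slow leak, it can help to log the free CMA amount periodically over a long run and look for a downward trend (a simple sketch; the log file path is just an example):

    Code
    while true; do
        echo "$(date) $(grep CmaFree /proc/meminfo)"
        sleep 60
    done >> /var/log/cma-free.log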


    Your F&S Support Team

    F&S Elektronik Systeme GmbH
    As this is an international forum, please try to post in English.

  • SoM: "model": "F&S i.MX6 Solo/DualLite efusA9 module"

    Galcore version reported as 6.2.4.150331
    Kernel version from F&S: 4.9.88 (without any git history)

    Hi,
    We are also experiencing issues with [galcore]: Stop driver to keep scene.
    (occurring approximately 1-16 times per week)
    Please see dmesg.txt for a typical output of this error.

    Based on the idea that this problem might be related to memory issues, we did some investigation.
    I found this thread:
    https://community.nxp.com/t5/i…ry-Allocation/td-p/718999
    which questions the CMA size, so we checked that.

    In the DTS (<tpro200.dts>) it seems to be correct:

    CMA size > contiguousSize
    320 MB > 33 MB
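    For reference, the corresponding values can also be read from the running system (the galcore contiguousSize is given in bytes, the CMA size appears in /proc/meminfo in kB):

    Code
    cat /sys/module/galcore/parameters/contiguousSize
    grep CmaTotal /proc/meminfo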

    We also did some testing on the devices with memtester and free (stuck devices always had enough memory), which did not lead to any more deterministic behavior.

    Therefore we did some monitoring via Munin (3 weeks) to investigate what might be related to this behavior. Based on our observations, neither temperature, CPU usage, load, number of processes, nor memory usage seems to be connected to this issue.

    However, part of the driver is open source, so the error message can be traced in the code. The message:
    [galcore]: Stop driver to keep scene.

    seems to be unique and is located inside the driver module; it is printed when the driver does not have the "recovery" functionality active (it is part of the gckKERNEL_Recovery() function).

    So when some generic GPU error occurs, the driver prints this error message and stops.

    From the available code we also found that there is a recovery mode which is responsible for "Trying to recover the GPU from a fatal error".

    But this functionality was not active by default in our case.
    Therefore, as a workaround, we modified U-Boot to pass additional kernel parameters (changing them at runtime did not work):

    Code
    setenv extra "galcore.recovery=1 galcore.stuckDump=0"
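    Whether the parameters actually reach the driver can be verified on the booted system, for example:

    Code
    cat /proc/cmdline                              # should contain galcore.recovery=1 galcore.stuckDump=0
    cat /sys/module/galcore/parameters/recovery    # should read 1
    cat /sys/module/galcore/parameters/stuckDump   # should read 0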


    With this modification we were able to:

    • Reduce the log output (memory dump) to a single message:

      [galcore]: GPU[%d] hang, automatic recovery.

      via parameter stuckDump=0
    • Avoid stopping the driver and force the GPU to recover:

      via parameter recovery=1

    This resulted in an instant recovery, at least from the kernel's point of view.

    In reality, as could be observed from the interrupt logs and on the screen itself, the recovery took ~30 min, which was still not acceptable. We found that Weston is actually responsible for the observed delay.

    Therefore we made another workaround script which does the following (see the sketch after this list):
    1) Detect the new event in the kernel log:
    <timestamp> [galcore] : recovery done

    2) Restart Weston and the application:
    systemctl kill -s KILL weston
    systemctl start weston
    app restart
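
    A minimal sketch of such a watcher (assuming a dmesg that supports the -w/--follow option and systemd units named weston and app; adapt the unit names and the application restart command to the actual system):

    Code
    #!/bin/sh
    # Follow the kernel log and restart the compositor once the GPU recovery is done
    dmesg -w | while read -r line; do
        case "$line" in
            *"[galcore] : recovery done"*)
                systemctl kill -s KILL weston
                systemctl start weston
                systemctl restart app    # placeholder for the real application unit
                ;;
        esac
    done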


    So with these workarounds we were able to keep the device working somehow, without the necessity to reboot every time the GPU gets stuck.
    But such a workaround is still not acceptable, and the root cause of why the GPU tends to get into this faulty state should be found and solved.

    Our galcore.showArgs:


    Content of the GPU driver's parameters folder, with other knobs to "fiddle" with:

    Code
    root@tpro200:~# ls /sys/module/galcore/parameters/
    baseAddress     externalSize       irqLineVG    powerManagement    registerMemSize    type
    chipIDs         fastClear          irqs         recovery           registerMemSize2D
    compression     gpuProfiler        logFileSize  registerBases      registerMemSizeVG
    contiguousBase  initgpu3DMinClock  major        registerMemBase    registerSizes
    contiguousSize  irqLine            mmu          registerMemBase2D  showArgs
    externalBase    irqLine2D          physSize     registerMemBaseVG  stuckDump
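    The current values of all readable parameters can be dumped in one go, for example (some entries may not be world-readable):

    Code
    for f in /sys/module/galcore/parameters/*; do
        printf '%s: %s\n' "$f" "$(cat "$f" 2>/dev/null)"
    done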

    Content of DTs:

    arch/arm/boot/dts/imx6dl.dtsi

    arch/arm/boot/dts/imx6qdl.dtsi



    While doing some research, we found that the compatible string for vivante,gc is present in the kernel documentation only from version 5.7, while our kernel is 4.9.88.

    Documentation/devicetree/bindings/gpu/vivante,gc.yaml

    https://elixir.bootlin.com/lin…dings/gpu/vivante,gc.yaml

    Is there a plan on the F&S side to update the kernel and the Vivante driver version to possibly eliminate this issue? Or could you please get more information from NXP about what might be wrong?

  • Hello,


    The compatible string that is used for the proprietary Vivante GPU driver should be


    "fsl,imx6dl-gpu" from arch/arm/boot/dts/imx6dl.dtsi ->gpu@00130000


    From my understanding the nodes

    gpu_3d: gpu@00130000 and gpu_2d: gpu@00134000 are not used at all with the vivante gpu driver.
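    Which compatible strings are actually present in the running device tree can be checked directly on the device, for example (the node paths may differ between BSP versions):

    Code
    for n in /proc/device-tree/soc/*gpu*; do
        echo "$n:"
        tr '\0' '\n' < "$n/compatible"
    done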


    We are currently working on a new release with kernel version 5.4.70 and Vivante GPU driver version 6.4.3.p1.4.

    We will try to contact NXP to find out if similar errors are known.


    Your F&S Support Team

    Hello radim.pavlik@tbs-biometrics.com,


    I have gone through your post in this thread, and after setting the parameter recovery=1, the application recovery took around 11 minutes. Have you found any solution/workaround for the galcore problem?