[galcore]: Stop driver to keep scene.


  • Board: ArmStoneA9r2 rev1.20

    CPU: IMX6-DualLite

    Build: fsimx6B2020.04

    Hi,

    I'm running a Qt application with default platform settings (EGLFS). After a few hours, the application UI freezes, but the application itself is still running.
    The output of dmesg contains the error [galcore]: Stop driver to keep scene and a memory dump. At that point, the free memory reported by the free command has dropped to 4 MB.

    Help me solve this issue.


    Unfortunately, we cannot decode this error message either. The display driver, including the GPU part, is only partly available as source code in the Linux kernel; a big part of it is closed source and only available as binary libraries in the root filesystem.


    But the hint about the free memory sounds interesting. Do you still have the output of the free command from this case? Is it really only 4 MB of free RAM? Sometimes the output is misleading, as most free memory is used for the buffer cache to speed up file accesses. But this cache can still be counted as free memory: when memory is needed, the least recently used pages from the buffer cache are simply thrown away and can be used for the new memory request.
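    For reference, the relevant figure is not the plain "free" column but the "available" value (or the "-/+ buffers/cache" line on older procps versions), because the buffer cache is reclaimed on demand. A quick way to check, for example:

    Code
    free -m
    grep -E 'MemFree|MemAvailable|Buffers|^Cached' /proc/meminfo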


    Your F&S Support Team

    F&S Elektronik Systeme GmbH
    As this is an international forum, please try to post in English.

    As I already assumed, 65 MB of memory is used by the buffer cache, and 62 MB of it is still available as free memory. So I would say the system was not out of memory when this happened, at least not out of regular memory. There are other cases where memory may still have been short. For example, there is the CMA region (Contiguous Memory Allocator). Normal memory is allocated using the paging mechanism: an arbitrary page in memory can be used and is given a virtual address, so that it can be part of a larger contiguous memory region, allocated by malloc() for example. But it is only contiguous in the virtual address space. In the physical address space it can be arbitrarily fragmented; theoretically each page can be located somewhere else in physical memory and still the virtual addresses form one contiguous region.


    The CMA region is different. Here the physical address region also has to be contiguous. This is needed when the region is accessed by DMA, because the DMA engine only sees physical addresses. So for example a (frame) buffer that is shown on the display needs to be a contiguous memory region in the physical address space, too, so that the display controller can send the data to the display via DMA. If the CMA region is exhausted or very fragmented, the display driver may not be able to allocate any large enough buffers anymore and will get into such a non-standard state. But this is just a guess.
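    A quick first check is possible without debugfs: if CMA is enabled in the kernel, /proc/meminfo reports the total and currently free size of the CMA region (fields CmaTotal and CmaFree, in kB):

    Code
    grep -i cma /proc/meminfo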


    Maybe the CMA debugfs can be useful to prove or disprove this.


    https://www.kernel.org/doc/htm…guide/mm/cma_debugfs.html


    If allocating (and freeing) a large block via debugfs works on a freshly started system, but fails on a system that has been running for many hours, then this could be an indication that the CMA is running out of contiguous memory.
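    A minimal sketch of such a test (assuming CONFIG_CMA_DEBUGFS is enabled and debugfs is mounted; the area directory name, cma-0 here, may differ, and the alloc/free files take a number of pages):

    Code
    mount -t debugfs none /sys/kernel/debug 2>/dev/null
    cat /sys/kernel/debug/cma/cma-0/count             # total pages in this CMA area
    cat /sys/kernel/debug/cma/cma-0/used              # pages currently allocated
    echo 8192 > /sys/kernel/debug/cma/cma-0/alloc     # try to allocate 8192 pages (32 MB)
    echo 8192 > /sys/kernel/debug/cma/cma-0/free      # free them again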


    If this is actually the case, it may be your application or Qt that is responsible. Probably there is a memory leak, i.e. some buffer that is allocated from time to time but never freed, causing the CMA pool to become more and more fragmented, so that at some point it is no longer possible to allocate the rather large contiguous memory block required by the display driver.
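    To catch such a slow leak, it can help to log the free CMA amount periodically over a long run and look for a downward trend (a simple sketch; the log file path is just an example):

    Code
    while true; do
        echo "$(date) $(grep CmaFree /proc/meminfo)"
        sleep 60
    done >> /var/log/cma-free.log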


    Your F&S Support Team

    F&S Elektronik Systeme GmbH
    As this is an international forum, please try to post in English.

  • SoM: "model": "F&S i.MX6 Solo/DualLite efusA9 module"

    Galcore version reported as 6.2.4.150331
    Kernel version from F&S: 4.9.88 (without any git history)

    Hi,
    We are also experiencing issues with [galcore]: Stop driver to keep scene.
    (occurring approximately 1-16 times per week)
    Please see dmesg.txt for a typical output of this error.

    Based on the idea that this problem might be related to memory issues, we did some investigation.
    I found this thread:
    https://community.nxp.com/t5/i…ry-Allocation/td-p/718999
    which questions the CMA size, so we checked that.

    In the DTS (<tpro200.dts>) it seems to be correct:

    CMA size > contiguousSize
    320 MB > 33 MB
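    For reference, the corresponding values can also be read from the running system (the galcore contiguousSize is given in bytes, the CMA size appears in /proc/meminfo in kB):

    Code
    cat /sys/module/galcore/parameters/contiguousSize
    grep CmaTotal /proc/meminfo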

    We also did some testing on the devices with memtester and free (stuck devices always had enough memory), which did not lead to any more deterministic behavior.

    Therefore we did some monitoring via Munin (3 weeks) to investigate what might be related to this behavior. Based on our observations, neither temperature, CPU usage, load, number of processes, nor memory usage seems to be connected to this issue.

    However, part of the driver is open source, so the error message can be traced in the code. The message:
    [galcore]: Stop driver to keep scene.

    seems to be unique and is located inside the driver module; it is printed when the driver does not have the "recovery" functionality active (it is part of the gckKERNEL_Recovery() function).

    So when some generic GPU error occurs, the driver prints this error message and stops.

    From the available code we also found that there is a recovery mode which is responsible for "Trying to recover the GPU from a fatal error".

    But this functionality was not active by default in our case.
    Therefore, as a workaround, we modified U-Boot to pass additional kernel parameters (changing them at runtime did not work):

    Code
    setenv extra "galcore.recovery=1 galcore.stuckDump=0"
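    Whether the parameters actually reach the driver can be verified on the booted system, for example:

    Code
    cat /proc/cmdline                              # should contain galcore.recovery=1 galcore.stuckDump=0
    cat /sys/module/galcore/parameters/recovery    # should read 1
    cat /sys/module/galcore/parameters/stuckDump   # should read 0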


    With this modification we were able to:

    • Reduce the log output (memory dump) to a single message:

      [galcore]: GPU[%d] hang, automatic recovery.

      via parameter stuckDump=0
    • Avoid stopping the driver and force the GPU to recover:

      via parameter recovery=1

    This resulted in an instant recovery, at least from the kernel's point of view.

    In reality, as could be observed from the interrupt logs and on the screen itself, the recovery took ~30 min, which was still not acceptable. We found that Weston is actually responsible for the observed delay.

    Therefore we made another workaround script which does the following (see the sketch after this list):
    1) Detect the new event in the kernel log:
    <timestamp> [galcore] : recovery done

    2) Restart Weston and the application:
    systemctl kill -s KILL weston
    systemctl start weston
    app restart
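
    A minimal sketch of such a watcher (assuming a dmesg that supports the -w/--follow option and systemd units named weston and app; adapt the unit names and the application restart command to the actual system):

    Code
    #!/bin/sh
    # Follow the kernel log and restart the compositor once the GPU recovery is done
    dmesg -w | while read -r line; do
        case "$line" in
            *"[galcore] : recovery done"*)
                systemctl kill -s KILL weston
                systemctl start weston
                systemctl restart app    # placeholder for the real application unit
                ;;
        esac
    done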


    So with these workarounds we were able to keep the device working somehow, without the necessity to reboot every time the GPU gets stuck.
    But such a workaround is still not acceptable, and the root cause of why the GPU tends to get into this faulty state should be found and solved.

    Our galcore.showArgs:


    Content of the GPU driver's parameters folder, with other knobs to "fiddle" with:

    Code
    root@tpro200:~# ls /sys/module/galcore/parameters/
    baseAddress     externalSize       irqLineVG    powerManagement    registerMemSize    type
    chipIDs         fastClear          irqs         recovery           registerMemSize2D
    compression     gpuProfiler        logFileSize  registerBases      registerMemSizeVG
    contiguousBase  initgpu3DMinClock  major        registerMemBase    registerSizes
    contiguousSize  irqLine            mmu          registerMemBase2D  showArgs
    externalBase    irqLine2D          physSize     registerMemBaseVG  stuckDump
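    The current values of all readable parameters can be dumped in one go, for example (some entries may not be world-readable):

    Code
    for f in /sys/module/galcore/parameters/*; do
        printf '%s: %s\n' "$f" "$(cat "$f" 2>/dev/null)"
    done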

    Content of DTs:

    arch/arm/boot/dts/imx6dl.dtsi

    arch/arm/boot/dts/imx6qdl.dtsi



    While doing some research, we found that the compatible string for vivante,gc is present in the kernel documentation only from version 5.7, while our kernel is 4.9.88.

    Documentation/devicetree/bindings/gpu/vivante,gc.yaml

    https://elixir.bootlin.com/lin…dings/gpu/vivante,gc.yaml

    Is there a plan on the F&S side to update the kernel and the Vivante driver version to possibly eliminate this issue? Or could you please get more information from NXP about what might be wrong?

  • Hello,


    The compatible string that is used for the proprietary Vivante GPU driver should be


    "fsl,imx6dl-gpu" from arch/arm/boot/dts/imx6dl.dtsi ->gpu@00130000


    From my understanding the nodes

    gpu_3d: gpu@00130000 and gpu_2d: gpu@00134000 are not used at all with the vivante gpu driver.
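    Which compatible strings are actually present in the running device tree can be checked directly on the device, for example (the node paths may differ between BSP versions):

    Code
    for n in /proc/device-tree/soc/*gpu*; do
        echo "$n:"
        tr '\0' '\n' < "$n/compatible"
    done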


    We are currently working on a new release with kernel version 5.4.70 and Vivante GPU driver version 6.4.3.p1.4.

    We will try to contact NXP to find out if similar errors are known.


    Your F&S Support Team

    Hello radim.pavlik@tbs-biometrics.com,


    I have gone through your post in this thread, and after setting the parameter recovery=1, the application recovery took around 11 minutes. Have you found any solution/workaround for the galcore problem?