<div dir="ltr"><div>Hi,</div><div><br></div><div>No worries. I hope you had a good trip. Last night we triggered the most recent crash again. Is there anything we can do to make it easier for you to find the cause?</div><div><br></div><div>Best regards,</div><div>Emond</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 3 Jun 2024 at 05:40, Tatsuo Ishii &lt;<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

<br>

I was traveling last week and had no chance to look into this.<br>

I will start looking into the crashes now.<br>

<br>

Best regards,<br>

--<br>

Tatsuo Ishii<br>

SRA OSS LLC<br>

English: <a href="http://www.sraoss.co.jp/index_en/" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en/</a><br>

Japanese: <a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>

<br>

> Hi,<br>

> <br>

> Is there any progress on this issue? Today we triggered yet another<br>

> segmentation fault, which seems similar to the previous one but is<br>

> slightly different. I don't know how pgpool manages its backend<br>

> connections, but this segmentation fault seems to suggest that something got<br>

> corrupted in its pool. This causes a segmentation fault when trying to free the<br>

> connection, so the free fails, triggering segmentation faults over<br>

> and over again on every attempt to get a connection. This resulted in 10<br>

> core dumps in under 2 minutes, all with the same backtrace.<br>

> <br>

> The test scenario that failed is about emergency disaster recovery after a<br>

> failed application upgrade in a cluster of 3 nodes. Prior to the update, a<br>

> file system snapshot is created on all nodes. When the upgrade fails, the<br>

> nodes are rebooted and the snapshots are restored. This means all nodes<br>

> will be down simultaneously and will come back up in an unpredictable<br>

> order. We do accept it when pgpool fails to get all nodes back in sync in<br>

> this scenario, but segmentation faults will always cause a failed test. The<br>

> segmentation faults occur on node 3, which seems to be the second node to<br>

> come back up again. The order is 2 -> 3 -> 1. The segmentation faults occur<br>

> when node 1 joins the cluster.<br>

> <br>

> #0  pool_create_cp () at protocol/pool_connection_pool.c:326<br>

> #1  0x00005562c3dccbb5 in connect_backend (sp=0x5562c59aacd8,<br>

> frontend=0x5562c59a8d88) at protocol/child.c:1051<br>

> #2  0x00005562c3dcf02a in get_backend_connection (frontend=0x5562c59a8d88)<br>

> at protocol/child.c:2112<br>

> #3  0x00005562c3dcafd5 in do_child (fds=0x5562c595ea90) at<br>

> protocol/child.c:416<br>

> #4  0x00005562c3d90a4c in fork_a_child (fds=0x5562c595ea90, id=16) at<br>

> main/pgpool_main.c:863<br>

> #5  0x00005562c3d8fe30 in PgpoolMain (discard_status=0 '\000',<br>

> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561<br>

> #6  0x00005562c3d8d9e6 in main (argc=2, argv=0x7ffe9916ea48) at<br>

> main/main.c:365<br>

> <br>

> Best regards,<br>

> Emond<br>

> <br>

> On Fri, 24 May 2024 at 11:48, Emond Papegaaij &lt;<br>

> <a href="mailto:emond.papegaaij@gmail.com" target="_blank">emond.papegaaij@gmail.com</a>&gt; wrote:<br>

> <br>

>> Hi,<br>

>><br>

>> It turns out I wasn't entirely accurate in my previous mail. Our tests<br>

>> perform the downgrade of docker on the nodes one-by-one, not in parallel,<br>

>> and give the cluster some time to recover in between. This means pgpool and<br>

>> postgresql are stopped simultaneously on a single node, docker is<br>

>> downgraded and the containers are restarted. The crash occurs on node 1<br>

>> when the containers are stopped on node 2. Node 2 is the first node on<br>

>> which the containers are stopped. At that moment, node 1 is the watchdog<br>

>> leader and runs the primary database.<br>

>><br>

>> Most of our cluster tests start by resuming vms from a previously made<br>

>> snapshot. This can cause major issues in both pgpool and postgresql, as the<br>

>> machines experience gaps in time and might not recover in the correct<br>

>> order, introducing unreliability in our tests. Therefore, we stop all<br>

>> pgpool instances and the standby postgresql databases just prior to<br>

>> creating the snapshot. After restoring the snapshots, we make sure the<br>

>> database on node 1 is primary, start pgpool on node 1, 2 and 3 in that<br>

>> order, and perform a pg_basebackup for the database on node 2 and 3 to make<br>

>> sure they are in sync and following node 1. This accounts for the messages<br>

>> about failovers and stops/starts you see in the log prior to the crash.<br>

>> This process is completed at 01:00:52Z.<br>

>><br>

>> Best regards,<br>

>> Emond<br>

>><br>

>> On Fri, 24 May 2024 at 09:37, Emond Papegaaij &lt;<br>

>> <a href="mailto:emond.papegaaij@gmail.com" target="_blank">emond.papegaaij@gmail.com</a>&gt; wrote:<br>

>><br>

>>> Hi,<br>

>>><br>

>>> Last night, another one of our test runs failed with a core dump of<br>

>>> pgpool. At the moment of the crash, pgpool was part of a 3-node cluster: 3<br>

>>> vms, each running an instance of pgpool and a postgresql database. The<br>

>>> crash happened on node 1 while setting up all 3 vms for the test that would<br>

>>> follow. This test is about upgrading docker, so during the preparation,<br>

>>> docker has to be downgraded. This requires all docker containers being<br>

>>> stopped, including the containers running pgpool and postgresql. Being a<br>

>>> test, we do not care about availability, only about execution time, so this<br>

>>> is done in parallel on all 3 vms. So on all 3 vms, pgpool and postgresql<br>

>>> are stopped simultaneously. I can understand that this situation is<br>

>>> difficult to handle correctly in pgpool, but still it should not cause a<br>

>>> segmentation fault. I've attached the pgpool logs for the node that crashed<br>

>>> and the core dump. I do have logging from the other nodes as well, if<br>

>>> required. The crash happens at 01:01:00Z.<br>

>>><br>

>>> #0  connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at<br>

>>> protocol/child.c:1076<br>

>>> #1  0x000055803ce3d02a in get_backend_connection<br>

>>> (frontend=0x55803eb08768) at protocol/child.c:2112<br>

>>> #2  0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at<br>

>>> protocol/child.c:416<br>

>>> #3  0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at<br>

>>> main/pgpool_main.c:863<br>

>>> #4  0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000',<br>

>>> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561<br>

>>> #5  0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at<br>

>>> main/main.c:365<br>

>>><br>

>>> Best regards,<br>

>>> Emond<br>

>>><br>

>><br>

</blockquote></div></div>