<div dir="ltr">Hi,<div><br></div><div>Is there any progress on this issue? Today we've triggered yet another segmentation fault. It looks similar to the previous one, but is slightly different. I don't know how pgpool manages its backend connections, but this segmentation fault seems to suggest something got corrupted in its pool: freeing the connection triggers a segmentation fault, so the connection is never released, and every subsequent attempt to get a connection segfaults again. This resulted in 10 core dumps in under 2 minutes, all with the same backtrace.</div><div><br></div><div>The test scenario that failed covers emergency disaster recovery after a failed application upgrade in a cluster of 3 nodes. Prior to the upgrade, a file system snapshot is created on all nodes. When the upgrade fails, the nodes are rebooted and the snapshots are restored. This means all nodes go down simultaneously and come back up in an unpredictable order. We accept that pgpool may fail to get all nodes back in sync in this scenario, but a segmentation fault always counts as a failed test. The segmentation faults occur on node 3, which seems to be the second node to come back up; the order is 2 -> 3 -> 1. 
The segmentation faults occur when node 1 joins the cluster.</div><div><br></div><div>#0 pool_create_cp () at protocol/pool_connection_pool.c:326<br>#1 0x00005562c3dccbb5 in connect_backend (sp=0x5562c59aacd8, frontend=0x5562c59a8d88) at protocol/child.c:1051<br>#2 0x00005562c3dcf02a in get_backend_connection (frontend=0x5562c59a8d88) at protocol/child.c:2112<br>#3 0x00005562c3dcafd5 in do_child (fds=0x5562c595ea90) at protocol/child.c:416<br>#4 0x00005562c3d90a4c in fork_a_child (fds=0x5562c595ea90, id=16) at main/pgpool_main.c:863<br>#5 0x00005562c3d8fe30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561<br>#6 0x00005562c3d8d9e6 in main (argc=2, argv=0x7ffe9916ea48) at main/main.c:365<br></div><div><br></div><div>Best regards,</div><div>Emond</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, May 24, 2024 at 11:48 Emond Papegaaij <<a href="mailto:emond.papegaaij@gmail.com">emond.papegaaij@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi,</div><div><br></div><div>It turned out I wasn't entirely accurate in my previous mail. Our tests perform the downgrade of docker on the nodes one-by-one, not in parallel, and give the cluster some time to recover in between. This means pgpool and postgresql are stopped simultaneously on a single node, docker is downgraded, and the containers are restarted. The crash occurs on node 1 when the containers are stopped on node 2. Node 2 is the first node on which the containers are stopped. At that moment, node 1 is the watchdog leader and runs the primary database.</div><div><br></div><div>Most of our cluster tests start by resuming vms from a previously made snapshot. 
This can cause major issues in both pgpool and postgresql, as the machines experience gaps in time and might not recover in the correct order, introducing unreliability in our tests. Therefore, we stop all pgpool instances and the standby postgresql databases just prior to creating the snapshot. After restoring the snapshots, we make sure the database on node 1 is primary, start pgpool on nodes 1, 2 and 3 in that order, and perform a pg_basebackup for the databases on nodes 2 and 3 to make sure they are in sync and following node 1. This accounts for the messages about failovers and stops/starts you see in the log prior to the crash. This process is completed at 01:00:52Z.</div><div><br></div><div>Best regards,</div><div>Emond</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, May 24, 2024 at 09:37 Emond Papegaaij <<a href="mailto:emond.papegaaij@gmail.com" target="_blank">emond.papegaaij@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi,<div><br></div><div>Last night, another one of our test runs failed with a core dump of pgpool. At the moment of the crash, pgpool was part of a 3 node cluster: 3 vms, each running an instance of pgpool and a postgresql database. The crash happened on node 1 while setting up all 3 vms for the test that would follow. This test is about upgrading docker, so during the preparation, docker has to be downgraded. This requires stopping all docker containers, including the containers running pgpool and postgresql. Being a test, we do not care about availability, only about execution time, so this is done in parallel on all 3 vms. So on all 3 vms, pgpool and postgresql are stopped simultaneously. I can understand that such a situation is difficult to handle correctly in pgpool, but it should still not cause a segmentation fault. I've attached the pgpool logs for the node that crashed and the core dump. 
I do have logging from the other nodes as well, if required. The crash happens at 01:01:00Z.</div><div><br></div><div>#0 connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at protocol/child.c:1076<br>#1 0x000055803ce3d02a in get_backend_connection (frontend=0x55803eb08768) at protocol/child.c:2112<br>#2 0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at protocol/child.c:416<br>#3 0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at main/pgpool_main.c:863<br>#4 0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561<br>#5 0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at main/main.c:365<br></div><div><br></div><div>Best regards,</div><div>Emond</div></div>
</blockquote></div></div>
</blockquote></div>
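For reference, the snapshot-restore recovery sequence described in this thread (stop every pgpool and the standby databases, snapshot, restore, make node 1 primary, start pgpool on nodes 1, 2 and 3 in that order, then re-seed the standbys with pg_basebackup) could be sketched roughly as follows. The host names, systemd service names, data directory and ssh-based orchestration are illustrative assumptions, not the actual test tooling; by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the snapshot-restore recovery sequence described above.
# node1..node3, service names and /var/lib/pgsql/data are assumptions,
# not the poster's actual environment.
set -eu

DRY_RUN="${DRY_RUN:-1}"  # default: only print the commands that would run

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

recover_cluster() {
    # Before the snapshot: stop every pgpool instance and the standby
    # databases, so no node resumes with stale watchdog/replication state.
    for node in node1 node2 node3; do
        run ssh "$node" systemctl stop pgpool
    done
    for node in node2 node3; do
        run ssh "$node" systemctl stop postgresql
    done

    # (snapshot is taken here; later the snapshots are restored and the
    #  nodes reboot in an unpredictable order)

    # After the restore: make sure the database on node 1 is primary,
    # then start pgpool on nodes 1, 2 and 3 in that order.
    run ssh node1 pg_ctl promote -D /var/lib/pgsql/data
    for node in node1 node2 node3; do
        run ssh "$node" systemctl start pgpool
    done

    # Re-seed the standbys from node 1 so they follow the primary again.
    # (pg_basebackup needs an empty target directory; -R writes the
    #  standby configuration.)
    for node in node2 node3; do
        run ssh "$node" pg_basebackup -h node1 -D /var/lib/pgsql/data -R
    done
}

recover_cluster
```

Running it with DRY_RUN=1 (the default) prints the full ordered command plan, which is also how the sequencing can be inspected without touching any node.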