[pgpool-general: 9111] Re: Another segmentation fault
Emond Papegaaij
emond.papegaaij at gmail.com
Fri May 31 20:54:36 JST 2024
Hi,
Is there any progress on this issue? Today we've triggered yet another
segmentation fault that seems similar to the previous one, but is
slightly different. I don't know how pgpool manages its backend
connections, but this segmentation fault seems to suggest that something
got corrupted in its pool. Freeing the corrupted connection then triggers
a segmentation fault, so the free never completes, and the fault recurs
on every subsequent attempt to get a connection. This resulted in 10
core dumps in under 2 minutes, all with the same backtrace.
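For illustration only, here is a minimal sketch of the kind of defensive guard that would turn a repeated free of a corrupted or already-released pool slot into a no-op instead of a crash. This is not pgpool's actual code; `PoolSlot` and `release_slot` are hypothetical names:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified model of a per-child connection pool slot.
 * It only illustrates the defensive pattern; pgpool's real structures
 * in pool_connection_pool.c are more involved. */
typedef struct
{
	int backend_id;
	void *connection; /* opaque backend connection, NULL when unused */
} PoolSlot;

/* Free a slot defensively: tolerate a NULL slot and an already-released
 * connection, and clear the pointer so a second call is a harmless no-op
 * rather than a double free. Returns 1 if something was released. */
int release_slot(PoolSlot *slot)
{
	if (slot == NULL || slot->connection == NULL)
		return 0; /* nothing to do: slot empty or already freed */
	free(slot->connection);
	slot->connection = NULL; /* prevent double free on retry */
	return 1;
}
```

If the pool bookkeeping is corrupted, a guard like this would not fix the corruption itself, but it would keep every subsequent child process from dumping core on the same slot.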
The test scenario that failed is about emergency disaster recovery after a
failed application upgrade in a cluster of 3 nodes. Prior to the update, a
file system snapshot is created on all nodes. When the upgrade fails, the
nodes are rebooted and the snapshots are restored. This means all nodes
will be down simultaneously and will come back up in an unpredictable
order. We accept that pgpool may fail to get all nodes back in sync in
this scenario, but segmentation faults will always cause a failed test. The
segmentation faults occur on node 3, which seems to be the second node to
come back up again. The order is 2 -> 3 -> 1. The segmentation faults occur
when node 1 joins the cluster.
#0 pool_create_cp () at protocol/pool_connection_pool.c:326
#1 0x00005562c3dccbb5 in connect_backend (sp=0x5562c59aacd8,
frontend=0x5562c59a8d88) at protocol/child.c:1051
#2 0x00005562c3dcf02a in get_backend_connection (frontend=0x5562c59a8d88)
at protocol/child.c:2112
#3 0x00005562c3dcafd5 in do_child (fds=0x5562c595ea90) at
protocol/child.c:416
#4 0x00005562c3d90a4c in fork_a_child (fds=0x5562c595ea90, id=16) at
main/pgpool_main.c:863
#5 0x00005562c3d8fe30 in PgpoolMain (discard_status=0 '\000',
clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
#6 0x00005562c3d8d9e6 in main (argc=2, argv=0x7ffe9916ea48) at
main/main.c:365
Best regards,
Emond
Op vr 24 mei 2024 om 11:48 schreef Emond Papegaaij <
emond.papegaaij at gmail.com>:
> Hi,
>
> It turned out, I wasn't entirely accurate in my previous mail. Our tests
> perform the downgrade of docker on the nodes one-by-one, not in parallel,
> and give the cluster some time to recover in between. This means pgpool and
> postgresql are stopped simultaneously on a single node, docker is
> downgraded and the containers are restarted. The crash occurs on node 1
> when the containers are stopped on node 2. Node 2 is the first node on
> which the containers are stopped. At that moment, node 1 is the watchdog
> leader and runs the primary database.
>
> Most of our cluster tests start by resuming vms from a previously made
> snapshot. This can cause major issues in both pgpool and postgresql, as the
> machines experience gaps in time and might not recover in the correct
> order, introducing unreliability in our tests. Therefore, we stop all
> pgpool instances and the standby postgresql databases just prior to
> creating the snapshot. After restoring the snapshots, we make sure the
> database on node 1 is primary, start pgpool on node 1, 2 and 3 in that
> order, and perform a pg_basebackup for the database on node 2 and 3 to make
> sure they are in sync and following node 1. This accounts for the messages
> about failovers and stops/starts you see in the log prior to the crash.
> This process is completed at 01:00:52Z.
>
> Best regards,
> Emond
>
> Op vr 24 mei 2024 om 09:37 schreef Emond Papegaaij <
> emond.papegaaij at gmail.com>:
>
>> Hi,
>>
>> Last night, another one of our test runs failed with a core dump of
>> pgpool. At the moment of the crash, pgpool was part of a 3-node cluster:
>> 3 vms, each running an instance of pgpool and a postgresql database. The
>> crash happened on node 1 while setting up all 3 vms for the test that would
>> follow. This test is about upgrading docker, so during the preparation,
>> docker has to be downgraded. This requires all docker containers being
>> stopped, including the containers running pgpool and postgresql. Being a
>> test, we do not care about availability, only about execution time, so this
>> is done in parallel on all 3 vms. So on all 3 vms, pgpool and postgresql
>> are stopped simultaneously. I can understand that such a situation is
>> difficult to handle correctly in pgpool, but it still should not cause a
>> segmentation fault. I've attached the pgpool logs for the node that crashed
>> and the core dump. I do have logging from the other nodes as well, if
>> required. The crash happens at 01:01:00Z.
>>
>> #0 connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at
>> protocol/child.c:1076
>> #1 0x000055803ce3d02a in get_backend_connection
>> (frontend=0x55803eb08768) at protocol/child.c:2112
>> #2 0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at
>> protocol/child.c:416
>> #3 0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at
>> main/pgpool_main.c:863
>> #4 0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000',
>> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>> #5 0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at
>> main/main.c:365
>>
>> Best regards,
>> Emond
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool-segfault.log
Type: text/x-log
Size: 349412 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240531/32b69159/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: core.pgpool.208.d5ff746e487c489ebdb1d024d1d2c2c2.6222.1717132814000000.lz4
Type: application/x-lz4
Size: 6299566 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240531/32b69159/attachment-0001.bin>