[pgpool-general: 9112] Re: Another segmentation fault

Tatsuo Ishii ishii at sraoss.co.jp
Mon Jun 3 12:40:27 JST 2024


Hi,

I had been traveling last week and had no chance to look into this.
I will start looking into them now.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp

> Hi,
> 
> Is there any progress on this issue? Today we triggered yet another
> segmentation fault. It seems similar to the previous one, but it is
> slightly different. I don't know how pgpool manages its backend
> connections, but this segmentation fault seems to suggest that something
> got corrupted in its pool. Freeing the corrupted connection then fails
> with a segmentation fault, and the same crash repeats on every
> subsequent attempt to get a connection. This resulted in 10 core dumps
> in under 2 minutes, all with the same backtrace.
> 
> The test scenario that failed covers emergency disaster recovery after
> a failed application upgrade in a cluster of 3 nodes. Prior to the
> upgrade, a file system snapshot is created on all nodes. When the
> upgrade fails, the nodes are rebooted and the snapshots are restored.
> This means all nodes will be down simultaneously and will come back up
> in an unpredictable order. We accept that pgpool may fail to get all
> nodes back in sync in this scenario, but segmentation faults always
> cause a failed test. The segmentation faults occur on node 3, which
> seems to be the second node to come back up again. The order is
> 2 -> 3 -> 1. The segmentation faults occur when node 1 joins the
> cluster.
> 
> #0  pool_create_cp () at protocol/pool_connection_pool.c:326
> #1  0x00005562c3dccbb5 in connect_backend (sp=0x5562c59aacd8,
> frontend=0x5562c59a8d88) at protocol/child.c:1051
> #2  0x00005562c3dcf02a in get_backend_connection (frontend=0x5562c59a8d88)
> at protocol/child.c:2112
> #3  0x00005562c3dcafd5 in do_child (fds=0x5562c595ea90) at
> protocol/child.c:416
> #4  0x00005562c3d90a4c in fork_a_child (fds=0x5562c595ea90, id=16) at
> main/pgpool_main.c:863
> #5  0x00005562c3d8fe30 in PgpoolMain (discard_status=0 '\000',
> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
> #6  0x00005562c3d8d9e6 in main (argc=2, argv=0x7ffe9916ea48) at
> main/main.c:365
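> 
> My guess (purely speculation; the sketch below is not the actual pgpool
> code and all names and types are made up) is that a scan over the
> per-backend slots of the pool dereferences a slot that is NULL after
> the failover. The sketch shows the kind of scan I mean, together with
> the guard that would avoid the crash:
> 
> /*
>  * Hypothetical sketch only -- not the actual pgpool-II code.  A child
>  * process keeps a small pool of entries, each with one slot per
>  * configured backend.  If the main backend id changes after a failover
>  * while an entry has no slot for the new main node, an unguarded scan
>  * dereferences a NULL pointer.
>  */
> #include <stddef.h>
> #include <time.h>
> 
> #define NUM_BACKENDS 3
> #define POOL_SIZE    4
> 
> typedef struct
> {
>     time_t closetime;       /* when the connection was released */
> } BackendSlot;
> 
> typedef struct
> {
>     BackendSlot *slots[NUM_BACKENDS];   /* may be NULL */
> } PoolEntry;
> 
> static PoolEntry pool[POOL_SIZE];
> static int       main_node_id;  /* changes when a failover happens */
> 
> /* Find the entry whose main-node connection was released longest ago. */
> static PoolEntry *
> find_oldest_entry(void)
> {
>     PoolEntry *oldest = NULL;
>     time_t     oldest_time = 0;
> 
>     for (int i = 0; i < POOL_SIZE; i++)
>     {
>         BackendSlot *slot = pool[i].slots[main_node_id];
> 
>         /* Without this guard, the comparison below dereferences NULL
>          * whenever the slot for the current main node was never created
>          * or was torn down during the failover. */
>         if (slot == NULL)
>             continue;
> 
>         if (oldest == NULL || slot->closetime < oldest_time)
>         {
>             oldest = &pool[i];
>             oldest_time = slot->closetime;
>         }
>     }
>     return oldest;
> }
> 
> int
> main(void)
> {
>     /* With an all-NULL pool and the guard above, this returns NULL
>      * instead of crashing. */
>     return find_oldest_entry() == NULL ? 0 : 1;
> }
> 
> If something along those lines is what happens at
> pool_connection_pool.c:326, it would also explain why every later
> attempt to get a connection crashes in the same way.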
> 
> Best regards,
> Emond
> 
> Op vr 24 mei 2024 om 11:48 schreef Emond Papegaaij <
> emond.papegaaij at gmail.com>:
> 
>> Hi,
>>
>> It turned out I wasn't entirely accurate in my previous mail. Our tests
>> downgrade docker on the nodes one by one, not in parallel, and give the
>> cluster some time to recover in between. This means pgpool and
>> postgresql are stopped simultaneously on a single node, docker is
>> downgraded and the containers are restarted. The crash occurs on node 1
>> when the containers are stopped on node 2. Node 2 is the first node on
>> which the containers are stopped. At that moment, node 1 is the watchdog
>> leader and runs the primary database.
>>
>> Most of our cluster tests start by resuming vms from a previously made
>> snapshot. This can cause major issues in both pgpool and postgresql, as
>> the machines experience gaps in time and might not recover in the
>> correct order, introducing unreliability in our tests. Therefore, we
>> stop all pgpool instances and the standby postgresql databases just
>> prior to creating the snapshot. After restoring the snapshots, we make
>> sure the database on node 1 is primary, start pgpool on nodes 1, 2 and 3
>> in that order, and perform a pg_basebackup for the databases on nodes 2
>> and 3 to make sure they are in sync and following node 1. This accounts
>> for the messages about failovers and stops/starts you see in the log
>> prior to the crash. This process is completed at 01:00:52Z.
>>
>> Best regards,
>> Emond
>>
>> Op vr 24 mei 2024 om 09:37 schreef Emond Papegaaij <
>> emond.papegaaij at gmail.com>:
>>
>>> Hi,
>>>
>>> Last night, another one of our test runs failed with a core dump of
>>> pgpool. At the moment of the crash, pgpool was part of a 3-node
>>> cluster: 3 vms, each running an instance of pgpool and a postgresql
>>> database. The crash happened on node 1 while setting up all 3 vms for
>>> the test that would follow. This test is about upgrading docker, so
>>> during the preparation, docker has to be downgraded. This requires all
>>> docker containers to be stopped, including the containers running
>>> pgpool and postgresql. Being a test, we do not care about availability,
>>> only about execution time, so this is done in parallel on all 3 vms.
>>> So on all 3 vms, pgpool and postgresql are stopped simultaneously. I
>>> can understand that this situation is difficult to handle correctly in
>>> pgpool, but it still should not cause a segmentation fault. I've
>>> attached the pgpool logs for the node that crashed and the core dump.
>>> I do have logging from the other nodes as well, if required. The crash
>>> happens at 01:01:00Z.
>>>
>>> #0  connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at
>>> protocol/child.c:1076
>>> #1  0x000055803ce3d02a in get_backend_connection
>>> (frontend=0x55803eb08768) at protocol/child.c:2112
>>> #2  0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at
>>> protocol/child.c:416
>>> #3  0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at
>>> main/pgpool_main.c:863
>>> #4  0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000',
>>> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>>> #5  0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at
>>> main/main.c:365
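>>>
>>> Purely as an illustration (hypothetical code with made-up names, not
>>> pgpool's actual connect_backend()), the sketch below shows the kind of
>>> defensive handling I would expect when the per-backend connections are
>>> set up: if opening the connection for one backend fails while the
>>> containers are being stopped, the session is aborted instead of
>>> leaving a slot pointer behind that later code dereferences:
>>>
>>> /*
>>>  * Hypothetical sketch only -- not pgpool's actual connect_backend().
>>>  * Open one connection per backend and abort the session cleanly when
>>>  * any of them fails, instead of continuing with a NULL slot.
>>>  */
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>>
>>> #define NUM_BACKENDS 3
>>>
>>> typedef struct { int fd; } ConnectionSlot;
>>>
>>> /* Stand-in for the real per-backend connect; returns NULL on failure,
>>>  * e.g. when the backend was shut down while the child was connecting. */
>>> static ConnectionSlot *
>>> connect_one_backend(int node_id)
>>> {
>>>     (void) node_id;
>>>     return NULL;          /* simulate a backend that just went away */
>>> }
>>>
>>> static int
>>> connect_all_backends(ConnectionSlot *slots[NUM_BACKENDS])
>>> {
>>>     for (int i = 0; i < NUM_BACKENDS; i++)
>>>     {
>>>         slots[i] = connect_one_backend(i);
>>>
>>>         if (slots[i] == NULL)
>>>         {
>>>             /* Fail the whole session rather than continue with a
>>>              * half-initialized slot array. */
>>>             fprintf(stderr, "backend %d not reachable, giving up\n", i);
>>>             return -1;
>>>         }
>>>     }
>>>     return 0;
>>> }
>>>
>>> int
>>> main(void)
>>> {
>>>     ConnectionSlot *slots[NUM_BACKENDS] = {NULL};
>>>
>>>     return connect_all_backends(slots) == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
>>> }
>>>
>>> Since postgresql and pgpool are stopped at the same time in this test,
>>> a failed per-backend connect is very likely, which might be why a
>>> missing check of this kind would show up exactly in our scenario.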
>>>
>>> Best regards,
>>> Emond
>>>
>>


