[pgpool-general: 9113] Re: Another segmentation fault

Emond Papegaaij emond.papegaaij at gmail.com
Mon Jun 3 16:17:29 JST 2024


Hi,

No worries. I hope you had a good trip. Last night we triggered the most
recently reported crash again. Is there anything we can do to make it
easier for you to find the cause?
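
If it helps and the new core dump is intact, we can also pull more detail
out of it. A rough sketch of what we could run on the node (the binary and
core file paths are placeholders; the frame number refers to the backtrace
quoted below):

  gdb /usr/sbin/pgpool /path/to/core    # open the dump against the binary
  (gdb) bt full                         # full backtrace with local variables
  (gdb) frame 1                         # connect_backend()
  (gdb) print *sp                       # startup packet of the frontend
  (gdb) info locals                     # other locals in that frame

Just let us know which output would be useful.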

Best regards,
Emond

On Mon, Jun 3, 2024 at 05:40, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> Hi,
>
> I had been traveling last week and had no chance to look into this.
> I will start to study them.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese: http://www.sraoss.co.jp
>
> > Hi,
> >
> > Is there any progress on this issue? Today we've triggered yet another
> > segmentation fault that looks similar to the previous one, but is
> > slightly different. I don't know how pgpool manages its backend
> > connections, but this segmentation fault seems to suggest that something
> > got corrupted in its pool: freeing the connection crashes with a
> > segmentation fault, so the free never completes, and every subsequent
> > attempt to get a connection crashes again. This resulted in 10 core
> > dumps in under 2 minutes, all with the same backtrace.
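> >
> > In case it is useful, a quick way to compare the dumps is something
> > like the following (the binary and core file locations are placeholders
> > for our setup):
> >
> >   for c in /path/to/core.pgpool.*; do
> >       gdb -batch -ex bt /usr/sbin/pgpool "$c"
> >   done
> >
> > Every dump shows the backtrace below, ending up in pool_create_cp().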
> >
> > The test scenario that failed is about emergency disaster recovery
> > after a failed application upgrade in a cluster of 3 nodes. Prior to
> > the upgrade, a file system snapshot is created on all nodes. When the
> > upgrade fails, the nodes are rebooted and the snapshots are restored.
> > This means all nodes will be down simultaneously and will come back up
> > in an unpredictable order. We accept that pgpool may fail to get all
> > nodes back in sync in this scenario, but segmentation faults always
> > cause a failed test. The segmentation faults occur on node 3, which
> > seems to be the second node to come back up again. The order is
> > 2 -> 3 -> 1. The segmentation faults occur when node 1 joins the
> > cluster.
> >
> > #0  pool_create_cp () at protocol/pool_connection_pool.c:326
> > #1  0x00005562c3dccbb5 in connect_backend (sp=0x5562c59aacd8, frontend=0x5562c59a8d88) at protocol/child.c:1051
> > #2  0x00005562c3dcf02a in get_backend_connection (frontend=0x5562c59a8d88) at protocol/child.c:2112
> > #3  0x00005562c3dcafd5 in do_child (fds=0x5562c595ea90) at protocol/child.c:416
> > #4  0x00005562c3d90a4c in fork_a_child (fds=0x5562c595ea90, id=16) at main/pgpool_main.c:863
> > #5  0x00005562c3d8fe30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
> > #6  0x00005562c3d8d9e6 in main (argc=2, argv=0x7ffe9916ea48) at main/main.c:365
> >
> > Best regards,
> > Emond
> >
> > On Fri, May 24, 2024 at 11:48, Emond Papegaaij <emond.papegaaij at gmail.com> wrote:
> >
> >> Hi,
> >>
> >> It turned out that I wasn't entirely accurate in my previous mail. Our
> >> tests perform the downgrade of docker on the nodes one-by-one, not in
> >> parallel, and give the cluster some time to recover in between. This
> >> means pgpool and postgresql are stopped simultaneously on a single
> >> node, docker is downgraded and the containers are restarted. The crash
> >> occurs on node 1 when the containers are stopped on node 2. Node 2 is
> >> the first node on which the containers are stopped. At that moment,
> >> node 1 is the watchdog leader and runs the primary database.
> >>
> >> Most of our cluster tests start by resuming VMs from a previously made
> >> snapshot. This can cause major issues in both pgpool and postgresql, as
> >> the machines experience gaps in time and might not recover in the
> >> correct order, introducing unreliability in our tests. Therefore, we
> >> stop all pgpool instances and the standby postgresql databases just
> >> prior to creating the snapshot. After restoring the snapshots, we make
> >> sure the database on node 1 is primary, start pgpool on nodes 1, 2 and
> >> 3 in that order, and perform a pg_basebackup for the databases on
> >> nodes 2 and 3 to make sure they are in sync and following node 1. This
> >> accounts for the messages about failovers and stops/starts you see in
> >> the log prior to the crash. This process is completed at 01:00:52Z.
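> >>
> >> For reference, the resync of each standby is essentially a plain
> >> pg_basebackup from node 1. A rough sketch (the hostname, data
> >> directory and replication user are placeholders; in reality this runs
> >> inside our containers):
> >>
> >>   # on node 2 and node 3: discard the old data and follow node 1 again
> >>   pg_ctl stop -D /var/lib/postgresql/data -m fast
> >>   rm -rf /var/lib/postgresql/data/*
> >>   pg_basebackup -h node1 -U replicator -D /var/lib/postgresql/data \
> >>       -X stream -R
> >>   pg_ctl start -D /var/lib/postgresql/data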
> >>
> >> Best regards,
> >> Emond
> >>
> >> On Fri, May 24, 2024 at 09:37, Emond Papegaaij <emond.papegaaij at gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> Last night, another one of our test runs failed with a core dump of
> >>> pgpool. At the moment of the crash, pgpool was part of a 3-node
> >>> cluster: 3 VMs, each running an instance of pgpool and a postgresql
> >>> database. The crash happened on node 1 while setting up all 3 VMs for
> >>> the test that would follow. This test is about upgrading docker, so
> >>> during the preparation, docker has to be downgraded. This requires
> >>> all docker containers to be stopped, including the containers running
> >>> pgpool and postgresql. Being a test, we do not care about
> >>> availability, only about execution time, so this is done in parallel
> >>> on all 3 VMs: on all 3 VMs, pgpool and postgresql are stopped
> >>> simultaneously. I can understand that such a situation is difficult
> >>> to handle correctly in pgpool, but it still should not cause a
> >>> segmentation fault. I've attached the pgpool logs for the node that
> >>> crashed and the core dump. I do have logging from the other nodes as
> >>> well, if required. The crash happens at 01:01:00Z.
> >>>
> >>> #0  connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at protocol/child.c:1076
> >>> #1  0x000055803ce3d02a in get_backend_connection (frontend=0x55803eb08768) at protocol/child.c:2112
> >>> #2  0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at protocol/child.c:416
> >>> #3  0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at main/pgpool_main.c:863
> >>> #4  0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
> >>> #5  0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at main/main.c:365
> >>>
> >>> Best regards,
> >>> Emond
> >>>
> >>
>