[pgpool-general: 9130] Re: Another segmentation fault

Emond Papegaaij emond.papegaaij at gmail.com
Fri Jun 14 16:31:36 JST 2024


Hi,

Thanks for the patch. I'll add it to the build of our pgpool containers.

Since applying the previous patch, I have the impression that the number
of segmentation faults has gone down. It's hard to say whether the
problem is really fixed, though, because some of these crashes happen
very infrequently.

Yesterday we saw 2 crashes, both with the same backtrace, one I hadn't
seen before:
#0  pool_do_auth (frontend=0x55bc42c93788, cp=0x7ff7ca34f6b8) at
auth/pool_auth.c:349
#1  0x000055bc41856ea1 in connect_backend (sp=0x55bc42c956d8,
frontend=0x55bc42c93788) at protocol/child.c:1102
#2  0x000055bc41859042 in get_backend_connection (frontend=0x55bc42c93788)
at protocol/child.c:2111
#3  0x000055bc41854fd5 in do_child (fds=0x55bc42c49320) at
protocol/child.c:416
#4  0x000055bc4181aa4c in fork_a_child (fds=0x55bc42c49320, id=5) at
main/pgpool_main.c:863
#5  0x000055bc418256f7 in exec_child_restart
(failover_context=0x7ffdf033c6b0, node_id=0) at main/pgpool_main.c:4684
#6  0x000055bc4181d1dc in failover () at main/pgpool_main.c:1739
#7  0x000055bc4181c79e in sigusr1_interrupt_processor () at
main/pgpool_main.c:1507
#8  0x000055bc418263c3 in check_requests () at main/pgpool_main.c:4934
#9  0x000055bc4181a32b in PgpoolMain (discard_status=0 '\000',
clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
#10 0x000055bc418179e6 in main (argc=2, argv=0x7ffdf0349878) at
main/main.c:365

Looking at the backtrace, this happens right after the place where the
first patch made a change. The slots in 'cp' are {0x0, 0x7ff7ca3479b8,
0x7ff7ca347988, 0x0 <repeats 125 times>}. I've attached the logs and the
coredump. In both cases, the crash happened during a different test than
before: the test for our normal upgrade procedure. The crash occurs while
the cluster is in a fully consistent configuration: all 3 nodes are up
and healthy, and node 0 is both the watchdog leader and runs the primary
database. Because node 0 is the first to be upgraded, we need to initiate a
failover. For this, we first restart pgpool to force it to drop its leader
status and wait until a new leader has been elected. We then stop the
database to trigger the failover. The crash seems to happen when the
database comes back up. In the attached logs the crash was at node 0, but
in the other failure the crash happened at the exact same moment in the
test, at node 2. I'm also seeing some suspicious backend status reports
from pcp_node_info at that time:

Node 0 status: 172.29.30.1 5432 2 0.333333 up up standby standby 0 none none 2024-06-13 12:05:00 at Thu Jun 13 12:05:00 CEST 2024
Node 0 status: 172.29.30.1 5432 3 0.333333 down up standby standby 130337 streaming async 2024-06-13 12:05:01 at Thu Jun 13 12:05:01 CEST

PS: I had to resend the mail with the attachments bzip2-compressed
because it was hitting the size limit.

Best regards,
Emond

On Fri, 14 Jun 2024 at 02:28, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> > For the second crash (in pool_connection_pool.c:326), all slots are null:
> > (gdb) info locals
> > i = 0
> > freed = 0
> > closetime = 0
> > oldestp = 0x7f474f36e168
> > ret = 0x0
> > info = 0x55b511283469 <opt_sort>
> > p = 0x7f474f36e168
> >
> > (gdb) p *p
> > $7 = {
> >   info = 0x7f47470d5c08,
> >   slots = {0x0 <repeats 128 times>}
> > }
> >
> > (gdb) p p->slots[0]
> > $10 = (POOL_CONNECTION_POOL_SLOT *) 0x0
> > (gdb) p p->slots[1]
> > $11 = (POOL_CONNECTION_POOL_SLOT *) 0x0
> > (gdb) p p->slots[2]
> > $12 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>
> A possible explanation: Req_info->main_node_id, which is the id of the
> lowest-numbered live backend node, was -1. From the log file just
> before pid 30 segfaults:
>
> 2024-05-31T07:16:45.908900+02:00 2024-05-31 07:16:45: pid 1: LOG:
> backend:0 is set to down status
> 2024-05-31T07:16:45.908939+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
> backend:0 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
> 2024-05-31T07:16:45.908987+02:00 2024-05-31 07:16:45: pid 1: LOG:
> backend:1 is set to down status
> 2024-05-31T07:16:45.909017+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
> backend:1 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
> 2024-05-31T07:16:45.909044+02:00 2024-05-31 07:16:45: pid 1: LOG:
> backend:2 is set to down status
> 2024-05-31T07:16:45.909071+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
> backend:2 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5
>
> That means all backends were down, and Req_info->main_node_id should
> have been -1 at the time. Usually, if all backends are down, pgpool
> will not accept connections from a frontend. I think that when the
> check was performed, not all backends were down yet; I guess they all
> went down right after. If so, at line 270 of pool_connection_pool.c:
>
> 270                 if (MAIN_CONNECTION(p) == NULL)
>
> MAIN_CONNECTION (that is, p->slots[-1], because main_node_id == -1)
> read garbage, so the condition above was not met. The code then
> proceeded to line 326:
>
>  326                         pool_free_startup_packet(CONNECTION_SLOT(p,
> i)->sp);
>
> Here CONNECTION_SLOT(p, i) was NULL, and dereferencing it segfaulted.
>
> I think using the MAIN_CONNECTION macro to manage the connection
> pools is incorrect: which backends have a live connection in a given
> connection pool slot can differ from which backends are currently
> alive. So I wrote a function, in_use_backend_id(), for searching for
> live backend ids in a connection slot. Attached is a patch in this
> direction. I hope the patch fixes the second segfault case.
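The direction described above can be sketched in miniature. The types and names below are simplified, invented stand-ins, not the actual patch or the real pgpool definitions: the point is only that the pool's own slots are scanned, rather than trusting a global main_node_id that can be -1 when every backend is down.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUM_BACKENDS 128

/* Simplified stand-in for pgpool's POOL_CONNECTION_POOL; only the
 * slots array matters here, and a NULL slot means "no pooled
 * connection for that backend". These names are invented for this
 * sketch, not taken from the pgpool sources. */
typedef struct
{
    void *slots[MAX_NUM_BACKENDS];
} pool_sketch;

/* Scan the pool itself for the first slot that actually holds a
 * connection, instead of trusting a global main_node_id (which can be
 * -1 when every backend is down, turning a MAIN_CONNECTION-style
 * macro into the out-of-bounds read p->slots[-1]). Returns the
 * backend id, or -1 if no slot is in use, so the caller can reject
 * the pool cleanly instead of dereferencing garbage. */
int
first_in_use_backend_id(const pool_sketch *p)
{
    for (int i = 0; i < MAX_NUM_BACKENDS; i++)
        if (p->slots[i] != NULL)
            return i;
    return -1;
}
```

With the slot states seen in the two crashes, a pool like {0x0, live, live, 0x0 ...} yields backend id 1, while the all-NULL pool from the second crash yields -1, which the caller can check before touching any slot.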
>
> > I hope this helps.
> >
> > Best regards,
> > Emond
> >
> > On Tue, 4 Jun 2024 at 04:56, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> >
> >> > No worries. I hope you had a good trip. Last night we triggered the
> last
> >> > crash again. Is there anything we can do to make it easier for you to
> >> find
> >> > the cause?
> >>
> >> It would be helpful if you could share some variable values in the
> >> core file. Since I don't have the pgpool load module from when you
> >> got the core, I cannot inspect the variables using the core you
> >> provided.
> >>
> >> > #0  connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at
> >> > protocol/child.c:1076
> >> > #1  0x000055803ce3d02a in get_backend_connection
> >> (frontend=0x55803eb08768)
> >> > at protocol/child.c:2112
> >> > #2  0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at
> >> > protocol/child.c:416
> >> > #3  0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at
> >> > main/pgpool_main.c:863
> >> > #4  0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000',
> >> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
> >> > #5  0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at
> >> > main/main.c:365
> >>
> >> Around protocol/child.c:1076:
> >>                                 /* set DB node id */
> >>                                 pool_set_db_node_id(CONNECTION(backend,
> >> i), i);
> >>
> >> I want to see the values in the "backend" struct. Since the
> >> CONNECTION macro is used here, you have to run something like this
> >> in a gdb session:
> >>
> >> p *backend->slots[0]
> >> p *backend->slots[1]
> >> p *backend->slots[2]
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS LLC
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese:http://www.sraoss.co.jp
> >>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool-upgrade-test-crash.log.bz2
Type: application/x-bzip2
Size: 749512 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240614/235ffb92/attachment.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: core.pgpool.208.71356cb34a6a4b6eb2fe393d367aae45.123470.1718273101000000.bz2
Type: application/x-bzip2
Size: 914754 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240614/235ffb92/attachment-0001.bz2>

