[pgpool-general: 9136] Re: Another segmentation fault

Emond Papegaaij emond.papegaaij at gmail.com
Thu Jun 20 16:05:47 JST 2024


Hi,

I'm very sorry. Since we've increased the logging of the container, we
sometimes seem to get gaps in the logs during reboots. I guess the
virtual machine is having trouble getting all logs out to disk before it
reboots. The attached logfile does contain the logs for the segmentation
fault of this crash.

Best regards,
Emond

On Wed, Jun 19, 2024 at 14:40, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> Sorry for the delay. I looked into pgpool-upgrade-test-crash.log.bz2, but
> failed to find the "segmentation fault" string.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese: http://www.sraoss.co.jp
>
> > Hi,
> >
> > Thanks for the patch. I'll add it to the build of our pgpool containers.
> >
> > Since I added the previous patch, I have the feeling that the number of
> > segmentation faults has been reduced. It's hard to say, though, whether
> > the problem is really fixed, because some of these crashes happen very
> > infrequently.
> >
> > Yesterday, we did see 2 crashes, both with the same backtrace, which I
> > haven't seen before:
> > #0  pool_do_auth (frontend=0x55bc42c93788, cp=0x7ff7ca34f6b8) at auth/pool_auth.c:349
> > #1  0x000055bc41856ea1 in connect_backend (sp=0x55bc42c956d8, frontend=0x55bc42c93788) at protocol/child.c:1102
> > #2  0x000055bc41859042 in get_backend_connection (frontend=0x55bc42c93788) at protocol/child.c:2111
> > #3  0x000055bc41854fd5 in do_child (fds=0x55bc42c49320) at protocol/child.c:416
> > #4  0x000055bc4181aa4c in fork_a_child (fds=0x55bc42c49320, id=5) at main/pgpool_main.c:863
> > #5  0x000055bc418256f7 in exec_child_restart (failover_context=0x7ffdf033c6b0, node_id=0) at main/pgpool_main.c:4684
> > #6  0x000055bc4181d1dc in failover () at main/pgpool_main.c:1739
> > #7  0x000055bc4181c79e in sigusr1_interrupt_processor () at main/pgpool_main.c:1507
> > #8  0x000055bc418263c3 in check_requests () at main/pgpool_main.c:4934
> > #9  0x000055bc4181a32b in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
> > #10 0x000055bc418179e6 in main (argc=2, argv=0x7ffdf0349878) at main/main.c:365
> >
> > Looking at the backtrace, this happens right after the place where the
> > first patch made a change. The slots in 'cp' are {0x0, 0x7ff7ca3479b8,
> > 0x7ff7ca347988, 0x0 <repeats 125 times>}. I've attached the logs and the
> > coredump. This crash happens during a different test (in both cases): it
> > crashes in the test for our normal upgrade procedure, while the cluster
> > is in a fully consistent configuration. All 3 nodes are up and healthy;
> > node 0 is both watchdog leader and running the primary database. Because
> > node 0 is the first to be upgraded, we need to initiate a failover. For
> > this, we first restart pgpool to force it to drop its leader status and
> > wait until a new leader has been elected. We then stop the database to
> > trigger the failover. The crash seems to happen when the database starts
> > up again. In the attached logs, the crash was at node 0, but in the other
> > failure, the crash was at the exact same moment in the test, but at node
> > 2. I'm also seeing some suspicious backend status reports from
> > pcp_node_info at that time:
> >
> > Node 0 status: 172.29.30.1 5432 2 0.333333 up up standby standby 0 none none 2024-06-13 12:05:00 at Thu Jun 13 12:05:00 CEST 2024
> > Node 0 status: 172.29.30.1 5432 3 0.333333 down up standby standby 130337 streaming async 2024-06-13 12:05:01 at Thu Jun 13 12:05:01 CEST 2024
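> >
> > To illustrate why this would crash (a hypothetical reduction on my
> > part; I don't know the exact statement at auth/pool_auth.c:349): if the
> > main connection is fetched as cp->slots[Req_info->main_node_id], then
> > with main_node_id == 0 and slot 0 being NULL, any read through that
> > slot segfaults:
> >
> >     /* hypothetical sketch, not the actual pool_do_auth() code */
> >     POOL_CONNECTION_POOL_SLOT *slot = cp->slots[Req_info->main_node_id];
> >     /* slot is cp->slots[0] == NULL here, so this dereference segfaults */
> >     StartupPacket *sp = slot->sp;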
> >
> > PS: I had to resend the mail with the attachments compressed with bzip2
> > because it was hitting the size limit.
> >
> > Best regards,
> > Emond
> >
> > On Fri, Jun 14, 2024 at 02:28, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> >
> >> > For the second crash (in pool_connection_pool.c:326), all slots are
> >> > null:
> >> > (gdb) info locals
> >> > i = 0
> >> > freed = 0
> >> > closetime = 0
> >> > oldestp = 0x7f474f36e168
> >> > ret = 0x0
> >> > info = 0x55b511283469 <opt_sort>
> >> > p = 0x7f474f36e168
> >> >
> >> > (gdb) p *p
> >> > $7 = {
> >> >   info = 0x7f47470d5c08,
> >> >   slots = {0x0 <repeats 128 times>}
> >> > }
> >> >
> >> > (gdb) p p->slots[0]
> >> > $10 = (POOL_CONNECTION_POOL_SLOT *) 0x0
> >> > (gdb) p p->slots[1]
> >> > $11 = (POOL_CONNECTION_POOL_SLOT *) 0x0
> >> > (gdb) p p->slots[2]
> >> > $12 = (POOL_CONNECTION_POOL_SLOT *) 0x0
> >>
> >> A possible explanation is that Req_info->main_node_id, which is the
> >> smallest node id among the live backends, was -1. From the log file,
> >> just before pid 30 segfaults:
> >>
> >> 2024-05-31T07:16:45.908900+02:00 2024-05-31 07:16:45: pid 1: LOG:  backend:0 is set to down status
> >> 2024-05-31T07:16:45.908939+02:00 2024-05-31 07:16:45: pid 1: DETAIL:  backend:0 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
> >> 2024-05-31T07:16:45.908987+02:00 2024-05-31 07:16:45: pid 1: LOG:  backend:1 is set to down status
> >> 2024-05-31T07:16:45.909017+02:00 2024-05-31 07:16:45: pid 1: DETAIL:  backend:1 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
> >> 2024-05-31T07:16:45.909044+02:00 2024-05-31 07:16:45: pid 1: LOG:  backend:2 is set to down status
> >> 2024-05-31T07:16:45.909071+02:00 2024-05-31 07:16:45: pid 1: DETAIL:  backend:2 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
> >>
> >> That means all backends were down, and Req_info->main_node_id should
> >> have been -1 at the time. Usually, if all backends are down, pgpool
> >> will not accept connections from a frontend. I think that when the
> >> check was performed, not all backends were down yet; I guess they all
> >> went down right after it. If so, at line 270 of pool_connection_pool.c:
> >>
> >> 270                 if (MAIN_CONNECTION(p) == NULL)
> >>
> >> MAIN_CONNECTION (that is, p->slots[-1], because main_node_id == -1)
> >> looked into garbage, and the above condition was not met. The code then
> >> proceeded to line 326:
> >>
> >>  326                         pool_free_startup_packet(CONNECTION_SLOT(p, i)->sp);
> >>
> >> Here CONNECTION_SLOT(p, i) is NULL, and dereferencing it segfaulted.
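> >>
> >> To make the mechanism concrete, here is a minimal self-contained
> >> reduction (an illustration of the failing pattern, not the actual
> >> pool_connection_pool.c code):
> >>
> >>     #include <stddef.h>
> >>
> >>     typedef struct { void *sp; } SLOT;
> >>
> >>     /* mimics the pattern: guard on slots[main_node_id], then
> >>      * dereference a concrete slot index */
> >>     static void
> >>     discard_pool_entry(SLOT **slots, int main_node_id)
> >>     {
> >>         /* with main_node_id == -1 this reads slots[-1]: memory before
> >>          * the array, which is often non-NULL garbage, so the guard
> >>          * does not fire even though every real slot is NULL */
> >>         if (slots[main_node_id] == NULL)
> >>             return;
> >>
> >>         /* execution continues and dereferences a NULL slot: segfault */
> >>         void *sp = slots[0]->sp;
> >>         (void) sp;
> >>     }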
> >>
> >> I think using the MAIN_CONNECTION macro to take care of the connection
> >> pools is incorrect: which backends are alive in each connection pool
> >> slot can differ from the backends that are currently alive. So I
> >> invented a function, in_use_backend_id(), for searching the live
> >> backend ids in a connection slot. Attached is a patch in this
> >> direction. I hope the patch fixes the second segfault case.
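> >>
> >> For readers of the archive, a sketch of what such a function might
> >> look like (my assumption; the attached patch is authoritative): scan
> >> the pool entry's own slots instead of trusting Req_info->main_node_id:
> >>
> >>     /* hypothetical sketch: smallest slot index in this pool entry
> >>      * that actually holds a connection, or -1 if none */
> >>     static int
> >>     in_use_backend_id(POOL_CONNECTION_POOL *p)
> >>     {
> >>         int i;
> >>
> >>         for (i = 0; i < MAX_NUM_BACKENDS; i++)
> >>             if (p->slots[i] != NULL)
> >>                 return i;
> >>         return -1;
> >>     }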
> >>
> >> > I hope this helps.
> >> >
> >> > Best regards,
> >> > Emond
> >> >
> >> > On Tue, Jun 4, 2024 at 04:56, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> >> >
> >> >> > No worries. I hope you had a good trip. Last night we triggered
> >> >> > the last crash again. Is there anything we can do to make it
> >> >> > easier for you to find the cause?
> >> >>
> >> >> It would be helpful if you could share some variable values from
> >> >> the core file. Since I don't have the pgpool load module from the
> >> >> time you got the core, I cannot inspect the variables using the core
> >> >> you provided.
> >> >>
> >> >> > #0  connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at protocol/child.c:1076
> >> >> > #1  0x000055803ce3d02a in get_backend_connection (frontend=0x55803eb08768) at protocol/child.c:2112
> >> >> > #2  0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at protocol/child.c:416
> >> >> > #3  0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at main/pgpool_main.c:863
> >> >> > #4  0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
> >> >> > #5  0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at main/main.c:365
> >> >>
> >> >> Around protocol/child.c:1076:
> >> >>
> >> >>                                 /* set DB node id */
> >> >>                                 pool_set_db_node_id(CONNECTION(backend, i), i);
> >> >>
> >> >> I want to see the values in the "backend" struct. Since the
> >> >> CONNECTION macro is used here, you have to do something like the
> >> >> following in a gdb session:
> >> >>
> >> >> p *backend->slots[0]
> >> >> p *backend->slots[1]
> >> >> p *backend->slots[2]
> >> >>
> >> >> Best regards,
> >> >> --
> >> >> Tatsuo Ishii
> >> >> SRA OSS LLC
> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> Japanese: http://www.sraoss.co.jp
> >> >>
> >>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: container.log.bz2
Type: application/x-bzip2
Size: 558987 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240620/81b699cb/attachment.bz2>

