[pgpool-general: 9141] Re: Another segmentation fault

Emond Papegaaij emond.papegaaij at gmail.com
Thu Jun 20 20:53:05 JST 2024


Ah, no problem. Here it is.

PS: I just noticed that the commit applying your latest patch
(fix_segfault2.patch) never got pushed to our repository, so we are running
4.5.2 with only v1-0001-Fix-segfault-in-a-child-process.patch applied. I've
just pushed the second patch and will keep you updated on any changes in
the builds.

Best regards,
Emond

On Thu 20 Jun 2024 at 13:19, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> Hi,
>
> I meant the pgpool executable file.  I just need
> /dest/usr/sbin/pgpool.
>
> > Hi,
> >
> > Probably, if you could explain what a 'load module' is, and how to get it
> > from a build. I run gdb inside a docker container that's also used to
> > build pgpool from source. I start it like this:
> > gdb /dest/usr/sbin/pgpool <path to coredump>
> >
> > Would it be enough if I create a tarball of the entire /dest directory
> > (it's the destination for make install)?
> >
> > Best regards,
> > Emond
> >
> > On Thu 20 Jun 2024 at 11:46, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> >
> >> Thanks for the info. Is it possible for you to share the pgpool load
> >> module?
> >> With it, I could inspect variable values in the core myself using gdb.
> >>
> >> > Hi,
> >> >
> >> > I've managed to get some more logging for 2 crashes. The first, in build
> >> > 14132 (with coredump
> >> > core.pgpool.208.d678e2ac33af434981c93683b6c42020.102976.1718650919000000.bz2):
> >> > #0  pool_read_message_length (cp=0x7f53440bb548) at protocol/pool_proto_modules.c:4483
> >> > #1  0x000056386aa1f530 in pool_do_auth (frontend=0x56386b6c4788, cp=0x7f53440bb548) at auth/pool_auth.c:384
> >> > #2  0x000056386aa33ea1 in connect_backend (sp=0x56386b6c66d8, frontend=0x56386b6c4788) at protocol/child.c:1102
> >> > #3  0x000056386aa36042 in get_backend_connection (frontend=0x56386b6c4788) at protocol/child.c:2111
> >> > #4  0x000056386aa31fd5 in do_child (fds=0x56386b67a320) at protocol/child.c:416
> >> > #5  0x000056386a9f7a4c in fork_a_child (fds=0x56386b67a320, id=30) at main/pgpool_main.c:863
> >> > #6  0x000056386a9fa9ed in reaper () at main/pgpool_main.c:2004
> >> > #7  0x000056386aa0340e in check_requests () at main/pgpool_main.c:4953
> >> > #8  0x000056386a9f732b in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
> >> > #9  0x000056386a9f49e6 in main (argc=2, argv=0x7ffeeb474048) at main/main.c:365
> >> >
> >> > The second, in build 14136:
> >> > It seems the location of the crash has moved a bit compared to my
> >> > previous report, due to the higher log level; it now crashes in a log
> >> > statement. The coredump that goes with this crash is
> >> > core.pgpool.208.6a941167afab4df68e5f050ef29c7d1b.5129.1718739212000000.bz2.
> >> > #0  0x0000561e716f1fa7 in pool_create_cp () at protocol/pool_connection_pool.c:293
> >> > #1  0x0000561e716ddbb5 in connect_backend (sp=0x561e726d06b8, frontend=0x561e726ce768) at protocol/child.c:1051
> >> > #2  0x0000561e716e0042 in get_backend_connection (frontend=0x561e726ce768) at protocol/child.c:2111
> >> > #3  0x0000561e716dbfd5 in do_child (fds=0x561e72684320) at protocol/child.c:416
> >> > #4  0x0000561e716a1a4c in fork_a_child (fds=0x561e72684320, id=1) at main/pgpool_main.c:863
> >> > #5  0x0000561e716a0e30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
> >> > #6  0x0000561e7169e9e6 in main (argc=2, argv=0x7ffcf8eef198) at main/main.c:365
> >> >
> >> > Best regards,
> >> > Emond
> >> >
> >> > On Thu 20 Jun 2024 at 09:05, Emond Papegaaij
> >> > <emond.papegaaij at gmail.com> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I'm very sorry. Since we've increased the logging of the container, we
> >> >> sometimes seem to get some gaps in the logs during reboots. I guess the
> >> >> virtual machine is having trouble getting all logs out to disk before
> >> >> it reboots. The attached logfile does contain the logs for the
> >> >> segmentation fault in this crash.
> >> >>
> >> >> Best regards,
> >> >> Emond
> >> >>
> >> >> On Wed 19 Jun 2024 at 14:40, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> >> >>
> >> >>> Sorry for the delay. I looked into pgpool-upgrade-test-crash.log.bz2,
> >> >>> but failed to find the "segmentation fault" string.
> >> >>>
> >> >>> Best regards,
> >> >>> --
> >> >>> Tatsuo Ishii
> >> >>> SRA OSS LLC
> >> >>> English: http://www.sraoss.co.jp/index_en/
> >> >>> Japanese: http://www.sraoss.co.jp
> >> >>>
> >> >>> > Hi,
> >> >>> >
> >> >>> > Thanks for the patch. I'll add it to the build of our pgpool
> >> >>> > containers.
> >> >>> >
> >> >>> > Since I added the previous patch, I have the feeling that the number
> >> >>> > of segmentation faults has been reduced. It's hard to say, though,
> >> >>> > whether the problem is really fixed, because some of these crashes
> >> >>> > happen very infrequently.
> >> >>> >
> >> >>> > Yesterday, we did see 2 crashes, both with the same backtrace, which
> >> >>> > I haven't seen before:
> >> >>> > #0  pool_do_auth (frontend=0x55bc42c93788, cp=0x7ff7ca34f6b8) at auth/pool_auth.c:349
> >> >>> > #1  0x000055bc41856ea1 in connect_backend (sp=0x55bc42c956d8, frontend=0x55bc42c93788) at protocol/child.c:1102
> >> >>> > #2  0x000055bc41859042 in get_backend_connection (frontend=0x55bc42c93788) at protocol/child.c:2111
> >> >>> > #3  0x000055bc41854fd5 in do_child (fds=0x55bc42c49320) at protocol/child.c:416
> >> >>> > #4  0x000055bc4181aa4c in fork_a_child (fds=0x55bc42c49320, id=5) at main/pgpool_main.c:863
> >> >>> > #5  0x000055bc418256f7 in exec_child_restart (failover_context=0x7ffdf033c6b0, node_id=0) at main/pgpool_main.c:4684
> >> >>> > #6  0x000055bc4181d1dc in failover () at main/pgpool_main.c:1739
> >> >>> > #7  0x000055bc4181c79e in sigusr1_interrupt_processor () at main/pgpool_main.c:1507
> >> >>> > #8  0x000055bc418263c3 in check_requests () at main/pgpool_main.c:4934
> >> >>> > #9  0x000055bc4181a32b in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
> >> >>> > #10 0x000055bc418179e6 in main (argc=2, argv=0x7ffdf0349878) at main/main.c:365
> >> >>> >
> >> >>> > Looking at the backtrace, this happens right after the place where
> >> >>> > the first patch made a change. The slots in 'cp' are {0x0,
> >> >>> > 0x7ff7ca3479b8, 0x7ff7ca347988, 0x0 <repeats 125 times>}. I've
> >> >>> > attached the logs and the coredump. This crash happened during a
> >> >>> > different test (in both cases): it crashes in the test for our
> >> >>> > normal upgrade procedure. The crash happens when the cluster is in a
> >> >>> > fully consistent configuration: all 3 nodes are up and healthy, and
> >> >>> > node 0 is both watchdog leader and running the primary database.
> >> >>> > Because node 0 is the first to be upgraded, we need to initiate a
> >> >>> > failover. For this, we first restart pgpool to force it to drop its
> >> >>> > leader status and wait until a new leader has been elected. We then
> >> >>> > stop the database to trigger the failover. The crash seems to happen
> >> >>> > when the database has started up again. In the attached logs, the
> >> >>> > crash was at node 0, but in the other failure the crash happened at
> >> >>> > the exact same moment in the test, but at node 2. I'm also seeing
> >> >>> > some suspicious backend status reports from pcp_node_info at that
> >> >>> > time:
> >> >>> >
> >> >>> > Node 0 status: 172.29.30.1 5432 2 0.333333 up up standby standby 0 none none 2024-06-13 12:05:00 at Thu Jun 13 12:05:00 CEST 2024
> >> >>> > Node 0 status: 172.29.30.1 5432 3 0.333333 down up standby standby 130337 streaming async 2024-06-13 12:05:01 at Thu Jun 13 12:05:01 CEST
> >> >>> >
> >> >>> > PS: I had to resend the mail with the attachments bzip2-compressed
> >> >>> > because it was hitting the size limit.
> >> >>> >
> >> >>> > Best regards,
> >> >>> > Emond
> >> >>> >
> >> >>> > On Fri 14 Jun 2024 at 02:28, Tatsuo Ishii <ishii at sraoss.co.jp>
> >> >>> > wrote:
> >> >>> >
> >> >>> >> > For the second crash (in pool_connection_pool.c:326), all slots
> >> >>> >> > are null:
> >> >>> >> > (gdb) info locals
> >> >>> >> > i = 0
> >> >>> >> > freed = 0
> >> >>> >> > closetime = 0
> >> >>> >> > oldestp = 0x7f474f36e168
> >> >>> >> > ret = 0x0
> >> >>> >> > info = 0x55b511283469 <opt_sort>
> >> >>> >> > p = 0x7f474f36e168
> >> >>> >> >
> >> >>> >> > (gdb) p *p
> >> >>> >> > $7 = {
> >> >>> >> >   info = 0x7f47470d5c08,
> >> >>> >> >   slots = {0x0 <repeats 128 times>}
> >> >>> >> > }
> >> >>> >> >
> >> >>> >> > (gdb) p p->slots[0]
> >> >>> >> > $10 = (POOL_CONNECTION_POOL_SLOT *) 0x0
> >> >>> >> > (gdb) p p->slots[1]
> >> >>> >> > $11 = (POOL_CONNECTION_POOL_SLOT *) 0x0
> >> >>> >> > (gdb) p p->slots[2]
> >> >>> >> > $12 = (POOL_CONNECTION_POOL_SLOT *) 0x0
> >> >>> >>
> >> >>> >> A possible explanation is that Req_info->main_node_id, which is the
> >> >>> >> smallest node id among the backends that are alive, was -1. From the
> >> >>> >> log file just before pid 30 segfaults:
> >> >>> >>
> >> >>> >> 2024-05-31T07:16:45.908900+02:00 2024-05-31 07:16:45: pid 1: LOG:
> >> >>> >> backend:0 is set to down status
> >> >>> >> 2024-05-31T07:16:45.908939+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
> >> >>> >> backend:0 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
> >> >>> >> 2024-05-31T07:16:45.908987+02:00 2024-05-31 07:16:45: pid 1: LOG:
> >> >>> >> backend:1 is set to down status
> >> >>> >> 2024-05-31T07:16:45.909017+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
> >> >>> >> backend:1 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
> >> >>> >> 2024-05-31T07:16:45.909044+02:00 2024-05-31 07:16:45: pid 1: LOG:
> >> >>> >> backend:2 is set to down status
> >> >>> >> 2024-05-31T07:16:45.909071+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
> >> >>> >> backend:2 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
> >> >>> >>
> >> >>> >> That means all backends were down, and Req_info->main_node_id should
> >> >>> >> have been -1 at the time. Usually, if all backends are down, pgpool
> >> >>> >> will not accept connections from frontends. I think that when the
> >> >>> >> check was performed, not all backends were down yet; I guess all
> >> >>> >> backends went down right after that. If so, at line 270 of
> >> >>> >> pool_connection_pool.c:
> >> >>> >>
> >> >>> >> 270                 if (MAIN_CONNECTION(p) == NULL)
> >> >>> >>
> >> >>> >> MAIN_CONNECTION (that is, p->slots[-1] because main_node_id == -1)
> >> >>> >> looked into garbage and the above condition was not met. Then the
> >> >>> >> code proceeded to line 326:
> >> >>> >>
> >> >>> >> 326                 pool_free_startup_packet(CONNECTION_SLOT(p, i)->sp);
> >> >>> >>
> >> >>> >> Here CONNECTION_SLOT(p, i) was NULL and it segfaulted.
> >> >>> >>
> >> >>> >> I think using the MAIN_CONNECTION macro to take care of the
> >> >>> >> connection pools is incorrect. Which backend is alive in each
> >> >>> >> connection pool slot could be different from the currently alive
> >> >>> >> backends. So I invented a function, in_use_backend_id(), for
> >> >>> >> searching for live backend ids in a connection slot. Attached is a
> >> >>> >> patch in this direction. I hope the patch fixes the second segfault
> >> >>> >> case.
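
As an aside, below is a minimal, self-contained sketch of the failure mode
described above and of the in_use_backend_id() idea from the patch. The struct
layout, the array size and the macro body are simplified assumptions for
illustration only, not the actual pgpool sources:

/*
 * Sketch only: with main_node_id == -1, MAIN_CONNECTION(p) expands to
 * p->slots[-1], which reads memory outside the array, so a NULL check on it
 * can pass on garbage. Scanning the slots themselves avoids that.
 */
#include <stdio.h>

#define MAX_NUM_BACKENDS 128            /* illustrative size */

typedef struct
{
    void   *sp;                         /* placeholder for the startup packet */
} POOL_CONNECTION_POOL_SLOT;

typedef struct
{
    POOL_CONNECTION_POOL_SLOT *slots[MAX_NUM_BACKENDS];
} POOL_CONNECTION_POOL;

static int main_node_id = -1;           /* all backends reported down */

/* The problematic pattern: the index can be -1 */
#define MAIN_CONNECTION(p) ((p)->slots[main_node_id])

/* Return the smallest slot index that actually holds a connection, or -1. */
static int
in_use_backend_id(POOL_CONNECTION_POOL *p)
{
    for (int i = 0; i < MAX_NUM_BACKENDS; i++)
        if (p->slots[i] != NULL)
            return i;
    return -1;
}

int
main(void)
{
    POOL_CONNECTION_POOL pool = {{NULL}};

    /* Unsafe: MAIN_CONNECTION(&pool) would read slots[-1] here. */

    int node = in_use_backend_id(&pool);

    if (node < 0)
        printf("no live backend connection in this pool slot\n");
    else
        printf("first live backend: %d\n", node);
    return 0;
}
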
> >> >>> >>
> >> >>> >> > I hope this helps.
> >> >>> >> >
> >> >>> >> > Best regards,
> >> >>> >> > Emond
> >> >>> >> >
> >> >>> >> > On Tue 4 Jun 2024 at 04:56, Tatsuo Ishii <ishii at sraoss.co.jp>
> >> >>> >> > wrote:
> >> >>> >> >
> >> >>> >> >> > No worries. I hope you had a good trip. Last night we triggered
> >> >>> >> >> > the last crash again. Is there anything we can do to make it
> >> >>> >> >> > easier for you to find the cause?
> >> >>> >> >>
> >> >>> >> >> It would be helpful if you could share some variable values in
> >> >>> >> >> the core file. Since I don't have the pgpool load module from
> >> >>> >> >> when you got the core, I cannot inspect the variables using the
> >> >>> >> >> core you provided.
> >> >>> >> >>
> >> >>> >> >> > #0  connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at protocol/child.c:1076
> >> >>> >> >> > #1  0x000055803ce3d02a in get_backend_connection (frontend=0x55803eb08768) at protocol/child.c:2112
> >> >>> >> >> > #2  0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at protocol/child.c:416
> >> >>> >> >> > #3  0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at main/pgpool_main.c:863
> >> >>> >> >> > #4  0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
> >> >>> >> >> > #5  0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at main/main.c:365
> >> >>> >> >>
> >> >>> >> >> Around protocol/child.c:1076:
> >> >>> >> >>                                 /* set DB node id */
> >> >>> >> >>                                 pool_set_db_node_id(CONNECTION(backend, i), i);
> >> >>> >> >>
> >> >>> >> >> I want to see the values in the "backend" struct. Since the
> >> >>> >> >> CONNECTION macro is used here, you have to do something like the
> >> >>> >> >> following in a gdb session:
> >> >>> >> >>
> >> >>> >> >> p *backend->slots[0]
> >> >>> >> >> p *backend->slots[1]
> >> >>> >> >> p *backend->slots[2]
> >> >>> >> >>
> >> >>> >> >> Best regards,
> >> >>> >> >> --
> >> >>> >> >> Tatsuo Ishii
> >> >>> >> >> SRA OSS LLC
> >> >>> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >>> >> >> Japanese: http://www.sraoss.co.jp
> >> >>> >> >>
> >> >>> >>
> >> >>>
> >> >>
> >>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool.bz2
Type: application/x-bzip2
Size: 1960980 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240620/1a0f89cc/attachment.bz2>

