[pgpool-general: 9142] Re: Another segmentation fault
Tatsuo Ishii
ishii at sraoss.co.jp
Thu Jun 20 21:48:32 JST 2024
> Ah, no problem. Here it is.
Thanks. I will run gdb against the core files using the load module.
> Ps. I just noticed that the commit to apply your latest patch
> (fix_segfault2.patch) never got pushed to our repository. So we are using
> 4.5.2 with only v1-0001-Fix-segfault-in-a-child-process.patch applied. I've
> just pushed the second patch and I'll keep you updated on any changes in
> the builds.
Ok.
> Best regards,
> Emond
>
> On Thu, Jun 20, 2024 at 13:19, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>
>> Hi,
>>
>> I meant the pgpool executable file. I just need
>> /dest/usr/sbin/pgpool.
>>
>> > Hi,
>> >
>> > Probably, if you could explain what a 'load module' is and how to get
>> > it from a build. I run gdb inside a docker container that's also used
>> > to build pgpool from source. I start it like this:
>> > gdb /dest/usr/sbin/pgpool <path to coredump>
>> >
>> > Would it be enough if I create a tarball of the entire /dest directory
>> > (it's the destination for make install)?
>> >
>> > Best regards,
>> > Emond
>> >
>> > On Thu, Jun 20, 2024 at 11:46, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >
>> >> Thanks for the info. Is it possible for you to share the pgpool load
>> >> module?
>> >> With it, I could inspect variable values in the core myself using gdb.
>> >>
>> >> > Hi,
>> >> >
>> >> > I've managed to get some more logging for 2 crashes. The first, in
>> >> > build 14132 (with coredump
>> >> > core.pgpool.208.d678e2ac33af434981c93683b6c42020.102976.1718650919000000.bz2):
>> >> > #0 pool_read_message_length (cp=0x7f53440bb548) at
>> >> > protocol/pool_proto_modules.c:4483
>> >> > #1 0x000056386aa1f530 in pool_do_auth (frontend=0x56386b6c4788,
>> >> > cp=0x7f53440bb548) at auth/pool_auth.c:384
>> >> > #2 0x000056386aa33ea1 in connect_backend (sp=0x56386b6c66d8,
>> >> > frontend=0x56386b6c4788) at protocol/child.c:1102
>> >> > #3 0x000056386aa36042 in get_backend_connection (frontend=0x56386b6c4788)
>> >> > at protocol/child.c:2111
>> >> > #4 0x000056386aa31fd5 in do_child (fds=0x56386b67a320) at
>> >> > protocol/child.c:416
>> >> > #5 0x000056386a9f7a4c in fork_a_child (fds=0x56386b67a320, id=30) at
>> >> > main/pgpool_main.c:863
>> >> > #6 0x000056386a9fa9ed in reaper () at main/pgpool_main.c:2004
>> >> > #7 0x000056386aa0340e in check_requests () at main/pgpool_main.c:4953
>> >> > #8 0x000056386a9f732b in PgpoolMain (discard_status=0 '\000',
>> >> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
>> >> > #9 0x000056386a9f49e6 in main (argc=2, argv=0x7ffeeb474048) at
>> >> > main/main.c:365
>> >> >
>> >> > The second, in build 14136:
>> >> > It seems the location of the crash has moved a bit since my previous
>> >> > report, due to the higher log level. It now crashes in a log
>> >> > statement. The coredump that goes with this crash is:
>> >> > core.pgpool.208.6a941167afab4df68e5f050ef29c7d1b.5129.1718739212000000.bz2.
>> >> > #0 0x0000561e716f1fa7 in pool_create_cp () at
>> >> > protocol/pool_connection_pool.c:293
>> >> > #1 0x0000561e716ddbb5 in connect_backend (sp=0x561e726d06b8,
>> >> > frontend=0x561e726ce768) at protocol/child.c:1051
>> >> > #2 0x0000561e716e0042 in get_backend_connection (frontend=0x561e726ce768)
>> >> > at protocol/child.c:2111
>> >> > #3 0x0000561e716dbfd5 in do_child (fds=0x561e72684320) at
>> >> > protocol/child.c:416
>> >> > #4 0x0000561e716a1a4c in fork_a_child (fds=0x561e72684320, id=1) at
>> >> > main/pgpool_main.c:863
>> >> > #5 0x0000561e716a0e30 in PgpoolMain (discard_status=0 '\000',
>> >> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>> >> > #6 0x0000561e7169e9e6 in main (argc=2, argv=0x7ffcf8eef198) at
>> >> > main/main.c:365
>> >> >
>> >> > Best regards,
>> >> > Emond
>> >> >
>> >> > On Thu, Jun 20, 2024 at 09:05, Emond Papegaaij <
>> >> > emond.papegaaij at gmail.com> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> I'm very sorry. Since we've increased the logging of the container,
>> >> >> we sometimes seem to get some gaps in the logs during reboots. I
>> >> >> guess the virtual machine is having trouble getting all logs out to
>> >> >> disk before it reboots. The attached logfile does contain the logs
>> >> >> for the segmentation fault for this crash.
>> >> >>
>> >> >> Best regards,
>> >> >> Emond
>> >> >>
>> >> >> On Wed, Jun 19, 2024 at 14:40, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >> >>
>> >> >>> Sorry for the delay. I looked into pgpool-upgrade-test-crash.log.bz2,
>> >> >>> but failed to find the "segmentation fault" string.
>> >> >>>
>> >> >>> Best regards,
>> >> >>> --
>> >> >>> Tatsuo Ishii
>> >> >>> SRA OSS LLC
>> >> >>> English: http://www.sraoss.co.jp/index_en/
>> >> >>> Japanese: http://www.sraoss.co.jp
>> >> >>>
>> >> >>> > Hi,
>> >> >>> >
>> >> >>> > Thanks for the patch. I'll add it to the build of our pgpool
>> >> >>> > containers.
>> >> >>> >
>> >> >>> > Since I've added the previous patch, I've got the feeling that the
>> >> >>> > number of segmentation faults has been reduced. It's hard to say,
>> >> >>> > though, whether the problem is really fixed, because some of these
>> >> >>> > crashes happen very infrequently.
>> >> >>> >
>> >> >>> > Yesterday, we did see 2 crashes, both with the same backtrace,
>> >> >>> > which I haven't seen before:
>> >> >>> > #0 pool_do_auth (frontend=0x55bc42c93788, cp=0x7ff7ca34f6b8) at
>> >> >>> > auth/pool_auth.c:349
>> >> >>> > #1 0x000055bc41856ea1 in connect_backend (sp=0x55bc42c956d8,
>> >> >>> > frontend=0x55bc42c93788) at protocol/child.c:1102
>> >> >>> > #2 0x000055bc41859042 in get_backend_connection (frontend=0x55bc42c93788)
>> >> >>> > at protocol/child.c:2111
>> >> >>> > #3 0x000055bc41854fd5 in do_child (fds=0x55bc42c49320) at
>> >> >>> > protocol/child.c:416
>> >> >>> > #4 0x000055bc4181aa4c in fork_a_child (fds=0x55bc42c49320, id=5) at
>> >> >>> > main/pgpool_main.c:863
>> >> >>> > #5 0x000055bc418256f7 in exec_child_restart
>> >> >>> > (failover_context=0x7ffdf033c6b0, node_id=0) at main/pgpool_main.c:4684
>> >> >>> > #6 0x000055bc4181d1dc in failover () at main/pgpool_main.c:1739
>> >> >>> > #7 0x000055bc4181c79e in sigusr1_interrupt_processor () at
>> >> >>> > main/pgpool_main.c:1507
>> >> >>> > #8 0x000055bc418263c3 in check_requests () at main/pgpool_main.c:4934
>> >> >>> > #9 0x000055bc4181a32b in PgpoolMain (discard_status=0 '\000',
>> >> >>> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
>> >> >>> > #10 0x000055bc418179e6 in main (argc=2, argv=0x7ffdf0349878) at
>> >> >>> > main/main.c:365
>> >> >>> >
>> >> >>> > Looking at the backtrace, this happens right after the place where
>> >> >>> > the first patch made a change. The slots in 'cp' are {0x0,
>> >> >>> > 0x7ff7ca3479b8, 0x7ff7ca347988, 0x0 <repeats 125 times>}. I've
>> >> >>> > attached the logs and the coredump. This crash happens during a
>> >> >>> > different test (in both cases): it crashes in the test for our
>> >> >>> > normal upgrade procedure. The crash happens when the cluster is in
>> >> >>> > a fully consistent configuration: all 3 nodes are up and healthy,
>> >> >>> > and node 0 is both watchdog leader and running the primary
>> >> >>> > database. Because node 0 is the first to be upgraded, we need to
>> >> >>> > initiate a failover. For this, we first restart pgpool to force it
>> >> >>> > to drop its leader status and wait until a new leader has been
>> >> >>> > elected. We then stop the database to trigger the failover. The
>> >> >>> > crash seems to happen when the database starts up again. In the
>> >> >>> > attached logs, the crash was at node 0, but in the other failure,
>> >> >>> > the crash was at the exact same moment in the test, but at node 2.
>> >> >>> > I'm also seeing some suspicious backend status reports from
>> >> >>> > pcp_node_info at that time:
>> >> >>> >
>> >> >>> > Node 0 status: 172.29.30.1 5432 2 0.333333 up up standby standby 0
>> >> >>> > none none 2024-06-13 12:05:00 at Thu Jun 13 12:05:00 CEST 2024
>> >> >>> > Node 0 status: 172.29.30.1 5432 3 0.333333 down up standby standby
>> >> >>> > 130337 streaming async 2024-06-13 12:05:01 at Thu Jun 13 12:05:01 CEST
>> >> >>> >
>> >> >>> > Ps. I had to resend the mail with the attachments bzip2ed because
>> >> >>> > it was hitting the size limit.
>> >> >>> >
>> >> >>> > Best regards,
>> >> >>> > Emond
>> >> >>> >
>> >> >>> > On Fri, Jun 14, 2024 at 02:28, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >> >>> >
>> >> >>> >> > For the second crash (in pool_connection_pool.c:326), all
>> >> >>> >> > slots are null:
>> >> >>> >> > (gdb) info locals
>> >> >>> >> > i = 0
>> >> >>> >> > freed = 0
>> >> >>> >> > closetime = 0
>> >> >>> >> > oldestp = 0x7f474f36e168
>> >> >>> >> > ret = 0x0
>> >> >>> >> > info = 0x55b511283469 <opt_sort>
>> >> >>> >> > p = 0x7f474f36e168
>> >> >>> >> >
>> >> >>> >> > (gdb) p *p
>> >> >>> >> > $7 = {
>> >> >>> >> > info = 0x7f47470d5c08,
>> >> >>> >> > slots = {0x0 <repeats 128 times>}
>> >> >>> >> > }
>> >> >>> >> >
>> >> >>> >> > (gdb) p p->slots[0]
>> >> >>> >> > $10 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> >> >>> >> > (gdb) p p->slots[1]
>> >> >>> >> > $11 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> >> >>> >> > (gdb) p p->slots[2]
>> >> >>> >> > $12 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> >> >>> >>
>> >> >>> >> A possible explanation is that Req_info->main_node_id, which is
>> >> >>> >> the smallest node id among the backends that are alive, was -1.
>> >> >>> >> From the log file just before pid 30 segfaults:
>> >> >>> >>
>> >> >>> >> 2024-05-31T07:16:45.908900+02:00 2024-05-31 07:16:45: pid 1: LOG:
>> >> >>> >> backend:0 is set to down status
>> >> >>> >> 2024-05-31T07:16:45.908939+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
>> >> >>> >> backend:0 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
>> >> >>> >> 2024-05-31T07:16:45.908987+02:00 2024-05-31 07:16:45: pid 1: LOG:
>> >> >>> >> backend:1 is set to down status
>> >> >>> >> 2024-05-31T07:16:45.909017+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
>> >> >>> >> backend:1 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
>> >> >>> >> 2024-05-31T07:16:45.909044+02:00 2024-05-31 07:16:45: pid 1: LOG:
>> >> >>> >> backend:2 is set to down status
>> >> >>> >> 2024-05-31T07:16:45.909071+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
>> >> >>> >> backend:2 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5
>> >> >>> >>
>> >> >>> >> That means all backends were down, and Req_info->main_node_id
>> >> >>> >> should have been -1 at the time. Usually, if all backends are
>> >> >>> >> down, pgpool will not accept connections from the frontend. I
>> >> >>> >> think that when the check was performed, not all backends were
>> >> >>> >> down yet; I guess all of them went down right after that. If so,
>> >> >>> >> at line 270 of pool_connection_pool.c:
>> >> >>> >>
>> >> >>> >> 270 if (MAIN_CONNECTION(p) == NULL)
>> >> >>> >>
>> >> >>> >> MAIN_CONNECTION (that is, p->slots[-1], because main_node_id ==
>> >> >>> >> -1) looked into garbage, and the above condition was not met. The
>> >> >>> >> code then proceeded to line 326:
>> >> >>> >>
>> >> >>> >> 326 pool_free_startup_packet(CONNECTION_SLOT(p, i)->sp);
>> >> >>> >>
>> >> >>> >> Here CONNECTION_SLOT(p, i) is NULL, and it segfaulted.
>> >> >>> >>
>> >> >>> >> I think using the MAIN_CONNECTION macro to manage connection
>> >> >>> >> pools is incorrect. Which backend is alive in each connection
>> >> >>> >> pool slot could differ from the currently alive backends. So I
>> >> >>> >> invented a function, in_use_backend_id(), for searching the live
>> >> >>> >> backend ids in a connection slot. Attached is a patch in this
>> >> >>> >> direction. I hope the patch fixes the second segfault case.
>> >> >>> >>
>> >> >>> >> > I hope this helps.
>> >> >>> >> >
>> >> >>> >> > Best regards,
>> >> >>> >> > Emond
>> >> >>> >> >
>> >> >>> >> > On Tue, Jun 4, 2024 at 04:56, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >> >>> >> >
>> >> >>> >> >> > No worries. I hope you had a good trip. Last night we
>> >> >>> >> >> > triggered the last crash again. Is there anything we can do
>> >> >>> >> >> > to make it easier for you to find the cause?
>> >> >>> >> >>
>> >> >>> >> >> It would be helpful if you could share some variable values
>> >> >>> >> >> in the core file. Since I don't have the pgpool load module
>> >> >>> >> >> from when you got the core, I cannot inspect the variables
>> >> >>> >> >> using the core you provided.
>> >> >>> >> >>
>> >> >>> >> >> > #0 connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at
>> >> >>> >> >> > protocol/child.c:1076
>> >> >>> >> >> > #1 0x000055803ce3d02a in get_backend_connection (frontend=0x55803eb08768)
>> >> >>> >> >> > at protocol/child.c:2112
>> >> >>> >> >> > #2 0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at
>> >> >>> >> >> > protocol/child.c:416
>> >> >>> >> >> > #3 0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at
>> >> >>> >> >> > main/pgpool_main.c:863
>> >> >>> >> >> > #4 0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000',
>> >> >>> >> >> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>> >> >>> >> >> > #5 0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at
>> >> >>> >> >> > main/main.c:365
>> >> >>> >> >>
>> >> >>> >> >> Around protocol/child.c:1076:
>> >> >>> >> >>
>> >> >>> >> >> /* set DB node id */
>> >> >>> >> >> pool_set_db_node_id(CONNECTION(backend, i), i);
>> >> >>> >> >>
>> >> >>> >> >> I want to see the values in the "backend" struct. Since the
>> >> >>> >> >> CONNECTION macro is used here, you have to do something like
>> >> >>> >> >> the following in a gdb session:
>> >> >>> >> >>
>> >> >>> >> >> p *backend->slots[0]
>> >> >>> >> >> p *backend->slots[1]
>> >> >>> >> >> p *backend->slots[2]
>> >> >>> >> >>
>> >> >>> >> >> Best regards,
>> >> >>> >> >> --
>> >> >>> >> >> Tatsuo Ishii
>> >> >>> >> >> SRA OSS LLC
>> >> >>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >>> >> >> Japanese: http://www.sraoss.co.jp
>> >> >>> >> >>
>> >> >>> >>
>> >> >>>
>> >> >>
>> >>
>>