[pgpool-general: 9140] Re: Another segmentation fault
Tatsuo Ishii
ishii at sraoss.co.jp
Thu Jun 20 20:19:35 JST 2024
Hi,
I meant the pgpool executable file. I just need
/dest/usr/sbin/pgpool.
> Hi,
>
> Probably, if you could explain what a 'load module' is and how to get it
> from a build. I run gdb inside a Docker container that's also used to build
> pgpool from source. I start it like this:
> gdb /dest/usr/sbin/pgpool <path to coredump>
>
> Would it be enough if I create a tarball of the entire /dest directory (it's
> the destination for make install)?
>
> Best regards,
> Emond
>
> On Thu, Jun 20, 2024 at 11:46 Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>
>> Thanks for the info. Is it possible for you to share the pgpool load module?
>> With it, I could inspect variable values in the core myself using gdb.
>>
>> > Hi,
>> >
>> > I've managed to get some more logging for 2 crashes. The first, in build
>> > 14132 (with coredump
>> > core.pgpool.208.d678e2ac33af434981c93683b6c42020.102976.1718650919000000.bz2):
>> > #0 pool_read_message_length (cp=0x7f53440bb548) at protocol/pool_proto_modules.c:4483
>> > #1 0x000056386aa1f530 in pool_do_auth (frontend=0x56386b6c4788, cp=0x7f53440bb548) at auth/pool_auth.c:384
>> > #2 0x000056386aa33ea1 in connect_backend (sp=0x56386b6c66d8, frontend=0x56386b6c4788) at protocol/child.c:1102
>> > #3 0x000056386aa36042 in get_backend_connection (frontend=0x56386b6c4788) at protocol/child.c:2111
>> > #4 0x000056386aa31fd5 in do_child (fds=0x56386b67a320) at protocol/child.c:416
>> > #5 0x000056386a9f7a4c in fork_a_child (fds=0x56386b67a320, id=30) at main/pgpool_main.c:863
>> > #6 0x000056386a9fa9ed in reaper () at main/pgpool_main.c:2004
>> > #7 0x000056386aa0340e in check_requests () at main/pgpool_main.c:4953
>> > #8 0x000056386a9f732b in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
>> > #9 0x000056386a9f49e6 in main (argc=2, argv=0x7ffeeb474048) at main/main.c:365
>> >
>> > The second, in build 14136: it seems the location of the crash has moved a
>> > bit since my previous report, due to the higher log level; it now crashes
>> > in a log statement. The coredump that goes with this crash is
>> > core.pgpool.208.6a941167afab4df68e5f050ef29c7d1b.5129.1718739212000000.bz2.
>> > #0 0x0000561e716f1fa7 in pool_create_cp () at protocol/pool_connection_pool.c:293
>> > #1 0x0000561e716ddbb5 in connect_backend (sp=0x561e726d06b8, frontend=0x561e726ce768) at protocol/child.c:1051
>> > #2 0x0000561e716e0042 in get_backend_connection (frontend=0x561e726ce768) at protocol/child.c:2111
>> > #3 0x0000561e716dbfd5 in do_child (fds=0x561e72684320) at protocol/child.c:416
>> > #4 0x0000561e716a1a4c in fork_a_child (fds=0x561e72684320, id=1) at main/pgpool_main.c:863
>> > #5 0x0000561e716a0e30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>> > #6 0x0000561e7169e9e6 in main (argc=2, argv=0x7ffcf8eef198) at main/main.c:365
>> >
>> > Best regards,
>> > Emond
>> >
>> > On Thu, Jun 20, 2024 at 09:05 Emond Papegaaij <emond.papegaaij at gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> I'm very sorry. Since we've increased the logging of the container, we
>> >> sometimes seem to get gaps in the logs during reboots. I guess the
>> >> virtual machine is having trouble getting all logs out to disk before it
>> >> reboots. The attached logfile does contain the logs for this crash's
>> >> segmentation fault.
>> >>
>> >> Best regards,
>> >> Emond
>> >>
>> >> On Wed, Jun 19, 2024 at 14:40 Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >>
>> >>> Sorry for the delay. I looked into the pgpool-upgrade-test-crash.log.bz2,
>> >>> but failed to find the "segmentation fault" string.
>> >>>
>> >>> Best regards,
>> >>> --
>> >>> Tatsuo Ishii
>> >>> SRA OSS LLC
>> >>> English: http://www.sraoss.co.jp/index_en/
>> >>> Japanese: http://www.sraoss.co.jp
>> >>>
>> >>> > Hi,
>> >>> >
>> >>> > Thanks for the patch. I'll add it to the build of our pgpool containers.
>> >>> >
>> >>> > Since I added the previous patch, I have the feeling that the number of
>> >>> > segmentation faults has been reduced. It's hard to say, though, whether
>> >>> > the problem is really fixed, because some of these crashes happen very
>> >>> > infrequently.
>> >>> >
>> >>> > Yesterday, we did see 2 crashes, both with the same backtrace, which I
>> >>> > haven't seen before:
>> >>> > #0 pool_do_auth (frontend=0x55bc42c93788, cp=0x7ff7ca34f6b8) at auth/pool_auth.c:349
>> >>> > #1 0x000055bc41856ea1 in connect_backend (sp=0x55bc42c956d8, frontend=0x55bc42c93788) at protocol/child.c:1102
>> >>> > #2 0x000055bc41859042 in get_backend_connection (frontend=0x55bc42c93788) at protocol/child.c:2111
>> >>> > #3 0x000055bc41854fd5 in do_child (fds=0x55bc42c49320) at protocol/child.c:416
>> >>> > #4 0x000055bc4181aa4c in fork_a_child (fds=0x55bc42c49320, id=5) at main/pgpool_main.c:863
>> >>> > #5 0x000055bc418256f7 in exec_child_restart (failover_context=0x7ffdf033c6b0, node_id=0) at main/pgpool_main.c:4684
>> >>> > #6 0x000055bc4181d1dc in failover () at main/pgpool_main.c:1739
>> >>> > #7 0x000055bc4181c79e in sigusr1_interrupt_processor () at main/pgpool_main.c:1507
>> >>> > #8 0x000055bc418263c3 in check_requests () at main/pgpool_main.c:4934
>> >>> > #9 0x000055bc4181a32b in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
>> >>> > #10 0x000055bc418179e6 in main (argc=2, argv=0x7ffdf0349878) at main/main.c:365
>> >>> >
>> >>> > Looking at the backtrace, this happens right after the place where the
>> >>> > first patch made a change. The slots in 'cp' are {0x0, 0x7ff7ca3479b8,
>> >>> > 0x7ff7ca347988, 0x0 <repeats 125 times>}. I've attached the logs and
>> >>> > the coredump. This crash happens during a different test (in both
>> >>> > cases): it crashes in the test for our normal upgrade procedure. The
>> >>> > crash happens when the cluster is in a fully consistent configuration:
>> >>> > all 3 nodes are up and healthy, and node 0 is both the watchdog leader
>> >>> > and running the primary database. Because node 0 is the first to be
>> >>> > upgraded, we need to initiate a failover. For this, we first restart
>> >>> > pgpool to force it to drop its leader status and wait until a new
>> >>> > leader has been elected. We then stop the database to trigger the
>> >>> > failover. The crash seems to happen when the database starts up again.
>> >>> > In the attached logs, the crash was at node 0, but in the other
>> >>> > failure, the crash was at the exact same moment in the test, but at
>> >>> > node 2. I'm also seeing some suspicious backend status reports from
>> >>> > pcp_node_info at that time:
>> >>> >
>> >>> > Node 0 status: 172.29.30.1 5432 2 0.333333 up up standby standby 0 none none 2024-06-13 12:05:00 at Thu Jun 13 12:05:00 CEST 2024
>> >>> > Node 0 status: 172.29.30.1 5432 3 0.333333 down up standby standby 130337 streaming async 2024-06-13 12:05:01 at Thu Jun 13 12:05:01 CEST
>> >>> >
>> >>> > PS: I had to resend the mail with the attachments bzip2ed because it
>> >>> > was hitting the size limit.
>> >>> >
>> >>> > Best regards,
>> >>> > Emond
>> >>> >
>> >>> > On Fri, Jun 14, 2024 at 02:28 Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >>> >
>> >>> >> > For the second crash (in pool_connection_pool.c:326), all slots are
>> >>> >> > null:
>> >>> >> > (gdb) info locals
>> >>> >> > i = 0
>> >>> >> > freed = 0
>> >>> >> > closetime = 0
>> >>> >> > oldestp = 0x7f474f36e168
>> >>> >> > ret = 0x0
>> >>> >> > info = 0x55b511283469 <opt_sort>
>> >>> >> > p = 0x7f474f36e168
>> >>> >> >
>> >>> >> > (gdb) p *p
>> >>> >> > $7 = {
>> >>> >> > info = 0x7f47470d5c08,
>> >>> >> > slots = {0x0 <repeats 128 times>}
>> >>> >> > }
>> >>> >> >
>> >>> >> > (gdb) p p->slots[0]
>> >>> >> > $10 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> >>> >> > (gdb) p p->slots[1]
>> >>> >> > $11 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> >>> >> > (gdb) p p->slots[2]
>> >>> >> > $12 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> >>> >>
>> >>> >> A possible explanation is that Req_info->main_node_id, which is the
>> >>> >> smallest backend node id that is alive, was -1. From the log file just
>> >>> >> before pid 30 segfaults:
>> >>> >>
>> >>> >> 2024-05-31T07:16:45.908900+02:00 2024-05-31 07:16:45: pid 1: LOG: backend:0 is set to down status
>> >>> >> 2024-05-31T07:16:45.908939+02:00 2024-05-31 07:16:45: pid 1: DETAIL: backend:0 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
>> >>> >> 2024-05-31T07:16:45.908987+02:00 2024-05-31 07:16:45: pid 1: LOG: backend:1 is set to down status
>> >>> >> 2024-05-31T07:16:45.909017+02:00 2024-05-31 07:16:45: pid 1: DETAIL: backend:1 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
>> >>> >> 2024-05-31T07:16:45.909044+02:00 2024-05-31 07:16:45: pid 1: LOG: backend:2 is set to down status
>> >>> >> 2024-05-31T07:16:45.909071+02:00 2024-05-31 07:16:45: pid 1: DETAIL: backend:2 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
>> >>> >>
>> >>> >> That means all backends were down, and Req_info->main_node_id should
>> >>> >> have been -1 at the time. Usually, if all backends are down, pgpool
>> >>> >> will not accept connections from a frontend. I think when the check
>> >>> >> was performed, not all backends were down yet; right after that, I
>> >>> >> guess, all backends went down. If so, at line 270 of
>> >>> >> pool_connection_pool.c:
>> >>> >>
>> >>> >> 270 if (MAIN_CONNECTION(p) == NULL)
>> >>> >>
>> >>> >> MAIN_CONNECTION (that is, p->slots[-1], because main_node_id == -1)
>> >>> >> looked into garbage, and the above condition was not met. Then the
>> >>> >> code proceeded to line 326:
>> >>> >>
>> >>> >> 326             pool_free_startup_packet(CONNECTION_SLOT(p, i)->sp);
>> >>> >>
>> >>> >> Here CONNECTION_SLOT(p, i) was NULL, and dereferencing it segfaulted.
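>> >>> >>
>> >>> >> To illustrate the failure mode, here is a minimal sketch of the
>> >>> >> macros involved; this is a hypothetical approximation, not the
>> >>> >> literal pgpool headers:
>> >>> >>
>> >>> >> /* hypothetical approximation of the macros discussed above */
>> >>> >> #define MAIN_NODE_ID          (Req_info->main_node_id)
>> >>> >> #define MAIN_CONNECTION(p)    ((p)->slots[MAIN_NODE_ID])
>> >>> >> #define CONNECTION_SLOT(p, i) ((p)->slots[(i)])
>> >>> >>
>> >>> >> With main_node_id == -1, MAIN_CONNECTION(p) reads (p)->slots[-1],
>> >>> >> i.e. the memory just before the slots array, so the NULL check at
>> >>> >> line 270 can pass on garbage even though every real slot is NULL.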
>> >>> >>
>> >>> >> I think using the MAIN_CONNECTION macro to take care of connection
>> >>> >> pools is incorrect: which backend is alive in each connection pool
>> >>> >> slot can differ from the currently alive backends. So I invented a
>> >>> >> function, in_use_backend_id(), for searching for live backend ids in
>> >>> >> a connection slot. Attached is a patch in this direction. I hope the
>> >>> >> patch fixes the second segfault case.
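>> >>> >>
>> >>> >> As a minimal sketch of the idea (the actual patch may differ; the
>> >>> >> types and the MAX_NUM_BACKENDS constant are assumptions based on the
>> >>> >> gdb output above, which shows 128 slots):
>> >>> >>
>> >>> >> /* hypothetical sketch: choose a live backend id from the pool
>> >>> >>    entry itself instead of trusting Req_info->main_node_id */
>> >>> >> static int
>> >>> >> in_use_backend_id(POOL_CONNECTION_POOL *pool)
>> >>> >> {
>> >>> >>     int i;
>> >>> >>
>> >>> >>     for (i = 0; i < MAX_NUM_BACKENDS; i++)
>> >>> >>         if (pool->slots[i])
>> >>> >>             return i;
>> >>> >>     return -1;          /* every slot empty; callers must check */
>> >>> >> }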
>> >>> >>
>> >>> >> > I hope this helps.
>> >>> >> >
>> >>> >> > Best regards,
>> >>> >> > Emond
>> >>> >> >
>> >>> >> > On Tue, Jun 4, 2024 at 04:56 Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >>> >> >
>> >>> >> >> > No worries. I hope you had a good trip. Last night we triggered
>> >>> >> >> > the last crash again. Is there anything we can do to make it
>> >>> >> >> > easier for you to find the cause?
>> >>> >> >>
>> >>> >> >> It would be helpful if you could share some variable values in the
>> >>> >> >> core file. Since I don't have the pgpool load module from when you
>> >>> >> >> got the core, I cannot inspect the variables using the core you
>> >>> >> >> provided.
>> >>> >> >>
>> >>> >> >> > #0 connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at protocol/child.c:1076
>> >>> >> >> > #1 0x000055803ce3d02a in get_backend_connection (frontend=0x55803eb08768) at protocol/child.c:2112
>> >>> >> >> > #2 0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at protocol/child.c:416
>> >>> >> >> > #3 0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at main/pgpool_main.c:863
>> >>> >> >> > #4 0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>> >>> >> >> > #5 0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at main/main.c:365
>> >>> >> >>
>> >>> >> >> Around protocol/child.c:1076:
>> >>> >> >>
>> >>> >> >>     /* set DB node id */
>> >>> >> >>     pool_set_db_node_id(CONNECTION(backend, i), i);
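>> >>> >> >>
>> >>> >> >> My assumption (not verified against the headers) is that the macro
>> >>> >> >> expands to roughly the following, so a NULL slots[i] would crash
>> >>> >> >> right here:
>> >>> >> >>
>> >>> >> >> /* hypothetical expansion of the CONNECTION macro */
>> >>> >> >> #define CONNECTION(p, slot) (CONNECTION_SLOT(p, slot)->con)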
>> >>> >> >>
>> >>> >> >> I want to see the values in the "backend" struct. Since the
>> >>> >> >> CONNECTION macro is used here, you have to do something like this
>> >>> >> >> in a gdb session:
>> >>> >> >>
>> >>> >> >> p *backend->slots[0]
>> >>> >> >> p *backend->slots[1]
>> >>> >> >> p *backend->slots[2]
>> >>> >> >>
>> >>> >> >> Best regards,
>> >>> >> >> --
>> >>> >> >> Tatsuo Ishii
>> >>> >> >> SRA OSS LLC
>> >>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >>> >> >> Japanese: http://www.sraoss.co.jp
>> >>> >> >>
>> >>> >>
>> >>>
>> >>
>>