[pgpool-general: 9137] Re: Another segmentation fault

Emond Papegaaij emond.papegaaij at gmail.com
Thu Jun 20 17:55:33 JST 2024


Hi,

I've managed to get some more logging for 2 crashes. The first, in build
14132, comes with core dump
core.pgpool.208.d678e2ac33af434981c93683b6c42020.102976.1718650919000000.bz2:
#0  pool_read_message_length (cp=0x7f53440bb548) at protocol/pool_proto_modules.c:4483
#1  0x000056386aa1f530 in pool_do_auth (frontend=0x56386b6c4788, cp=0x7f53440bb548) at auth/pool_auth.c:384
#2  0x000056386aa33ea1 in connect_backend (sp=0x56386b6c66d8, frontend=0x56386b6c4788) at protocol/child.c:1102
#3  0x000056386aa36042 in get_backend_connection (frontend=0x56386b6c4788) at protocol/child.c:2111
#4  0x000056386aa31fd5 in do_child (fds=0x56386b67a320) at protocol/child.c:416
#5  0x000056386a9f7a4c in fork_a_child (fds=0x56386b67a320, id=30) at main/pgpool_main.c:863
#6  0x000056386a9fa9ed in reaper () at main/pgpool_main.c:2004
#7  0x000056386aa0340e in check_requests () at main/pgpool_main.c:4953
#8  0x000056386a9f732b in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
#9  0x000056386a9f49e6 in main (argc=2, argv=0x7ffeeb474048) at main/main.c:365

The second crash is in build 14136. The location of the crash seems to have
moved a bit since my previous report, due to the higher log level; it now
crashes in a log statement. The core dump that goes with this crash is
core.pgpool.208.6a941167afab4df68e5f050ef29c7d1b.5129.1718739212000000.bz2.
#0  0x0000561e716f1fa7 in pool_create_cp () at protocol/pool_connection_pool.c:293
#1  0x0000561e716ddbb5 in connect_backend (sp=0x561e726d06b8, frontend=0x561e726ce768) at protocol/child.c:1051
#2  0x0000561e716e0042 in get_backend_connection (frontend=0x561e726ce768) at protocol/child.c:2111
#3  0x0000561e716dbfd5 in do_child (fds=0x561e72684320) at protocol/child.c:416
#4  0x0000561e716a1a4c in fork_a_child (fds=0x561e72684320, id=1) at main/pgpool_main.c:863
#5  0x0000561e716a0e30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
#6  0x0000561e7169e9e6 in main (argc=2, argv=0x7ffcf8eef198) at main/main.c:365

Best regards,
Emond

On Thu, Jun 20, 2024 at 09:05, Emond Papegaaij <emond.papegaaij at gmail.com> wrote:

> Hi,
>
> I'm very sorry. Since we've increased the logging of the container, we
> sometimes seem to get some gaps in the logs during reboots. I guess the
> virtual machine is having trouble getting all logs out to disk before it
> reboots. The attached log file does contain the logs for the segmentation
> fault of this crash.
>
> Best regards,
> Emond
>
> On Wed, Jun 19, 2024 at 14:40, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>
>> Sorry for the delay. I looked into pgpool-upgrade-test-crash.log.bz2,
>> but failed to find the "segmentation fault" string.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS LLC
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese: http://www.sraoss.co.jp
>>
>> > Hi,
>> >
>> > Thanks for the patch. I'll add it to the build of our pgpool containers.
>> >
>> > Since I added the previous patch, I've got the feeling that the number
>> > of segmentation faults has been reduced. It's hard to say, though,
>> > whether the problem is really fixed, because some of these crashes
>> > happen very infrequently.
>> >
>> > Yesterday, we did see 2 crashes, both with the same backtrace, which I
>> > haven't seen before:
>> > #0  pool_do_auth (frontend=0x55bc42c93788, cp=0x7ff7ca34f6b8) at auth/pool_auth.c:349
>> > #1  0x000055bc41856ea1 in connect_backend (sp=0x55bc42c956d8, frontend=0x55bc42c93788) at protocol/child.c:1102
>> > #2  0x000055bc41859042 in get_backend_connection (frontend=0x55bc42c93788) at protocol/child.c:2111
>> > #3  0x000055bc41854fd5 in do_child (fds=0x55bc42c49320) at protocol/child.c:416
>> > #4  0x000055bc4181aa4c in fork_a_child (fds=0x55bc42c49320, id=5) at main/pgpool_main.c:863
>> > #5  0x000055bc418256f7 in exec_child_restart (failover_context=0x7ffdf033c6b0, node_id=0) at main/pgpool_main.c:4684
>> > #6  0x000055bc4181d1dc in failover () at main/pgpool_main.c:1739
>> > #7  0x000055bc4181c79e in sigusr1_interrupt_processor () at main/pgpool_main.c:1507
>> > #8  0x000055bc418263c3 in check_requests () at main/pgpool_main.c:4934
>> > #9  0x000055bc4181a32b in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
>> > #10 0x000055bc418179e6 in main (argc=2, argv=0x7ffdf0349878) at main/main.c:365
>> >
>> > Looking at the backtrace, this happens right after the place where the
>> > first patch made a change. The slots in 'cp' are {0x0, 0x7ff7ca3479b8,
>> > 0x7ff7ca347988, 0x0 <repeats 125 times>}. I've attached the logs and
>> > the coredump. In both cases, the crash happened during a different
>> > test: the test for our normal upgrade procedure. The crash happens when
>> > the cluster is in a fully consistent configuration, with all 3 nodes up
>> > and healthy. Node 0 is both the watchdog leader and running the primary
>> > database. Because node 0 is the first to be upgraded, we need to
>> > initiate a failover. For this, we first restart pgpool to force it to
>> > drop its leader status and wait until a new leader has been elected. We
>> > then stop the database to trigger the failover. The crash seems to
>> > happen when the database starts up again. In the attached logs, the
>> > crash was at node 0, but in the other failure, the crash was at the
>> > exact same moment in the test, but at node 2. I'm also seeing some
>> > suspicious backend status reports from pcp_node_info at that time:
>> >
>> > Node 0 status: 172.29.30.1 5432 2 0.333333 up up standby standby 0 none none 2024-06-13 12:05:00 at Thu Jun 13 12:05:00 CEST 2024
>> > Node 0 status: 172.29.30.1 5432 3 0.333333 down up standby standby 130337 streaming async 2024-06-13 12:05:01 at Thu Jun 13 12:05:01 CEST 2024
>> >
>> > PS: I had to resend the mail with the attachments bzip2'ed because it
>> > was hitting the size limit.
>> >
>> > Best regards,
>> > Emond
>> >
>> > On Fri, Jun 14, 2024 at 02:28, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >
>> >> > For the second crash (in pool_connection_pool.c:326), all slots are null:
>> >> > (gdb) info locals
>> >> > i = 0
>> >> > freed = 0
>> >> > closetime = 0
>> >> > oldestp = 0x7f474f36e168
>> >> > ret = 0x0
>> >> > info = 0x55b511283469 <opt_sort>
>> >> > p = 0x7f474f36e168
>> >> >
>> >> > (gdb) p *p
>> >> > $7 = {
>> >> >   info = 0x7f47470d5c08,
>> >> >   slots = {0x0 <repeats 128 times>}
>> >> > }
>> >> >
>> >> > (gdb) p p->slots[0]
>> >> > $10 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> >> > (gdb) p p->slots[1]
>> >> > $11 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> >> > (gdb) p p->slots[2]
>> >> > $12 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> >>
>> >> A possible explanation is that Req_info->main_node_id, which is the
>> >> smallest node id among the backends that are alive, was -1. From the
>> >> log file just before pid 30 segfaults:
>> >>
>> >> 2024-05-31T07:16:45.908900+02:00 2024-05-31 07:16:45: pid 1: LOG: backend:0 is set to down status
>> >> 2024-05-31T07:16:45.908939+02:00 2024-05-31 07:16:45: pid 1: DETAIL: backend:0 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
>> >> 2024-05-31T07:16:45.908987+02:00 2024-05-31 07:16:45: pid 1: LOG: backend:1 is set to down status
>> >> 2024-05-31T07:16:45.909017+02:00 2024-05-31 07:16:45: pid 1: DETAIL: backend:1 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
>> >> 2024-05-31T07:16:45.909044+02:00 2024-05-31 07:16:45: pid 1: LOG: backend:2 is set to down status
>> >> 2024-05-31T07:16:45.909071+02:00 2024-05-31 07:16:45: pid 1: DETAIL: backend:2 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
>> >>
>> >> That means all backends were down, and Req_info->main_node_id should
>> >> have been -1 at the time. Usually, if all backends are down, pgpool
>> >> will not accept connections from the frontend. I think that when the
>> >> check was performed, not all backends were down yet; I guess they all
>> >> went down right after it. If so, at line 270 of pool_connection_pool.c:
>> >>
>> >> 270                 if (MAIN_CONNECTION(p) == NULL)
>> >>
>> >> MAIN_CONNECTION (that is, p->slots[-1], because main_node_id == -1)
>> >> read garbage, so the condition above was not met. The code then
>> >> proceeded to line 326:
>> >>
>> >> 326                 pool_free_startup_packet(CONNECTION_SLOT(p, i)->sp);
>> >>
>> >> Here CONNECTION_SLOT(p, i) was NULL, and dereferencing it segfaulted.
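>> >>
>> >> A minimal sketch of this failure mode (simplified types and macros;
>> >> the real pgpool definitions differ in detail, but the struct layout
>> >> matches the gdb output above, where info sits just before slots[]):
>> >>
>> >> #include <stddef.h>
>> >>
>> >> #define MAX_NUM_BACKENDS 128
>> >>
>> >> typedef struct POOL_CONNECTION_POOL_SLOT POOL_CONNECTION_POOL_SLOT;
>> >>
>> >> typedef struct
>> >> {
>> >>     void *info;                             /* just before slots[] */
>> >>     POOL_CONNECTION_POOL_SLOT *slots[MAX_NUM_BACKENDS];
>> >> } POOL_CONNECTION_POOL;
>> >>
>> >> static int main_node_id = -1;               /* all backends down */
>> >>
>> >> /* hypothetical simplifications of the real macros */
>> >> #define MAIN_CONNECTION(p)    ((p)->slots[main_node_id])
>> >> #define CONNECTION_SLOT(p, i) ((p)->slots[(i)])
>> >>
>> >> static int pool_entry_looks_usable(POOL_CONNECTION_POOL *p)
>> >> {
>> >>     /* With main_node_id == -1 this reads p->slots[-1], i.e. the
>> >>      * memory just before the array (the info pointer): non-NULL
>> >>      * garbage, so the guard does not fire even though every real
>> >>      * slot is NULL... */
>> >>     if (MAIN_CONNECTION(p) == NULL)
>> >>         return 0;
>> >>
>> >>     /* ...and execution reaches code like line 326, where slot i is
>> >>      * NULL and CONNECTION_SLOT(p, i)->sp dereferences a NULL
>> >>      * pointer: the segfault. */
>> >>     return 1;
>> >> }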
>> >>
>> >> I think using the MAIN_CONNECTION macro to take care of connection
>> >> pools is incorrect: which backend is alive in each connection pool
>> >> slot could differ from the currently alive backends. So I invented a
>> >> function, in_use_backend_id(), for searching for live backend ids in a
>> >> connection slot. Attached is a patch in this direction. I hope the
>> >> patch fixes the second segfault case.
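>> >>
>> >> Roughly, the idea (a sketch, reusing the simplified types from the
>> >> sketch above; the actual attached patch may differ):
>> >>
>> >> /* Return the smallest backend id for which this particular pool
>> >>  * entry actually holds a connection, or -1 if it holds none.
>> >>  * Unlike Req_info->main_node_id, this reflects the state of the
>> >>  * pool entry itself, not the current cluster-wide status. */
>> >> static int in_use_backend_id(POOL_CONNECTION_POOL *pool)
>> >> {
>> >>     int i;
>> >>
>> >>     for (i = 0; i < MAX_NUM_BACKENDS; i++)
>> >>         if (pool->slots[i] != NULL)
>> >>             return i;
>> >>     return -1;
>> >> }
>> >>
>> >> Callers would then check for -1 and use
>> >> pool->slots[in_use_backend_id(pool)] instead of MAIN_CONNECTION(pool).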
>> >>
>> >> > I hope this helps.
>> >> >
>> >> > Best regards,
>> >> > Emond
>> >> >
>> >> > On Tue, Jun 4, 2024 at 04:56, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >> >
>> >> >> > No worries. I hope you had a good trip. Last night we triggered
>> >> >> > the last crash again. Is there anything we can do to make it
>> >> >> > easier for you to find the cause?
>> >> >>
>> >> >> It would be helpful if you could share some variable values from
>> >> >> the core file. Since I don't have the pgpool load module from when
>> >> >> you got the core, I cannot inspect the variables using the core you
>> >> >> provided.
>> >> >>
>> >> >> > #0  connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at protocol/child.c:1076
>> >> >> > #1  0x000055803ce3d02a in get_backend_connection (frontend=0x55803eb08768) at protocol/child.c:2112
>> >> >> > #2  0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at protocol/child.c:416
>> >> >> > #3  0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at main/pgpool_main.c:863
>> >> >> > #4  0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>> >> >> > #5  0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at main/main.c:365
>> >> >>
>> >> >> Around protocol/child.c:1076:
>> >> >>
>> >> >>                                 /* set DB node id */
>> >> >>                                 pool_set_db_node_id(CONNECTION(backend, i), i);
>> >> >>
>> >> >> I want to see the values in the "backend" struct. Since the
>> >> >> CONNECTION macro is used here, you have to do something like this
>> >> >> in a gdb session:
>> >> >>
>> >> >> p *backend->slots[0]
>> >> >> p *backend->slots[1]
>> >> >> p *backend->slots[2]
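>> >> >>
>> >> >> Assuming the core still resolves globals, it may also help to print
>> >> >> the shared state the connection-pool code relies on, for example:
>> >> >>
>> >> >> p Req_info->main_node_id
>> >> >> p backend->info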
>> >> >>
>> >> >> Best regards,
>> >> >> --
>> >> >> Tatsuo Ishii
>> >> >> SRA OSS LLC
>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> Japanese: http://www.sraoss.co.jp
>> >> >>
>> >>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 14136.log.bz2
Type: application/x-bzip2
Size: 54691 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240620/9dd91850/attachment.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 14132.log.bz2
Type: application/x-bzip2
Size: 554601 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240620/9dd91850/attachment-0001.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: core.pgpool.208.d678e2ac33af434981c93683b6c42020.102976.1718650919000000.bz2
Type: application/x-bzip2
Size: 925961 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240620/9dd91850/attachment-0002.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: core.pgpool.208.6a941167afab4df68e5f050ef29c7d1b.5129.1718739212000000.bz2
Type: application/x-bzip2
Size: 904146 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240620/9dd91850/attachment-0003.bz2>

