[pgpool-general: 9138] Re: Another segmentation fault

Thu Jun 20 18:46:15 JST 2024

Thank for the info. Is it possible for you to share the pgpool load module?
With it, I could inspect variable values in the core myself using gdb.

> Hi,
> 
> I've managed to get some more logging for 2 crashes. The first, in build
> 14132 (with coredump
> core.pgpool.208.d678e2ac33af434981c93683b6c42020.102976.1718650919000000.bz2):
> #0  pool_read_message_length (cp=0x7f53440bb548) at
> protocol/pool_proto_modules.c:4483
> #1  0x000056386aa1f530 in pool_do_auth (frontend=0x56386b6c4788,
> cp=0x7f53440bb548) at auth/pool_auth.c:384
> #2  0x000056386aa33ea1 in connect_backend (sp=0x56386b6c66d8,
> frontend=0x56386b6c4788) at protocol/child.c:1102
> #3  0x000056386aa36042 in get_backend_connection (frontend=0x56386b6c4788)
> at protocol/child.c:2111
> #4  0x000056386aa31fd5 in do_child (fds=0x56386b67a320) at
> protocol/child.c:416
> #5  0x000056386a9f7a4c in fork_a_child (fds=0x56386b67a320, id=30) at
> main/pgpool_main.c:863
> #6  0x000056386a9fa9ed in reaper () at main/pgpool_main.c:2004
> #7  0x000056386aa0340e in check_requests () at main/pgpool_main.c:4953
> #8  0x000056386a9f732b in PgpoolMain (discard_status=0 '\000',
> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
> #9  0x000056386a9f49e6 in main (argc=2, argv=0x7ffeeb474048) at
> main/main.c:365
> 
> The second, in build 14136:
> It seems the location of the crash moved a bit, since my previous report
> due to the higher loglevel. It now crashes in a log statement. The coredump
> that goes with this crash
> is: core.pgpool.208.6a941167afab4df68e5f050ef29c7d1b.5129.1718739212000000.bz2.
> #0  0x0000561e716f1fa7 in pool_create_cp () at
> protocol/pool_connection_pool.c:293
> #1  0x0000561e716ddbb5 in connect_backend (sp=0x561e726d06b8,
> frontend=0x561e726ce768) at protocol/child.c:1051
> #2  0x0000561e716e0042 in get_backend_connection (frontend=0x561e726ce768)
> at protocol/child.c:2111
> #3  0x0000561e716dbfd5 in do_child (fds=0x561e72684320) at
> protocol/child.c:416
> #4  0x0000561e716a1a4c in fork_a_child (fds=0x561e72684320, id=1) at
> main/pgpool_main.c:863
> #5  0x0000561e716a0e30 in PgpoolMain (discard_status=0 '\000',
> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
> #6  0x0000561e7169e9e6 in main (argc=2, argv=0x7ffcf8eef198) at
> main/main.c:365
> 
> Best regards,
> Emond
> 
> Op do 20 jun 2024 om 09:05 schreef Emond Papegaaij <
> emond.papegaaij at gmail.com>:
> 
>> Hi,
>>
>> I'm very sorry. Since we've increased the logging of the container, we
>> sometimes seem to get some gaps in the logs during reboots. I guess the
>> virtual machine is having trouble getting all logs out to disk before it
>> reboots. The attached logfile does contain the logs for the segmentation
>> fault for this crash.
>>
>> Best regards,
>> Emond
>>
>> Op wo 19 jun 2024 om 14:40 schreef Tatsuo Ishii <ishii at sraoss.co.jp>:
>>
>>> Sorry for delay. I looked into the pgpool-upgrade-test-crash.log.bz2,
>>> but failed to find "segmentation fault" string.
>>>
>>> Best reagards,
>>> --
>>> Tatsuo Ishii
>>> SRA OSS LLC
>>> English: http://www.sraoss.co.jp/index_en/
>>> Japanese:http://www.sraoss.co.jp
>>>
>>> > Hi,
>>> >
>>> > Thanks for the patch. I'll add it to the build of our pgpool containers.
>>> >
>>> > Since I've added the previous patch, I've got the feeling that the
>>> number
>>> > of segmentation faults has been reduced. It's hard to say though if the
>>> > problem is really fixed, because some of these crashes happen very
>>> > infrequently.
>>> >
>>> > Yesterday, we did see 2 crashes, both with the same backtrace, which I
>>> > haven't seen before:
>>> > #0  pool_do_auth (frontend=0x55bc42c93788, cp=0x7ff7ca34f6b8) at
>>> > auth/pool_auth.c:349
>>> > #1  0x000055bc41856ea1 in connect_backend (sp=0x55bc42c956d8,
>>> > frontend=0x55bc42c93788) at protocol/child.c:1102
>>> > #2  0x000055bc41859042 in get_backend_connection
>>> (frontend=0x55bc42c93788)
>>> > at protocol/child.c:2111
>>> > #3  0x000055bc41854fd5 in do_child (fds=0x55bc42c49320) at
>>> > protocol/child.c:416
>>> > #4  0x000055bc4181aa4c in fork_a_child (fds=0x55bc42c49320, id=5) at
>>> > main/pgpool_main.c:863
>>> > #5  0x000055bc418256f7 in exec_child_restart
>>> > (failover_context=0x7ffdf033c6b0, node_id=0) at main/pgpool_main.c:4684
>>> > #6  0x000055bc4181d1dc in failover () at main/pgpool_main.c:1739
>>> > #7  0x000055bc4181c79e in sigusr1_interrupt_processor () at
>>> > main/pgpool_main.c:1507
>>> > #8  0x000055bc418263c3 in check_requests () at main/pgpool_main.c:4934
>>> > #9  0x000055bc4181a32b in PgpoolMain (discard_status=0 '\000',
>>> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
>>> > #10 0x000055bc418179e6 in main (argc=2, argv=0x7ffdf0349878) at
>>> > main/main.c:365
>>> >
>>> > Looking at the backtrace, this happens right after the place where the
>>> > first patch made a change. The slots in 'cp' are {0x0, 0x7ff7ca3479b8,
>>> > 0x7ff7ca347988, 0x0 <repeats 125 times>}. I've attached the logs and the
>>> > coredump. This crash happens during a different test (in both cases). It
>>> > crashes in the test for our normal upgrade procedure. The crash happens
>>> > when the cluster is in a fully consistent configuration, all 3 nodes
>>> are up
>>> > and healthy. node 0 is both watchdog leader and running the primary
>>> > database. Because node 0 is the first to be upgraded, we need to
>>> initiate a
>>> > failover. For this, we first restart pgpool to force it to drop its
>>> leader
>>> > status and wait until a new leader has been elected. We then stop the
>>> > database to trigger the failover. The crash seems to happen when the
>>> > database started up again. In the attached logs, the crash was at node
>>> 0,
>>> > but in the other failure, the crash was at the exact same moment in the
>>> > test, but at node 2. I'm also seeing some suspicious backend status
>>> reports
>>> > from pcp_node_info at that time:
>>> >
>>> > Node 0 status: 172.29.30.1 5432 2 0.333333 up up standby standby 0 none
>>> > none 2024-06-13 12:05:00 at Thu Jun 13 12:05:00 CEST 2024
>>> > Node 0 status: 172.29.30.1 5432 3 0.333333 down up standby standby
>>> 130337
>>> > streaming async 2024-06-13 12:05:01 at Thu Jun 13 12:05:01 CEST
>>> >
>>> > ps. I had to resend the mail with the attachments bzip2ed because it was
>>> > hitting the size limit.
>>> >
>>> > Best regards,
>>> > Emond
>>> >
>>> > Op vr 14 jun 2024 om 02:28 schreef Tatsuo Ishii <ishii at sraoss.co.jp>:
>>> >
>>> >> > For the second crash (in pool_connection_pool.c:326), all slots are
>>> null:
>>> >> > (gdb) info locals
>>> >> > i = 0
>>> >> > freed = 0
>>> >> > closetime = 0
>>> >> > oldestp = 0x7f474f36e168
>>> >> > ret = 0x0
>>> >> > info = 0x55b511283469 <opt_sort>
>>> >> > p = 0x7f474f36e168
>>> >> >
>>> >> > (gdb) p *p
>>> >> > $7 = {
>>> >> >   info = 0x7f47470d5c08,
>>> >> >   slots = {0x0 <repeats 128 times>}
>>> >> > }
>>> >> >
>>> >> > (gdb) p p->slots[0]
>>> >> > $10 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>>> >> > (gdb) p p->slots[1]
>>> >> > $11 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>>> >> > (gdb) p p->slots[2]
>>> >> > $12 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>>> >>
>>> >> Possible explanation is, Req_info->main_node_id, which is the smallest
>>> >> backend node id number being alive, was -1. From the log file just
>>> >> before pid 30 segfaults:
>>> >>
>>> >> 2024-05-31T07:16:45.908900+02:00 2024-05-31 07:16:45: pid 1: LOG:
>>> >> backend:0 is set to down status
>>> >> 2024-05-31T07:16:45.908939+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
>>> >> backend:0 is DOWN on cluster leader "172.29.30.2:5432 Linux
>>> 216dfd5e07f2"
>>> >> 2024-05-31T07:16:45.908987+02:00 2024-05-31 07:16:45: pid 1: LOG:
>>> >> backend:1 is set to down status
>>> >> 2024-05-31T07:16:45.909017+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
>>> >> backend:1 is DOWN on cluster leader "172.29.30.2:5432 Linux
>>> 216dfd5e07f2"
>>> >> 2024-05-31T07:16:45.909044+02:00 2024-05-31 07:16:45: pid 1: LOG:
>>> >> backend:2 is set to down status
>>> >> 2024-05-31T07:16:45.909071+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
>>> >> backend:2 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5
>>> >>
>>> >> That means all backends were down, and Req_info->main_node_id should
>>> >> have been -1 at the time. Usually if all backend are down, pgpool will
>>> >> not accept connections from frontend. I think when the check was
>>> >> peformed, not all backend are down. Right after this all backend went
>>> >> down I guess. If so, at line 270 of pool_connection_pool.c:
>>> >>
>>> >> 270                 if (MAIN_CONNECTION(p) == NULL)
>>> >>
>>> >> MAIN_CONNECTION (that is p->slots[-1] because main_node_id == -1)
>>> >> looked into garbage and above condition was not met. Then the code
>>> >> proceeded to line 326:
>>> >>
>>> >>  326
>>>  pool_free_startup_packet(CONNECTION_SLOT(p,
>>> >> i)->sp);
>>> >>
>>> >> Here CONNECTION_SLOT(p, i) is NULL and segfaulted.
>>> >>
>>> >> I think using MAIN_CONNECTION macro to take care of connection
>>> >> poolings is incorrect. Which backend is alive in the each connection
>>> >> pool slot could be different from current alive backend. So I invented
>>> >> a function in_use_backend_id() for seaching live backend ids in a
>>> >> connection slot. Attached is the patch in this direction. I hope the
>>> >> patch to fix the second segfault case.
>>> >>
>>> >> > I hope this helps.
>>> >> >
>>> >> > Best regards,
>>> >> > Emond
>>> >> >
>>> >> > Op di 4 jun 2024 om 04:56 schreef Tatsuo Ishii <ishii at sraoss.co.jp>:
>>> >> >
>>> >> >> > No worries. I hope you had a good trip. Last night we triggered
>>> the
>>> >> last
>>> >> >> > crash again. Is there anything we can do to make it easier for
>>> you to
>>> >> >> find
>>> >> >> > the cause?
>>> >> >>
>>> >> >> It would be helpful if you could share some variable values in the
>>> >> >> core file. Since I don't have the pgpool load module when you got
>>> the
>>> >> >> core, I cannot inspect the variables using the core you provided.
>>> >> >>
>>> >> >> > #0  connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768)
>>> at
>>> >> >> > protocol/child.c:1076
>>> >> >> > #1  0x000055803ce3d02a in get_backend_connection
>>> >> >> (frontend=0x55803eb08768)
>>> >> >> > at protocol/child.c:2112
>>> >> >> > #2  0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at
>>> >> >> > protocol/child.c:416
>>> >> >> > #3  0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90,
>>> id=13) at
>>> >> >> > main/pgpool_main.c:863
>>> >> >> > #4  0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000',
>>> >> >> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>>> >> >> > #5  0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at
>>> >> >> > main/main.c:365
>>> >> >>
>>> >> >> Around protocol/child.c:1076:
>>> >> >>                                 /* set DB node id */
>>> >> >>
>>>  pool_set_db_node_id(CONNECTION(backend,
>>> >> >> i), i);
>>> >> >>
>>> >> >> I want to see the values in the "backend" struct. Since CONNECTION
>>> >> >> macro is used here, you have to do something like in a gdb session.
>>> >> >>
>>> >> >> p *backend->slots[0]
>>> >> >> p *backend->slots[1]
>>> >> >> p *backend->slots[2]
>>> >> >>
>>> >> >> Best reagards,
>>> >> >> --
>>> >> >> Tatsuo Ishii
>>> >> >> SRA OSS LLC
>>> >> >> English: http://www.sraoss.co.jp/index_en/
>>> >> >> Japanese:http://www.sraoss.co.jp
>>> >> >>
>>> >>
>>>
>>