[pgpool-general: 9135] Re: Another segmentation fault

Tatsuo Ishii ishii at sraoss.co.jp
Wed Jun 19 21:40:38 JST 2024


Sorry for the delay. I looked into pgpool-upgrade-test-crash.log.bz2,
but failed to find the string "segmentation fault".

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp

> Hi,
> 
> Thanks for the patch. I'll add it to the build of our pgpool containers.
> 
> Since I added the previous patch, I have the feeling that the number of
> segmentation faults has been reduced. It's hard to say whether the problem
> is really fixed, though, because some of these crashes happen very
> infrequently.
> 
> Yesterday we saw two crashes, both with the same backtrace, which I hadn't
> seen before:
> #0  pool_do_auth (frontend=0x55bc42c93788, cp=0x7ff7ca34f6b8) at
> auth/pool_auth.c:349
> #1  0x000055bc41856ea1 in connect_backend (sp=0x55bc42c956d8,
> frontend=0x55bc42c93788) at protocol/child.c:1102
> #2  0x000055bc41859042 in get_backend_connection (frontend=0x55bc42c93788)
> at protocol/child.c:2111
> #3  0x000055bc41854fd5 in do_child (fds=0x55bc42c49320) at
> protocol/child.c:416
> #4  0x000055bc4181aa4c in fork_a_child (fds=0x55bc42c49320, id=5) at
> main/pgpool_main.c:863
> #5  0x000055bc418256f7 in exec_child_restart
> (failover_context=0x7ffdf033c6b0, node_id=0) at main/pgpool_main.c:4684
> #6  0x000055bc4181d1dc in failover () at main/pgpool_main.c:1739
> #7  0x000055bc4181c79e in sigusr1_interrupt_processor () at
> main/pgpool_main.c:1507
> #8  0x000055bc418263c3 in check_requests () at main/pgpool_main.c:4934
> #9  0x000055bc4181a32b in PgpoolMain (discard_status=0 '\000',
> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
> #10 0x000055bc418179e6 in main (argc=2, argv=0x7ffdf0349878) at
> main/main.c:365
> 
> Looking at the backtrace, this happens right after the place where the
> first patch made a change. The slots in 'cp' are {0x0, 0x7ff7ca3479b8,
> 0x7ff7ca347988, 0x0 <repeats 125 times>}. I've attached the logs and the
> coredump. In both cases, this crash happened during a different test: the
> test for our normal upgrade procedure. The crash happens when the cluster
> is in a fully consistent configuration: all 3 nodes are up and healthy.
> Node 0 is both the watchdog leader and running the primary database.
> Because node 0 is the first to be upgraded, we need to initiate a
> failover. For this, we first restart pgpool to force it to drop its leader
> status and wait until a new leader has been elected. We then stop the
> database to trigger the failover. The crash seems to happen when the
> database starts up again. In the attached logs, the crash was at node 0,
> but in the other failure the crash was at the exact same moment in the
> test, but at node 2. I'm also seeing some suspicious backend status reports
> from pcp_node_info at that time:
> 
> Node 0 status: 172.29.30.1 5432 2 0.333333 up up standby standby 0 none
> none 2024-06-13 12:05:00 at Thu Jun 13 12:05:00 CEST 2024
> Node 0 status: 172.29.30.1 5432 3 0.333333 down up standby standby 130337
> streaming async 2024-06-13 12:05:01 at Thu Jun 13 12:05:01 CEST
> 
> PS: I had to resend the mail with the attachments bzip2ed because it was
> hitting the size limit.
> 
> Best regards,
> Emond
> 
> On Fri, Jun 14, 2024 at 02:28, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> > For the second crash (in pool_connection_pool.c:326), all slots are null:
>> > (gdb) info locals
>> > i = 0
>> > freed = 0
>> > closetime = 0
>> > oldestp = 0x7f474f36e168
>> > ret = 0x0
>> > info = 0x55b511283469 <opt_sort>
>> > p = 0x7f474f36e168
>> >
>> > (gdb) p *p
>> > $7 = {
>> >   info = 0x7f47470d5c08,
>> >   slots = {0x0 <repeats 128 times>}
>> > }
>> >
>> > (gdb) p p->slots[0]
>> > $10 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> > (gdb) p p->slots[1]
>> > $11 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> > (gdb) p p->slots[2]
>> > $12 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>>
>> A possible explanation is that Req_info->main_node_id, which is the
>> smallest node id among the backends that are alive, was -1. From the
>> log file just before pid 30 segfaults:
>>
>> 2024-05-31T07:16:45.908900+02:00 2024-05-31 07:16:45: pid 1: LOG:
>> backend:0 is set to down status
>> 2024-05-31T07:16:45.908939+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
>> backend:0 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
>> 2024-05-31T07:16:45.908987+02:00 2024-05-31 07:16:45: pid 1: LOG:
>> backend:1 is set to down status
>> 2024-05-31T07:16:45.909017+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
>> backend:1 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
>> 2024-05-31T07:16:45.909044+02:00 2024-05-31 07:16:45: pid 1: LOG:
>> backend:2 is set to down status
>> 2024-05-31T07:16:45.909071+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
>> backend:2 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5
>>
>> That means all backends were down, and Req_info->main_node_id should
>> have been -1 at the time. Usually, if all backends are down, pgpool
>> does not accept connections from a frontend. I think that when that
>> check was performed, not all backends were down yet; I guess they all
>> went down right after it. If so, at line 270 of pool_connection_pool.c:
>>
>> 270                 if (MAIN_CONNECTION(p) == NULL)
>>
>> MAIN_CONNECTION (that is, p->slots[-1], because main_node_id == -1)
>> read garbage, so the condition above was not met. The code then
>> proceeded to line 326:
>>
>>  326                         pool_free_startup_packet(CONNECTION_SLOT(p, i)->sp);
>>
>> Here CONNECTION_SLOT(p, i) was NULL, so dereferencing it to read ->sp
>> segfaulted.
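>>
>> To illustrate (a simplified sketch, not the exact pgpool macro
>> definitions), the check at line 270 roughly amounts to:
>>
>>     /* MAIN_CONNECTION(p) is essentially p->slots[Req_info->main_node_id],
>>      * so with main_node_id == -1 this reads p->slots[-1], i.e. memory just
>>      * before the slots array, and the NULL test can pass by accident. */
>>     if (p->slots[Req_info->main_node_id] == NULL)
>>         ...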
>>
>> I think using the MAIN_CONNECTION macro to manage the connection pool
>> entries is incorrect: which backends are alive in each connection pool
>> slot can differ from the backends that are currently alive. So I
>> invented a function, in_use_backend_id(), that searches for live
>> backend ids in a connection slot. Attached is a patch in this
>> direction. I hope the patch fixes the second segfault case.
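>>
>> For illustration only, here is a minimal sketch of what such a helper
>> could look like (the actual in_use_backend_id() in the attached patch
>> may differ):
>>
>>     /* Return the id of the first slot in this pool entry that holds a
>>      * live connection, or -1 if every slot is empty. */
>>     static int
>>     in_use_backend_id(POOL_CONNECTION_POOL *pool)
>>     {
>>         int i;
>>
>>         for (i = 0; i < MAX_NUM_BACKENDS; i++)
>>             if (pool->slots[i] != NULL)
>>                 return i;
>>
>>         return -1;
>>     }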
>>
>> > I hope this helps.
>> >
>> > Best regards,
>> > Emond
>> >
>> > On Tue, Jun 4, 2024 at 04:56, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >
>> >> > No worries. I hope you had a good trip. Last night we triggered the last
>> >> > crash again. Is there anything we can do to make it easier for you to find
>> >> > the cause?
>> >>
>> >> It would be helpful if you could share some variable values from the
>> >> core file. Since I don't have the pgpool load module from when you got
>> >> the core, I cannot inspect the variables using the core you provided.
>> >>
>> >> > #0  connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at
>> >> > protocol/child.c:1076
>> >> > #1  0x000055803ce3d02a in get_backend_connection (frontend=0x55803eb08768)
>> >> > at protocol/child.c:2112
>> >> > #2  0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at
>> >> > protocol/child.c:416
>> >> > #3  0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at
>> >> > main/pgpool_main.c:863
>> >> > #4  0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000',
>> >> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>> >> > #5  0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at
>> >> > main/main.c:365
>> >>
>> >> Around protocol/child.c:1076:
>> >>                                 /* set DB node id */
>> >>                                 pool_set_db_node_id(CONNECTION(backend, i), i);
>> >>
>> >> I want to see the values in the "backend" struct. Since the CONNECTION
>> >> macro is used here, you have to do something like the following in a
>> >> gdb session:
>> >>
>> >> p *backend->slots[0]
>> >> p *backend->slots[1]
>> >> p *backend->slots[2]
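>> >>
>> >> For example, assuming you still have the exact pgpool binary (with debug
>> >> symbols) that produced the core, something along these lines should work
>> >> (paths are placeholders):
>> >>
>> >> $ gdb /path/to/pgpool /path/to/core
>> >> (gdb) frame 0
>> >> (gdb) p *backend->slots[0]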
>> >>
>> >> Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS LLC
>> >> English: http://www.sraoss.co.jp/index_en/
>> >> Japanese: http://www.sraoss.co.jp
>> >>
>>


