[pgpool-general: 9178] Re: Segmentation fault

Tatsuo Ishii ishii at postgresql.org
Tue Jul 30 07:44:35 JST 2024


Emond,

Sorry for the late reply, I missed your email. Thanks for
the consideration. Have a nice vacation!

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Hi,
> 
> Thanks for the quick fix! I'm replying from my holiday address, so I'll
> keep it short. I've asked a coworker to add the patch to our build. He will
> monitor the build.
> 
> Best regards,
> Emond
> 
> 
> Op ma 22 jul 2024 09:12 schreef Tatsuo Ishii <ishii at postgresql.org>:
> 
>> Hi Emond,
>>
>> Thanks for the report. I have looked into the log file and ran gdb.
>>
>> > multiple processes that crashed, the backtraces are identical. The
>> coredump
>> > is for pid 5590, which crashed at 01:10:27.
>>
>> I think the pid is 10, rather than 5590.
>>
>> In summary, I think the segfault occurred before the pgpool main
>> process finished initialization.
>>
>> The immediate cause of the segfault was here: (src/auth/pool_auth.c: 439)
>>
>>         protoMajor = MAIN_CONNECTION(cp)->sp->major;
>>
>> MAIN_CONNECTION(cp) is a macro defined as:
>>
>> cp->slots[MAIN_NODE_ID]
>>
>> In this case MAIN_NODE_ID resolves to my_main_node_id, a local
>> variable in the pgpool child process (pid 10) copied from
>> Req_info->main_node_id, which represents the first live backend
>> node. It was set to 0 at 01:04:06.
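>> To illustrate why this dereference can crash, here is a minimal,
>> self-contained sketch. The struct names are simplified stand-ins for
>> pgpool's real structures (which live in src/include/pool.h), and the
>> guarded helper is illustrative, not pgpool code:

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BACKENDS 3

/* Hypothetical, simplified stand-ins for pgpool's structures
 * (the real definitions live in src/include/pool.h). */
typedef struct { int major; } StartupPacket;
typedef struct { StartupPacket *sp; } ConnSlot;
typedef struct { ConnSlot *slots[NUM_BACKENDS]; } PoolConnection;

/* The crash pattern: MAIN_CONNECTION(cp) expands here to
 * cp->slots[my_main_node_id]. If that slot was never connected it is
 * NULL, and dereferencing ->sp->major segfaults. A guarded lookup
 * makes the failure mode explicit instead. */
ConnSlot *main_connection_checked(PoolConnection *cp, int my_main_node_id)
{
    if (my_main_node_id < 0 || my_main_node_id >= NUM_BACKENDS)
        return NULL;
    return cp->slots[my_main_node_id];  /* may be NULL: caller must check */
}
```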
>>
>> > jul 20 01:04:06 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:04:06: pid 10: DEBUG:  initializing backend status
>>
>> Note that local backend statuses (private_backend_status) were also
>> set at the same time. Because pgpool_status did not exist,
>>
>> > jul 20 01:04:00 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:04:00: pid 1: LOG:  Backend status file
>> /var/run/postgresql/pgpool_status does not exist
>>
>> the initial status of each backend remained CON_CONNECT_WAIT (this is
>> set by BackendPortAssignFunc() in pool_config_variable.c).
>>
>> Since at this point all backend global status were CON_CONNECT_WAIT,
>> Req_info->main_node_id and my_main_node_id were both set to 0.
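>> The selection of the main node can be sketched like this (a
>> simplified illustration, not pgpool's exact code; the enum follows
>> the BACKEND_STATUS order, so CON_DOWN is 3, which matches the
>> "global backend_status = 3" lines in the log below):

```c
#include <assert.h>

#define NUM_BACKENDS 3

/* Backend status values in pgpool's BACKEND_STATUS order. */
enum { CON_UNUSED, CON_CONNECT_WAIT, CON_UP, CON_DOWN };

/* Simplified sketch of how the main node id is derived:
 * the lowest-numbered backend that is waiting or up. */
int first_live_node(const int status[NUM_BACKENDS])
{
    for (int i = 0; i < NUM_BACKENDS; i++)
        if (status[i] == CON_CONNECT_WAIT || status[i] == CON_UP)
            return i;
    return -1;  /* no live backend */
}
```

>> With all three nodes in CON_CONNECT_WAIT this yields 0; once nodes
>> 0 and 1 go down it yields 2, which is the transition described here.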
>>
>> At 01:04:08 the global backend status was changed.  The backend 0 and
>> 1 statuses were set to down because the leader watchdog said so (see
>> sync_backend_from_watchdog()).
>>
>> > jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:04:08: pid 1: LOG:  leader watchdog node "172.29.30.1:5432 Linux
>> 1688c0b0b95d" returned status for 3 backend nodes
>> > jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:04:08: pid 1: DEBUG:  primary node on leader watchdog node "
>> 172.29.30.1:5432 Linux 1688c0b0b95d" is 0
>> > jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:04:08: pid 1: LOG:  backend:0 is set to down status
>> > jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:04:08: pid 1: DETAIL:  backend:0 is DOWN on cluster leader "
>> 172.29.30.1:5432 Linux 1688c0b0b95d"
>> > jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:04:08: pid 1: LOG:  backend:1 is set to down status
>> > jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:04:08: pid 1: DETAIL:  backend:1 is DOWN on cluster leader "
>> 172.29.30.1:5432 Linux 1688c0b0b95d"
>>
>> At the same time Req_info->main_node_id was set to 2, since node 2
>> was the only live node. However, my_backend_status and
>> my_main_node_id remained the same; they were supposed to be updated
>> in new_connection().
>>
>> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:10:27: pid 10: DETAIL:  connecting 0 backend
>> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:10:27: pid 10: DEBUG:  creating new connection to backend
>> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:10:27: pid 10: DETAIL:  skipping backend slot 0 because global
>> backend_status = 3
>> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:10:27: pid 10: DEBUG:  creating new connection to backend
>> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:10:27: pid 10: DETAIL:  connecting 1 backend
>> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:10:27: pid 10: DEBUG:  creating new connection to backend
>> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:10:27: pid 10: DETAIL:  skipping backend slot 1 because global
>> backend_status = 3
>> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20
>> 01:10:27: pid 10: DEBUG:  creating new connection to backend
>>
>> my_backend_status was synced with the global backend status in
>> new_connection() around lines 906 to 922. So far so good. However,
>> my_main_node_id was not updated; I think it should have been updated
>> there as well. As a result,
>>
>>         protoMajor = MAIN_CONNECTION(cp)->sp->major;
>>
>> went to:
>>
>>         protoMajor = cp->slots[0]->sp->major;
>>
>> Since cp->slots[0] is NULL, the segfault occurred.
>> Attached is a fix for this.
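>> The idea can be sketched as follows (a minimal illustration of the
>> approach, not the actual attached patch; names mimic pgpool's but the
>> code is simplified): whenever the local statuses are synced from the
>> global ones, recompute my_main_node_id so that MAIN_CONNECTION()
>> never indexes a NULL slot.

```c
#include <assert.h>

#define NUM_BACKENDS 3
enum { CON_UNUSED, CON_CONNECT_WAIT, CON_UP, CON_DOWN };

/* Process-local state, as kept in each child process (illustrative). */
static int my_backend_status[NUM_BACKENDS] =
    { CON_CONNECT_WAIT, CON_CONNECT_WAIT, CON_CONNECT_WAIT };
static int my_main_node_id = 0;

/* Recompute the local main node id from the local statuses. */
static void update_my_main_node_id(void)
{
    my_main_node_id = -1;
    for (int i = 0; i < NUM_BACKENDS; i++)
        if (my_backend_status[i] == CON_CONNECT_WAIT ||
            my_backend_status[i] == CON_UP)
        {
            my_main_node_id = i;
            break;
        }
}

/* When new_connection() syncs local statuses from the global ones,
 * also refresh my_main_node_id -- the step that was missing. */
void sync_all_slots(const int global_status[NUM_BACKENDS])
{
    for (int i = 0; i < NUM_BACKENDS; i++)
        if (global_status[i] == CON_DOWN)
            my_backend_status[i] = CON_DOWN;
    update_my_main_node_id();
}
```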
>>
>> Best regards,
>>
>> [1] Actually the implementation is that simple but for now this is
>> enough explanation...
>>
>> --
>> Tatsuo Ishii
>> SRA OSS LLC
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>> > Hi,
>> >
>> > It's been a while without any crashes, but yesterday we spotted another
>> > one. I do think I've seen this one before. The backtrace is:
>> > #0  pool_do_auth (frontend=0x558c1ec0f7c8, cp=0x7ff3e72ced58) at
>> > auth/pool_auth.c:349
>> > #1  0x0000558c1de03ea1 in connect_backend (sp=0x558c1ec11718,
>> > frontend=0x558c1ec0f7c8) at protocol/child.c:1102
>> > #2  0x0000558c1de05ff7 in get_backend_connection
>> (frontend=0x558c1ec0f7c8)
>> > at protocol/child.c:2103
>> > #3  0x0000558c1de01fd5 in do_child (fds=0x558c1ebc5380) at
>> > protocol/child.c:416
>> > #4  0x0000558c1ddc7a4c in fork_a_child (fds=0x558c1ebc5380, id=0) at
>> > main/pgpool_main.c:863
>> > #5  0x0000558c1ddc6e30 in PgpoolMain (discard_status=0 '\000',
>> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>> > #6  0x0000558c1ddc49e6 in main (argc=2, argv=0x7fff25dfd378) at
>> > main/main.c:365
>> >
>> > I've attached the logs, the coredump and the binary. I won't be able to
>> > respond for the upcoming 3 weeks, due to a holiday. I did save all logs,
>> so
>> > if you do need more information, I can provide it in 3 weeks. There were
>> > multiple processes that crashed, the backtraces are identical. The
>> coredump
>> > is for pid 5590, which crashed at 01:10:27.
>> >
>> > Best regards,
>> > Emond
>>

