[pgpool-general: 9177] Re: Segmentation fault

Emond Papegaaij emond.papegaaij at gmail.com
Tue Jul 23 14:39:57 JST 2024


Hi,

Thanks for the quick fix! I'm replying from my holiday address, so I'll
keep it short. I've asked a coworker to add the patch to our build. He will
monitor the build.

Best regards,
Emond


On Mon, 22 Jul 2024 at 09:12, Tatsuo Ishii <ishii at postgresql.org> wrote:

> Hi Emond,
>
> Thanks for the report. I have looked into the log file and ran gdb.
>
> > multiple processes that crashed, the backtraces are identical. The
> > coredump is for pid 5590, which crashed at 01:10:27.
>
> I think the pid is 10, rather than 5590.
>
> In summary, I think the segfault occurred before the pgpool main
> process finished its initialization.
>
> The immediate cause of the segfault was here (src/auth/pool_auth.c:439):
>
>         protoMajor = MAIN_CONNECTION(cp)->sp->major;
>
> MAIN_CONNECTION(cp) is a macro defined as:
>
> cp->slots[MAIN_NODE_ID]
>
> in this case, where MAIN_NODE_ID resolves to the child-local
> my_main_node_id. my_main_node_id is a local variable in the pgpool
> child process (pid 10) copied from Req_info->main_node_id, which
> represents the first live backend node. It was set to 0 at 01:04:06.
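>
> For illustration only, a minimal sketch of the dereference chain
> (simplified; the real macro definitions in pgpool's headers are a bit
> more involved):
>
>         /* sketch, not the literal pgpool source */
>         #define MAIN_CONNECTION(p) ((p)->slots[MAIN_NODE_ID])
>
>         /* in this child process MAIN_NODE_ID resolves to the
>          * per-process my_main_node_id, so the crashing line is
>          * effectively: */
>         protoMajor = cp->slots[my_main_node_id]->sp->major;
>         /* SIGSEGV when slots[my_main_node_id] is NULL */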
>
> > jul 20 01:04:06 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:06: pid 10: DEBUG:  initializing backend status
>
> Note that local backend statuses (private_backend_status) were also
> set at the same time. Because pgpool_status did not exist,
>
> > jul 20 01:04:00 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:00: pid 1: LOG:  Backend status file /var/run/postgresql/pgpool_status does not exist
>
> the initial status of each backend remained CON_CONNECT_WAIT (this is
> set by BackendPortAssignFunc() in pool_config_variable.c).
>
> Since at this point all global backend statuses were CON_CONNECT_WAIT,
> Req_info->main_node_id and my_main_node_id were both set to 0.
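>
> To make that concrete, a hypothetical sketch of how the first live
> node would be chosen (first_live_node() is an illustrative name, not
> the actual pgpool function; NUM_BACKENDS, BACKEND_INFO() and the
> CON_* statuses are the usual pool.h definitions):
>
>         /* sketch: the main node id is essentially the lowest-numbered
>          * backend whose status counts as alive */
>         static int
>         first_live_node(void)
>         {
>             int i;
>
>             for (i = 0; i < NUM_BACKENDS; i++)
>             {
>                 if (BACKEND_INFO(i).backend_status == CON_CONNECT_WAIT ||
>                     BACKEND_INFO(i).backend_status == CON_UP)
>                     return i;
>             }
>             return -1;          /* no live backend */
>         }
>
> With every backend still at CON_CONNECT_WAIT, such a scan returns 0
> for both the shared and the per-process value.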
>
> At 01:04:08 the global backend statuses changed: backends 0 and 1
> were set to down because the leader watchdog reported them as down
> (see sync_backend_from_watchdog()).
>
> > jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: LOG:  leader watchdog node "172.29.30.1:5432 Linux 1688c0b0b95d" returned status for 3 backend nodes
> > jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: DEBUG:  primary node on leader watchdog node "172.29.30.1:5432 Linux 1688c0b0b95d" is 0
> > jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: LOG:  backend:0 is set to down status
> > jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: DETAIL:  backend:0 is DOWN on cluster leader "172.29.30.1:5432 Linux 1688c0b0b95d"
> > jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: LOG:  backend:1 is set to down status
> > jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: DETAIL:  backend:1 is DOWN on cluster leader "172.29.30.1:5432 Linux 1688c0b0b95d"
>
> At the same time Req_info->main_node_id was set to 2, since node 2 is
> the only live node. However, my_backend_status and my_main_node_id
> remained the same. These were supposed to be updated in
> new_connection().
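>
> Schematically, the state at this point looked like this (an
> illustration only, under the assumption that my_backend_status[i]
> points into the per-process private_backend_status array):
>
>         /* shared memory, already updated by sync_backend_from_watchdog() */
>         assert(Req_info->main_node_id == 2);    /* node 2 is the only live node */
>
>         /* per-process copies in child pid 10, never refreshed */
>         assert(my_main_node_id == 0);
>         assert(*my_backend_status[0] == CON_CONNECT_WAIT);
>
> Nothing refreshes the per-process copies until the child next runs
> new_connection().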
>
> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL:  connecting 0 backend
> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG:  creating new connection to backend
> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL:  skipping backend slot 0 because global backend_status = 3
> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG:  creating new connection to backend
> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL:  connecting 1 backend
> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG:  creating new connection to backend
> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL:  skipping backend slot 1 because global backend_status = 3
> > jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG:  creating new connection to backend
>
> my_backend_status was synced with the global backend status in
> new_connection() around lines 906 to 922. So far so good. However,
> my_main_node_id was not updated, and I think it should have been
> updated there as well. As a result,
>
>         protoMajor = MAIN_CONNECTION(cp)->sp->major;
>
> effectively became:
>
>         protoMajor = cp->slots[0]->sp->major;
>
> Since cp->slots[0] is NULL, the segfault occurred.
> Attached is a fix for this.
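>
> To give an idea of the shape of such a change (a hypothetical sketch
> only, not the attached patch itself): when new_connection() skips a
> slot because the global status is down (the log's "backend_status =
> 3", i.e. CON_DOWN), the per-process main node id should be refreshed
> as well:
>
>         /* sketch: inside new_connection()'s per-node loop, where the
>          * slot is skipped because the global status says node i is down */
>         if (BACKEND_INFO(i).backend_status == CON_DOWN)
>         {
>             *my_backend_status[i] = CON_DOWN;           /* existing sync */
>             my_main_node_id = Req_info->main_node_id;   /* the missing sync */
>             continue;
>         }
>
> With my_main_node_id kept in sync, MAIN_CONNECTION(cp) can no longer
> point at a slot that was skipped and left NULL.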
>
> Best regards,
>
> [1] Actually the implementation is that simple but for now this is
> enough explanation...
>
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese: http://www.sraoss.co.jp
>
> > Hi,
> >
> > It's been a while without any crashes, but yesterday we spotted another
> > one. I do think I've seen this one before. The backtrace is:
> > #0  pool_do_auth (frontend=0x558c1ec0f7c8, cp=0x7ff3e72ced58) at auth/pool_auth.c:349
> > #1  0x0000558c1de03ea1 in connect_backend (sp=0x558c1ec11718, frontend=0x558c1ec0f7c8) at protocol/child.c:1102
> > #2  0x0000558c1de05ff7 in get_backend_connection (frontend=0x558c1ec0f7c8) at protocol/child.c:2103
> > #3  0x0000558c1de01fd5 in do_child (fds=0x558c1ebc5380) at protocol/child.c:416
> > #4  0x0000558c1ddc7a4c in fork_a_child (fds=0x558c1ebc5380, id=0) at main/pgpool_main.c:863
> > #5  0x0000558c1ddc6e30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
> > #6  0x0000558c1ddc49e6 in main (argc=2, argv=0x7fff25dfd378) at main/main.c:365
> >
> > I've attached the logs, the coredump and the binary. I won't be able
> > to respond for the upcoming 3 weeks, due to a holiday. I did save all
> > logs, so if you do need more information, I can provide it in 3
> > weeks. There were multiple processes that crashed, the backtraces are
> > identical. The coredump is for pid 5590, which crashed at 01:10:27.
> >
> > Best regards,
> > Emond
>