<div dir="auto"><div>Hi, <div dir="auto"><br></div><div dir="auto">Thanks for the quick fix! I'm replying from my holiday address, so I'll keep it short. I've asked a coworker to add the patch to our build. He will monitor the build. </div><div dir="auto"><br></div><div dir="auto">Best regards, </div><div dir="auto">Emond </div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Op ma 22 jul 2024 09:12 schreef Tatsuo Ishii <<a href="mailto:ishii@postgresql.org">ishii@postgresql.org</a>>:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Emond,<br>

<br>

Thanks for the report. I have looked into the log file and ran gdb.<br>

<br>

> multiple processes that crashed, the backtraces are identical. The coredump<br>

> is for pid 5590, which crashed at 01:10:27.<br>

<br>

I think the pid is 10, rather than 5590.<br>

<br>

In summary. I think the segfault occured before the pgpool main<br>

process finished the initialization process.<br>

<br>

The immediate cause of the segfault was here: (src/auth/pool_auth.c: 439)<br>

<br>

        protoMajor = MAIN_CONNECTION(cp)->sp->major;<br>

<br>

MAIN_CONNECTION(cp) is a macro defined as:<br>

<br>

cp->slots[MAIN_NODE_ID]<br>

<br>

in this case. my_main_node_id is a local variable in the pgpool child<br>

process (pid 10) copied from Req_info->main_node_id, which represents<br>

the first live backend node. It was set to 0 at 01:04:06.<br>

<br>

> jul 20 01:04:06 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:06: pid 10: DEBUG:  initializing backend status<br>

<br>

Note that local backend statuses (private_backend_status) were also<br>

set at the same time. Because pgpool_status did not exist,<br>

<br>

> jul 20 01:04:00 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:00: pid 1: LOG:  Backend status file /var/run/postgresql/pgpool_status does not exist<br>

<br>

the initial status of each backend remained CON_CONNECT_WAIT (this is<br>

set by BackendPortAssignFunc() in pool_config_variable.c).<br>

<br>

Since at this point all backend global status were CON_CONNECT_WAIT,<br>

Req_info->main_node_id and my_main_node_id were both set to 0.<br>

<br>

At 01:04:08 the global backend status was changed.  The backend 0 and<br>

1 statuses were set to down because the leader watchdog said so (see<br>

sync_backend_from_watchdog()).<br>

<br>

> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: LOG:  leader watchdog node "<a href="http://172.29.30.1:5432" rel="noreferrer noreferrer" target="_blank">172.29.30.1:5432</a> Linux 1688c0b0b95d" returned status for 3 backend nodes<br>

> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: DEBUG:  primary node on leader watchdog node "<a href="http://172.29.30.1:5432" rel="noreferrer noreferrer" target="_blank">172.29.30.1:5432</a> Linux 1688c0b0b95d" is 0<br>

> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: LOG:  backend:0 is set to down status<br>

> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: DETAIL:  backend:0 is DOWN on cluster leader "<a href="http://172.29.30.1:5432" rel="noreferrer noreferrer" target="_blank">172.29.30.1:5432</a> Linux 1688c0b0b95d"<br>

> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: LOG:  backend:1 is set to down status<br>

> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: DETAIL:  backend:1 is DOWN on cluster leader "<a href="http://172.29.30.1:5432" rel="noreferrer noreferrer" target="_blank">172.29.30.1:5432</a> Linux 1688c0b0b95d"<br>

<br>

At the same time Req_info->main_node_id was set to 2 since node 2 is<br>

the only live node. However, my_backend_status and my_main_node_id<br>

reamined the same. This was supposed to be set in new_connection().<br>

<br>

> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL:  connecting 0 backend<br>

> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG:  creating new connection to backend<br>

> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL:  skipping backend slot 0 because global backend_status = 3<br>

> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG:  creating new connection to backend<br>

> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL:  connecting 1 backend<br>

> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG:  creating new connecti> on to backend<br>

> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL:  skipping backend slot 1 because global backend_status = 3<br>

> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG:  creating new connection to backend<br>

<br>

my_backend_status was synced with the global backend status in<br>

new_connection() around line 906 to 922. So far so good. However<br>

my_backend_status was not updated. I think it should have been updated<br>

there as well. As a result, <br>

<br>

        protoMajor = MAIN_CONNECTION(cp)->sp->major;<br>

<br>

went to:<br>

<br>

        protoMajor = cp->slots[0]->sp->major;<br>

<br>

Since cp->slots[0] is NULL, the segfault occured.<br>

Attached is a fix for this.<br>

<br>

Best reagards,<br>

<br>

[1] Actually the implementation is that simple but for now this is<br>

enough explanation...<br>

<br>

--<br>

Tatsuo Ishii<br>

SRA OSS LLC<br>

English: <a href="http://www.sraoss.co.jp/index_en/" rel="noreferrer noreferrer" target="_blank">http://www.sraoss.co.jp/index_en/</a><br>

Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>

<br>

> Hi,<br>

> <br>

> It's been a while without any crashes, but yesterday we spotted another<br>

> one. I do think I've seen this one before. The backtrace is:<br>

> #0  pool_do_auth (frontend=0x558c1ec0f7c8, cp=0x7ff3e72ced58) at<br>

> auth/pool_auth.c:349<br>

> #1  0x0000558c1de03ea1 in connect_backend (sp=0x558c1ec11718,<br>

> frontend=0x558c1ec0f7c8) at protocol/child.c:1102<br>

> #2  0x0000558c1de05ff7 in get_backend_connection (frontend=0x558c1ec0f7c8)<br>

> at protocol/child.c:2103<br>

> #3  0x0000558c1de01fd5 in do_child (fds=0x558c1ebc5380) at<br>

> protocol/child.c:416<br>

> #4  0x0000558c1ddc7a4c in fork_a_child (fds=0x558c1ebc5380, id=0) at<br>

> main/pgpool_main.c:863<br>

> #5  0x0000558c1ddc6e30 in PgpoolMain (discard_status=0 '\000',<br>

> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561<br>

> #6  0x0000558c1ddc49e6 in main (argc=2, argv=0x7fff25dfd378) at<br>

> main/main.c:365<br>

> <br>

> I've attached the logs, the coredump and the binary. I won't be able to<br>

> respond for the upcoming 3 weeks, due to a holiday. I did save all logs, so<br>

> if you do need more information, I can provide it in 3 weeks. There were<br>

> multiple processes that crashed, the backtraces are identical. The coredump<br>

> is for pid 5590, which crashed at 01:10:27.<br>

> <br>

> Best regards,<br>

> Emond<br>

</blockquote></div></div></div>