<div dir="auto"><div>Hi, <div dir="auto"><br></div><div dir="auto">Thanks for the quick fix! I'm replying from my holiday address, so I'll keep it short. I've asked a coworker to add the patch to our build. He will monitor the build. </div><div dir="auto"><br></div><div dir="auto">Best regards, </div><div dir="auto">Emond </div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Op ma 22 jul 2024 09:12 schreef Tatsuo Ishii <<a href="mailto:ishii@postgresql.org">ishii@postgresql.org</a>>:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Emond,<br>

Thanks for the report. I have looked into the log file and run gdb.

> multiple processes that crashed, the backtraces are identical. The coredump
> is for pid 5590, which crashed at 01:10:27.

I think the pid is 10, rather than 5590.

In summary, I think the segfault occurred before the pgpool main
process finished its initialization.

The immediate cause of the segfault was here (src/auth/pool_auth.c:439):

protoMajor = MAIN_CONNECTION(cp)->sp->major;

MAIN_CONNECTION(cp) is a macro defined as:

cp->slots[MAIN_NODE_ID]

MAIN_NODE_ID is my_main_node_id in this case [1]. my_main_node_id is a
local variable in the pgpool child process (pid 10), copied from
Req_info->main_node_id, which represents the first live backend node.
It was set to 0 at 01:04:06.
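
To make the chain of macros concrete, here is a minimal, self-contained
sketch (simplified stand-in types and macro definitions, not the actual
pgpool-II source) of why dereferencing MAIN_CONNECTION(cp) crashes when
my_main_node_id still points at a slot that was never connected:

/* Simplified sketch only; the real types and macros live in pgpool-II headers. */
#include <stdio.h>

#define MAX_NUM_BACKENDS 3

typedef struct { int major; } StartupPacket;            /* stand-in for sp     */
typedef struct { StartupPacket *sp; } ConnectionSlot;   /* stand-in for a slot */
typedef struct { ConnectionSlot *slots[MAX_NUM_BACKENDS]; } BackendConnection;

static int my_main_node_id = 0;       /* child-local copy; still 0 in this crash */

#define MAIN_NODE_ID        (my_main_node_id)
#define MAIN_CONNECTION(cp) ((cp)->slots[MAIN_NODE_ID])

int main(void)
{
    StartupPacket     sp2   = { 3 };
    ConnectionSlot    slot2 = { &sp2 };
    BackendConnection cp    = { { NULL, NULL, NULL } };

    /* new_connection() skipped slots 0 and 1 (their global status is down),
     * so only slot 2 is populated. */
    cp.slots[2] = &slot2;

    if (MAIN_CONNECTION(&cp) == NULL)
        printf("slots[%d] is NULL; MAIN_CONNECTION(cp)->sp->major would segfault\n",
               MAIN_NODE_ID);
    return 0;
}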

Note that the local backend statuses (private_backend_status) were also
set at the same time. Because pgpool_status did not exist,

> jul 20 01:04:00 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:00: pid 1: LOG: Backend status file /var/run/postgresql/pgpool_status does not exist

the initial status of each backend remained CON_CONNECT_WAIT (this is
set by BackendPortAssignFunc() in pool_config_variable.c).

Since at this point all global backend statuses were CON_CONNECT_WAIT,
Req_info->main_node_id and my_main_node_id were both set to 0.
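
For illustration only (this is not the actual pgpool-II code), the main node
id is simply the lowest-numbered backend whose status still counts as live,
which is why it starts at 0 here and moves to 2 once nodes 0 and 1 are marked
down:

/* Illustrative sketch only, not the pgpool-II implementation.
 * The enum order matches the values seen in the log below:
 * CON_DOWN == 3, i.e. "global backend_status = 3". */
typedef enum { CON_UNUSED, CON_CONNECT_WAIT, CON_UP, CON_DOWN } BACKEND_STATUS;

/* Return the first backend that is live (waiting for connection or up),
 * or -1 if none is. All nodes CON_CONNECT_WAIT -> returns 0;
 * nodes 0 and 1 CON_DOWN -> returns 2. */
static int first_live_node(const BACKEND_STATUS *status, int num_backends)
{
    for (int i = 0; i < num_backends; i++)
        if (status[i] == CON_CONNECT_WAIT || status[i] == CON_UP)
            return i;
    return -1;
}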

At 01:04:08 the global backend statuses changed: backends 0 and 1 were
set to down because the leader watchdog said so (see
sync_backend_from_watchdog()).

> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: LOG: leader watchdog node "172.29.30.1:5432 Linux 1688c0b0b95d" returned status for 3 backend nodes
> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: DEBUG: primary node on leader watchdog node "172.29.30.1:5432 Linux 1688c0b0b95d" is 0
> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: LOG: backend:0 is set to down status
> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: DETAIL: backend:0 is DOWN on cluster leader "172.29.30.1:5432 Linux 1688c0b0b95d"
> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: LOG: backend:1 is set to down status
> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: DETAIL: backend:1 is DOWN on cluster leader "172.29.30.1:5432 Linux 1688c0b0b95d"

At the same time Req_info->main_node_id was set to 2, since node 2 is
the only live node. However, my_backend_status and my_main_node_id
remained the same; they were supposed to be updated in new_connection().

> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL: connecting 0 backend
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG: creating new connection to backend
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL: skipping backend slot 0 because global backend_status = 3
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG: creating new connection to backend
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL: connecting 1 backend
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG: creating new connection to backend
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL: skipping backend slot 1 because global backend_status = 3
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG: creating new connection to backend

my_backend_status was synced with the global backend status in
new_connection(), around lines 906 to 922. So far so good. However,
my_main_node_id was not updated; I think it should have been updated
there as well. As a result,

protoMajor = MAIN_CONNECTION(cp)->sp->major;

effectively became:

protoMajor = cp->slots[0]->sp->major;

Since cp->slots[0] is NULL, the segfault occurred.
Attached is a fix for this.
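
I have not reproduced the attached patch here; as a rough sketch of the idea
only, continuing the hypothetical first_live_node() sketch above (array and
constant names are simplified, not pgpool's real ones), the point is to
refresh the child-local main node id at the same place the child-local
status is synced:

/* Sketch of the idea only, not the attached patch. */
static BACKEND_STATUS my_backend_status[MAX_NUM_BACKENDS];

static void sync_local_state_from_global(const BACKEND_STATUS *global_status)
{
    /* existing behaviour: copy the global status into the child-local copy */
    for (int i = 0; i < MAX_NUM_BACKENDS; i++)
        my_backend_status[i] = global_status[i];

    /* missing piece: recompute the child-local main node id from the freshly
     * synced statuses, so MAIN_CONNECTION() can no longer select a slot that
     * new_connection() skipped because its backend is down */
    my_main_node_id = first_live_node(my_backend_status, MAX_NUM_BACKENDS);
}

With something like that in place, once nodes 0 and 1 go down the child
would compute my_main_node_id = 2, and MAIN_CONNECTION(cp) would point at
the one slot that was actually connected.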

Best regards,

[1] Actually the implementation is not quite that simple, but for now this
is enough explanation...

--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp

> Hi,
> 
> It's been a while without any crashes, but yesterday we spotted another
> one. I do think I've seen this one before. The backtrace is:
> #0 pool_do_auth (frontend=0x558c1ec0f7c8, cp=0x7ff3e72ced58) at
> auth/pool_auth.c:349
> #1 0x0000558c1de03ea1 in connect_backend (sp=0x558c1ec11718,
> frontend=0x558c1ec0f7c8) at protocol/child.c:1102
> #2 0x0000558c1de05ff7 in get_backend_connection (frontend=0x558c1ec0f7c8)
> at protocol/child.c:2103
> #3 0x0000558c1de01fd5 in do_child (fds=0x558c1ebc5380) at
> protocol/child.c:416
> #4 0x0000558c1ddc7a4c in fork_a_child (fds=0x558c1ebc5380, id=0) at
> main/pgpool_main.c:863
> #5 0x0000558c1ddc6e30 in PgpoolMain (discard_status=0 '\000',
> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
> #6 0x0000558c1ddc49e6 in main (argc=2, argv=0x7fff25dfd378) at
> main/main.c:365
> 
> I've attached the logs, the coredump and the binary. I won't be able to
> respond for the upcoming 3 weeks, due to a holiday. I did save all logs, so
> if you do need more information, I can provide it in 3 weeks. There were
> multiple processes that crashed, the backtraces are identical. The coredump
> is for pid 5590, which crashed at 01:10:27.
> 
> Best regards,
> Emond