[pgpool-general: 9176] Re: Segmentation fault

Tatsuo Ishii ishii at postgresql.org
Mon Jul 22 16:11:49 JST 2024


Hi Emond,

Thanks for the report. I have looked into the log file and run gdb.

> multiple processes that crashed, the backtraces are identical. The coredump
> is for pid 5590, which crashed at 01:10:27.

I think the pid is 10, rather than 5590.

In summary, I think the segfault occurred before the pgpool main
process finished its initialization.

The immediate cause of the segfault was here: (src/auth/pool_auth.c: 439)

	protoMajor = MAIN_CONNECTION(cp)->sp->major;

MAIN_CONNECTION(cp) is a macro defined as:

cp->slots[MAIN_NODE_ID]

in this case. my_main_node_id is a local variable in the pgpool child
process (pid 10) copied from Req_info->main_node_id, which represents
the first live backend node. It was set to 0 at 01:04:06.

> jul 20 01:04:06 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:06: pid 10: DEBUG:  initializing backend status

Note that local backend statuses (private_backend_status) were also
set at the same time. Because pgpool_status did not exist,

> jul 20 01:04:00 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:00: pid 1: LOG:  Backend status file /var/run/postgresql/pgpool_status does not exist

the initial status of each backend remained CON_CONNECT_WAIT (this is
set by BackendPortAssignFunc() in pool_config_variable.c).

Since at this point all global backend statuses were CON_CONNECT_WAIT,
Req_info->main_node_id and my_main_node_id were both set to 0.

At 01:04:08 the global backend status was changed. The statuses of
backends 0 and 1 were set to down because the leader watchdog said so
(see sync_backend_from_watchdog()).

> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: LOG:  leader watchdog node "172.29.30.1:5432 Linux 1688c0b0b95d" returned status for 3 backend nodes
> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: DEBUG:  primary node on leader watchdog node "172.29.30.1:5432 Linux 1688c0b0b95d" is 0
> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: LOG:  backend:0 is set to down status
> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: DETAIL:  backend:0 is DOWN on cluster leader "172.29.30.1:5432 Linux 1688c0b0b95d"
> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: LOG:  backend:1 is set to down status
> jul 20 01:04:08 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:04:08: pid 1: DETAIL:  backend:1 is DOWN on cluster leader "172.29.30.1:5432 Linux 1688c0b0b95d"

At the same time Req_info->main_node_id was set to 2, since node 2 was
the only live node. However, my_backend_status and my_main_node_id
remained the same. They were supposed to be updated in new_connection().

> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL:  connecting 0 backend
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG:  creating new connection to backend
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL:  skipping backend slot 0 because global backend_status = 3
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG:  creating new connection to backend
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL:  connecting 1 backend
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG:  creating new connection to backend
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DETAIL:  skipping backend slot 1 because global backend_status = 3
> jul 20 01:10:27 tkh-server2.keyhub-test-nw tkh-pgpool[1064]: 2024-07-20 01:10:27: pid 10: DEBUG:  creating new connection to backend

my_backend_status was synced with the global backend status in
new_connection() around lines 906 to 922. So far so good. However,
my_main_node_id was not updated. I think it should have been updated
there as well. As a result,

	protoMajor = MAIN_CONNECTION(cp)->sp->major;

went to:

	protoMajor = cp->slots[0]->sp->major;

Since cp->slots[0] is NULL, the segfault occurred.
Attached is a fix for this.

Best regards,

[1] Actually the implementation is that simple but for now this is
enough explanation...

--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Hi,
> 
> It's been a while without any crashes, but yesterday we spotted another
> one. I do think I've seen this one before. The backtrace is:
> #0  pool_do_auth (frontend=0x558c1ec0f7c8, cp=0x7ff3e72ced58) at
> auth/pool_auth.c:349
> #1  0x0000558c1de03ea1 in connect_backend (sp=0x558c1ec11718,
> frontend=0x558c1ec0f7c8) at protocol/child.c:1102
> #2  0x0000558c1de05ff7 in get_backend_connection (frontend=0x558c1ec0f7c8)
> at protocol/child.c:2103
> #3  0x0000558c1de01fd5 in do_child (fds=0x558c1ebc5380) at
> protocol/child.c:416
> #4  0x0000558c1ddc7a4c in fork_a_child (fds=0x558c1ebc5380, id=0) at
> main/pgpool_main.c:863
> #5  0x0000558c1ddc6e30 in PgpoolMain (discard_status=0 '\000',
> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
> #6  0x0000558c1ddc49e6 in main (argc=2, argv=0x7fff25dfd378) at
> main/main.c:365
> 
> I've attached the logs, the coredump and the binary. I won't be able to
> respond for the upcoming 3 weeks, due to a holiday. I did save all logs, so
> if you do need more information, I can provide it in 3 weeks. There were
> multiple processes that crashed, the backtraces are identical. The coredump
> is for pid 5590, which crashed at 01:10:27.
> 
> Best regards,
> Emond
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix-for-pgpool-general-9175.patch
Type: text/x-patch
Size: 450 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240722/e457d00c/attachment.bin>

