[pgpool-general: 9056] Segmentation after switchover

Sat Mar 30 06:07:47 JST 2024

Hi,

We've been observing some strange status reports after a controlled
switchover (needed for a reboot) in our automated tests. In this scenario,
we have 3 nodes, all running pgpool 4.5.0 and postgresql 12.16. Node 1 is
the watchdog leader and also runs the primary database. Nodes 2 and 3
reboot first; no special care is required because both are standbys.
Finally, node 1 needs to reboot. We first force pgpool to drop its leader
status by restarting it, and then stop and detach the database to force a
failover. Next, we bring the database back up and configure it to follow
the new primary. At this point, we reboot node 1, and this is where
something goes wrong and pgpool processes segfault throughout the cluster.

I've collected several core dumps and identified 2 different backtraces.
The first is the most common and occurs on node 2 (the new primary, standby
watchdog):
#0  0x0000559e25313126 in get_query_result (slots=0x7fff0ebcff50, backend_id=0, query=0x559e253f63b0 "SELECT pg_catalog.current_setting('server_version_num')", res=0x7fff0ebcf778) at streaming_replication/pool_worker_child.c:682
#1  0x0000559e252a6ca3 in get_server_version (slots=0x7fff0ebcff50, node_id=0) at main/pgpool_main.c:3878
#2  0x0000559e252a2bd6 in verify_backend_node_status (slots=0x7fff0ebcff50) at main/pgpool_main.c:2574
#3  0x0000559e252a3904 in find_primary_node () at main/pgpool_main.c:2791
#4  0x0000559e252a3e04 in find_primary_node_repeatedly () at main/pgpool_main.c:2892
#5  0x0000559e252a8a71 in determine_new_primary_node (failover_context=0x7fff0ebd0420, node_id=0) at main/pgpool_main.c:4510
#6  0x0000559e252a0eaf in failover () at main/pgpool_main.c:1719
#7  0x0000559e252a04f1 in sigusr1_interrupt_processor () at main/pgpool_main.c:1507
#8  0x0000559e252a9df0 in check_requests () at main/pgpool_main.c:4930
#9  0x0000559e2529e18e in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
#10 0x0000559e2529b9b2 in main (argc=2, argv=0x7fff0ebdd5c8) at main/main.c:365

The second is a bit less common and occurs on node 3 (standby database, new
watchdog leader). It always appears together with the first, about 20
seconds later:
#0  pfree (pointer=0x55b7fe6ef4f0) at ../../src/utils/mmgr/mcxt.c:956
#1  0x000055b7fe09944b in free_persistent_db_connection_memory (cp=0x55b7fe6ef4c0) at protocol/pool_pg_utils.c:231
#2  0x000055b7fe0992b6 in make_persistent_db_connection (db_node_id=0, hostname=0x7fb6f6346008 "172.29.30.1", port=5432, dbname=0x55b7fe1b408f "postgres", user=0x55b7fe724268 "keyhub", password=0x55b7fe6ef408 "jCyuMFIENk1DU8-VWf2vFMx_DeFv25zTCG_0ZroG", retry=0 '\000') at protocol/pool_pg_utils.c:164
#3  0x000055b7fe0993a1 in make_persistent_db_connection_noerror (db_node_id=0, hostname=0x7fb6f6346008 "172.29.30.1", port=5432, dbname=0x55b7fe1b408f "postgres", user=0x55b7fe724268 "keyhub", password=0x55b7fe6ef408 "jCyuMFIENk1DU8-VWf2vFMx_DeFv25zTCG_0ZroG", retry=0 '\000') at protocol/pool_pg_utils.c:185
#4  0x000055b7fe068397 in establish_persistent_connection (node=0) at main/health_check.c:365
#5  0x000055b7fe067b79 in do_health_check_child (node_id=0x7ffc6ef09158) at main/health_check.c:199
#6  0x000055b7fe05ba04 in worker_fork_a_child (type=PT_HEALTH_CHECK, func=0x55b7fe067798 <do_health_check_child>, params=0x7ffc6ef09158) at main/pgpool_main.c:912
#7  0x000055b7fe05b0e2 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:625
#8  0x000055b7fe0589b2 in main (argc=2, argv=0x7ffc6ef153a8) at main/main.c:365

After this segfault, the status of node 1 (backend 0) seems to be corrupt:
it is reported as 'Status Name=down' and 'Backend Status Name=up', and it
does not recover from this. I've attached core dumps for both crashes, the
reported status, and the pgpool logs for nodes 1, 2 and 3. Note that the
pgpool logs for nodes 1 and 3 report a segfault, while the core dumps are
from nodes 2 and 3. I do not have a core dump for node 1, nor any kernel
logging about a protection fault on that node.
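
For anyone who wants to catch this state automatically in a test run, here
is a minimal sketch of a check for the mismatch described above. The two
field names come from the status report quoted in this message; the line
layout (one `key : value` or `key=value` pair per line) is an assumption
about the status output, and the function names are made up for
illustration, so adjust them to whatever your status dump really looks like.

```python
import re


def parse_node_status(text):
    """Collect 'key : value' or 'key=value' pairs from one node's status report.

    The one-pair-per-line layout is an assumption, not a documented format.
    """
    fields = {}
    for line in text.splitlines():
        # Key is everything up to the first ':' or '='; value is the rest.
        match = re.match(r"\s*([^:=]+?)\s*[:=]\s*(.*?)\s*$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def is_stuck_down(fields):
    """True for the state reported here: pgpool says down, backend says up."""
    return (
        fields.get("Status Name") == "down"
        and fields.get("Backend Status Name") == "up"
    )
```

Feeding the status text of each backend through `parse_node_status` and then
`is_stuck_down` flags only the exact combination seen after the segfault;
other combinations are deliberately not treated as errors.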

PS. These logs and core dumps all come from tests that use randomly
generated passwords and secrets, so the credentials in them have no value
to us.

Best regards,
Emond Papegaaij
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node2-core.pgpool.208.25a5121bed0342c596c010f265e4e113.6161.1711741294000000.gz
Type: application/gzip
Size: 2489564 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240329/4e6a7226/attachment.gz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node3-core.pgpool.208.5714dc7c3cc14f7091b93b476e7c948b.6388.1711741315000000.gz
Type: application/gzip
Size: 2488005 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240329/4e6a7226/attachment-0001.gz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node2-pgpool.log
Type: text/x-log
Size: 122194 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240329/4e6a7226/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node3-pgpool.log
Type: text/x-log
Size: 91612 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240329/4e6a7226/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node1-pgpool.log
Type: text/x-log
Size: 256218 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240329/4e6a7226/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpoolstatus-aftersegfault.log
Type: text/x-log
Size: 2333 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240329/4e6a7226/attachment-0003.bin>