<div dir="ltr">Hi,<div><br></div><div>We've been observing some strange status reports after a controlled switchover (needed for a reboot) in our automated tests. In this scenario, we have 3 nodes, all running pgpool 4.5.0 and PostgreSQL 12.16. Node 1 is the watchdog leader and also runs the primary database. Nodes 2 and 3 reboot first; no special care is required because both are standbys. Finally, node 1 needs to reboot. We first force pgpool to drop its leader status by restarting it, and then stop and detach the database to force a failover. Next, we bring the database back up so that we can configure it to follow the new primary. At this point, we reboot node 1, and this is where something goes wrong: pgpool processes segfault throughout the cluster.</div><div><br></div><div>I've collected several core dumps and identified 2 different backtraces. The first is the most common and occurs on node 2 (the new primary, standby watchdog):</div><div>#0  0x0000559e25313126 in get_query_result (slots=0x7fff0ebcff50, backend_id=0, query=0x559e253f63b0 "SELECT pg_catalog.current_setting('server_version_num')", res=0x7fff0ebcf778) at streaming_replication/pool_worker_child.c:682<br>#1  0x0000559e252a6ca3 in get_server_version (slots=0x7fff0ebcff50, node_id=0) at main/pgpool_main.c:3878<br>#2  0x0000559e252a2bd6 in verify_backend_node_status (slots=0x7fff0ebcff50) at main/pgpool_main.c:2574<br>#3  0x0000559e252a3904 in find_primary_node () at main/pgpool_main.c:2791<br>#4  0x0000559e252a3e04 in find_primary_node_repeatedly () at main/pgpool_main.c:2892<br>#5  0x0000559e252a8a71 in determine_new_primary_node (failover_context=0x7fff0ebd0420, node_id=0) at main/pgpool_main.c:4510<br>#6  0x0000559e252a0eaf in failover () at main/pgpool_main.c:1719<br>#7  0x0000559e252a04f1 in sigusr1_interrupt_processor () at main/pgpool_main.c:1507<br>#8  0x0000559e252a9df0 in check_requests () at main/pgpool_main.c:4930<br>#9  0x0000559e2529e18e in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 
'\000') at main/pgpool_main.c:649<br>#10 0x0000559e2529b9b2 in main (argc=2, argv=0x7fff0ebdd5c8) at main/main.c:365<br></div><div><br></div><div>The second is a bit less common and occurs on node 3 (standby database, new watchdog leader). It always follows the first crash, about 20 seconds later:</div><div>#0  pfree (pointer=0x55b7fe6ef4f0) at ../../src/utils/mmgr/mcxt.c:956<br>#1  0x000055b7fe09944b in free_persistent_db_connection_memory (cp=0x55b7fe6ef4c0) at protocol/pool_pg_utils.c:231<br>#2  0x000055b7fe0992b6 in make_persistent_db_connection (db_node_id=0, hostname=0x7fb6f6346008 "172.29.30.1", port=5432, dbname=0x55b7fe1b408f "postgres", user=0x55b7fe724268 "keyhub", password=0x55b7fe6ef408 "jCyuMFIENk1DU8-VWf2vFMx_DeFv25zTCG_0ZroG", retry=0 '\000')<br>    at protocol/pool_pg_utils.c:164<br>#3  0x000055b7fe0993a1 in make_persistent_db_connection_noerror (db_node_id=0, hostname=0x7fb6f6346008 "172.29.30.1", port=5432, dbname=0x55b7fe1b408f "postgres", user=0x55b7fe724268 "keyhub", password=0x55b7fe6ef408 "jCyuMFIENk1DU8-VWf2vFMx_DeFv25zTCG_0ZroG", <br>    retry=0 '\000') at protocol/pool_pg_utils.c:185<br>#4  0x000055b7fe068397 in establish_persistent_connection (node=0) at main/health_check.c:365<br>#5  0x000055b7fe067b79 in do_health_check_child (node_id=0x7ffc6ef09158) at main/health_check.c:199<br>#6  0x000055b7fe05ba04 in worker_fork_a_child (type=PT_HEALTH_CHECK, func=0x55b7fe067798 &lt;do_health_check_child&gt;, params=0x7ffc6ef09158) at main/pgpool_main.c:912<br>#7  0x000055b7fe05b0e2 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:625<br>#8  0x000055b7fe0589b2 in main (argc=2, argv=0x7ffc6ef153a8) at main/main.c:365<br></div><div><br></div><div>After this segfault, the status of node 1 (backend 0) appears to be corrupt: it is reported with 'Status Name=down' but 'Backend Status Name=up', and it does not recover from this. 
I've attached core dumps for both crashes, the reported status, and the pgpool logs for nodes 1, 2 and 3. Note that the pgpool logs for nodes 1 and 3 report a segfault, while the core dumps are from nodes 2 and 3. I do not have a core dump for node 1, nor is there any kernel log entry about a protection fault on that node.</div><div><br></div><div>PS. These logs and core dumps are all from tests, using randomly generated passwords and secrets, so these have no value to us.</div><div><br></div><div>Best regards,</div><div>Emond Papegaaij</div></div>