<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<div name="messageBodySection">
<div dir="auto">Hello<br />
<br />
I was performing some tests on my HA cluster, which has the following configuration: 1 primary and 2 standbys behind 1 pgpool, without watchdog.<br />
<br />
<strong>TEST 1 - Storage failure (by making remote storage inaccessible)</strong><br />
<br />
Action - With one client connected through pgpool, the instance’s storage disk was made inaccessible by blocking the connection to it at the firewall.<br />
Expected Behaviour - Pgpool’s health check should start failing and a failover should be triggered after 'health_check_max_retries' retries.<br />
Observed Behaviour - Pgpool’s health check didn’t detect anything wrong. `show pool_nodes` (handled by pgpool itself) kept working fine, but `\l` (a query that reaches PostgreSQL) got stuck forever, and ^C to cancel it didn’t work either. No new client was able to connect to pgpool.<br />
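<br />
(A quick way to narrow this down, I think, would be something like the check below - only a sketch, with the backend port from my setup and an arbitrary 10-second timeout; it connects straight to the backend, bypassing pgpool, to see whether the hang is in the backend itself or somewhere in pgpool.)<br />
<br />
BACKEND_HOST=dbod-ag-pg01.ch  # whichever backend had its storage blocked<br />
# run the same catalog query directly against the backend, with a hard timeout<br />
timeout 10 psql -h ${BACKEND_HOST} -p 6600 -U postgres -d postgres -c '\l' || echo "backend did not answer within 10s"<br />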
<br />
<br />
<strong>TEST 2 - Failover on failure of Online Recovery (or node incorrectly detected as invalid?)</strong><br />
<br />
Setup was running with 1 primary (Node 2) and 2 failed standbys (Node 0 & Node 1). I tried to restore one of the standbys by performing an Online Recovery. The recovery command waited much longer than usual and then failed, and at the same moment pgpool detected the only running node (the primary) as an invalid node and shut it down. From what I understand, there are two possible cases:</div>
<ul>
<li>The online recovery failed for some external reason, and that failure caused pgpool to perform the failover. This would be a problem, because a failed online recovery normally leaves the current primary unaffected, as it should.</li>
<li>Or the other way around: pgpool marking the current primary as an invalid node is what made the online recovery fail. In that case I would like to understand why pgpool would mark the only running instance as a false primary (see the sketch just after this list for how I understand that check).</li>
</ul>
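<div dir="auto">(For the second case, the sketch below is roughly how I understand the false-primary check and how I would verify the roles by hand - the loop and the 10-second timeouts are my own, the hostnames are the ones appearing in the logs below, and I have left out the node 1 host.)<br />
<br />
for HOST in dbod-ag-pg01.ch dbod-ag-pg03.ch; do<br />
echo "== ${HOST} =="<br />
# 'f' means the node considers itself a primary, 't' means it is a standby<br />
timeout 10 psql -h ${HOST} -p 6600 -U postgres -d postgres -Atc "SELECT pg_is_in_recovery();"<br />
# on a standby, this shows whether it is actually streaming from an upstream<br />
timeout 10 psql -h ${HOST} -p 6600 -U postgres -d postgres -Atc "SELECT status, conninfo FROM pg_stat_wal_receiver;"<br />
done<br /></div>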
<div dir="auto"><br />
I am more inclined to believe the first case, since the cluster was working fine until I ran the online recovery command. Here are the logs of the incident:<br />
<br />
<br />
2021-07-12 20:13:21: pid 205: LOG: starting recovering node 0<br />
2021-07-12 20:13:21: pid 205: LOG: executing recovery<br />
2021-07-12 20:13:21: pid 205: DETAIL: starting recovery command: "SELECT pgpool_recovery('recovery_1st_stage', 'dbod-ag-pg01.ch', '/ORA/dbs03/AG_PG01/data', '6600', 0, '6601')"<br />
2021-07-12 20:13:21: pid 205: LOG: executing recovery<br />
2021-07-12 20:13:21: pid 205: DETAIL: disabling statement_timeout<br />
2021-07-12 20:15:09: pid 126: LOG: get_query_result falied: status: -2<br />
2021-07-12 20:15:09: pid 126: CONTEXT: while checking replication time lag<br />
2021-07-12 20:15:09: pid 126: LOG: Replication of node:1 is behind 43088 bytes from the primary server (node:2)<br />
2021-07-12 20:15:09: pid 126: CONTEXT: while checking replication time lag<br />
2021-07-12 20:15:09: pid 126: LOG: pgpool_worker_child: invalid node found 2<br />
2021-07-12 20:15:09: pid 126: LOG: received degenerate backend request for node_id: 2 from pid [126]<br />
2021-07-12 20:15:09: pid 9: LOG: Pgpool-II parent process has received failover request<br />
2021-07-12 20:15:09: pid 9: LOG: starting degeneration. shutdown host dbod-ag-pg03.ch(6600)<br />
2021-07-12 20:15:09: pid 9: WARNING: All the DB nodes are in down status and skip writing status file.<br />
2021-07-12 20:15:09: pid 9: LOG: failover: no valid backend node found<br />
2021-07-12 20:15:09: pid 9: LOG: Restart all children<br />
2021-07-12 20:15:09: pid 9: LOG: execute command: /usr/local/bin/failover.sh 2 dbod-ag-pg03.ch 6600 /ORA/dbs03/AG_PG03/data -1 "" 2 2 "" "" dbod-ag-pg03.ch 6600<br />
2021-07-12 20:15:09: pid 165: LOG: frontend disconnection: session time: 0:27:55.239 user=postgres database=postgres host=dbod-hac-c02.ch port=41958<br />
2021-07-12 20:15:09: pid 9: LOG: find_primary_node_repeatedly: all of the backends are down. Giving up finding primary node<br />
2021-07-12 20:15:09: pid 9: LOG: failover: no follow backends are degenerated<br />
2021-07-12 20:15:09: pid 9: LOG: failover: set new primary node: -1<br />
2021-07-12 20:15:09: pid 9: LOG: failover done. shutdown host dbod-ag-pg03.ch(6600)<br />
2021-07-12 20:15:09: pid 126: LOG: worker process received restart request<br />
2021-07-12 20:15:10: pid 200: LOG: restart request received in pcp child process<br />
2021-07-12 20:18:05: pid 212: FATAL: pgpool is not accepting any new connections<br />
2021-07-12 20:18:05: pid 212: DETAIL: all backend nodes are down, pgpool requires at least one valid node<br />
2021-07-12 20:18:05: pid 212: HINT: repair the backend nodes and restart pgpool<br />
2021-07-12 20:18:05: pid 212: LOG: frontend disconnection: session time: 0:00:00.003 user= database= host=<br />
2021-07-12 20:23:25: pid 205: ERROR: executing recovery, execution of command failed at "1st stage"<br />
2021-07-12 20:23:25: pid 205: DETAIL: command:"recovery_1st_stage"<br />
2021-07-12 20:23:25: pid 9: LOG: PCP child 200 exits with status 0 in failover()<br />
2021-07-12 20:23:25: pid 9: LOG: fork a new PCP child pid 240 in failover()<br />
2021-07-12 20:23:25: pid 240: LOG: PCP process: 240 started<br />
<br />
<br />
<strong>Relevant config vars:</strong><br />
failover_on_backend_error = off<br />
detach_false_primary = on<br />
<br />
health_check_period = 10<br />
health_check_timeout = 5<br />
health_check_max_retries = 5<br />
health_check_retry_delay = 2<br />
connect_timeout = 10000<br />
<br />
<br />
<br />
<strong>Snippet of recovery_1st_stage where the process failed:</strong><br />
ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -t postgres@${DEST_NODE_HOST} "<br />
set -o errexit<br />
echo -e 'Starting online recovery\n' >> ${DEST_NODE_LOG}<br />
<br />
# Disable Puppet<br />
sudo /opt/puppetlabs/bin/puppet agent --disable 'Online recovery of instance: ${DEST_NODE_NAME,,}';<br />
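# wait up to 60 x 10 s (~10 minutes, which matches the ~10 min gap in the logs above) for a running puppet agent to release its lock file<br />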
waiting=0<br />
while [[ \${waiting} -lt 60 ]] && sudo [ -f /opt/puppetlabs/puppet/cache/state/agent_catalog_run.lock ];<br />
do waiting=\$((waiting+1)); sleep 10;<br />
done;<br />
if [[ \${waiting} -eq 60 ]]; then<br />
echo 'Waited too long. Exiting'; exit 2<br />
fi<br />
<br />
<br />
<strong>Failure Log:</strong><br />
recovery_1st_stage: start: Online Recovery for Standby node 0<br />
INFO - Initialize kerberos session<br />
Waited too long. Exiting<br />
ERROR: recovery_1st_stage: end: pg_basebackup failed. online recovery failed<br />
<br />
<br />
For Test 1, I would like to understand whether it is by design that pgpool’s health check does not detect storage problems on a node. If pgpool checks the health of the PostgreSQL nodes via a simple ping/connect, maybe that check could also write something to a test database to make sure the instance is actually working and not just reachable? Or, at the least, could there be a timeout for queries running through pgpool, so that when something like this happens the query is cancelled automatically after a few seconds?<br />
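<br />
To illustrate the first idea (nothing built into pgpool as far as I know - the wrapper below, the probe_heartbeat table and the 15-second timeout are purely my own invention; pcp_detach_node itself is the standard pgpool tool, assumed here to run on the pgpool host with PCP on its default port 9898):<br />
<br />
# hypothetical cron-driven write probe: if a trivial INSERT into a throwaway table<br />
# does not complete in time, detach the backend so new clients fail over<br />
NODE_ID=0<br />
NODE_HOST=dbod-ag-pg01.ch  # backend host/port as in my setup<br />
if ! timeout 15 psql -h ${NODE_HOST} -p 6600 -U postgres -d postgres -c "INSERT INTO probe_heartbeat(ts) VALUES (now());" >/dev/null; then<br />
pcp_detach_node -h localhost -p 9898 -U pgpool -w -n ${NODE_ID}<br />
fi<br />
<br />
And for the second idea, I guess PostgreSQL’s own statement_timeout (set in postgresql.conf or per role) would at least put an upper bound on how long a query can run, though I am not sure it can fire while a backend is stuck waiting on inaccessible storage.<br />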
<br />
For Test 2, if this failover happened because of the failing online recovery, is there anything I can do to prevent it in the future - some configuration change, for example?<br /></div>
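<div dir="auto"><br />
For example, would simply turning the false-primary detection off be the right mitigation here, or would that hide genuine split-brain cases? This is only my speculation based on the "invalid node found 2" line above, not a confirmed fix:<br />
<br />
# pgpool.conf - candidate change I am considering<br />
detach_false_primary = off<br /></div>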
</div>
<div name="messageSignatureSection"><br />
<div class="matchFont">
<div dir="auto">Thanks in anticipation!
<div dir="auto"><br /></div>
<div dir="auto">Anirudh</div>
</div>
</div>
</div>
</body>
</html>