[pgpool-general: 8642] online recovery causes node to loose replication state.

Thu Mar 16 03:33:13 JST 2023

Hi All,
I'm running a 3 node cluster based on PostgreSQL 14 and pgpool 4.4.1.  I'm having an issue while testing failover scenarios.

At start of scenario:

node_id |           hostname           | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0       | catvmtspg04a.domain.net | 5432 | up     | up        | 0.333333  | primary | primary | 0          | true              | 0                 |                   |                        | 2023-03-15 12:56:31
1       | catvmtspg04b.domain.net | 5432 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 | streaming         | async                  | 2023-03-15 12:56:38
2       | catvmtspg04c.domain.net | 5432 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 | streaming         | async                  | 2023-03-15 13:46:29
(3 rows)

After stopping postgres on node0

node_id |           hostname           | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0       | catvmtspg04a.domain.net | 5432 | down   | down      | 0.333333  | standby | unknown | 0          | false             | 0                 |                   |                        | 2023-03-15 12:34:05
1       | catvmtspg04b.domain.net | 5432 | up     | up        | 0.333333  | primary | primary | 0          | true              | 0                 |                   |                        | 2023-03-15 12:34:05
2       | catvmtspg04c.domain.net | 5432 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 | streaming         | async                  | 2023-03-15 12:34:09
(3 rows)

After running recovery on node 0
pcp_recovery_node -h 10.78.121.3 -p 9898 -U postgres -n 0
node_id |           hostname           | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0       | catvmtspg04a.domain.net | 5432 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 | streaming         | async                  | 2023-03-15 12:55:59
1       | catvmtspg04b.domain.net | 5432 | up     | up        | 0.333333  | primary | primary | 0          | true              | 0                 |                   |                        | 2023-03-15 12:34:05
2       | catvmtspg04c.domain.net | 5432 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 | streaming         | async                  | 2023-03-15 12:34:09
(3 rows)

After promoting node1 so that node0 becomes primary again:
pcp_promote_node -s -n 0 -U pgpool
Note that Ioose the replication_state and replication_sync_state on node2, would like to know why this happens.
node_id |           hostname           | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0       | catvmtspg04a.domain.net | 5432 | up     | up        | 0.333333  | primary | primary | 0          | true              | 0                 |                   |                        | 2023-03-15 12:56:31
1       | catvmtspg04b.domain.net | 5432 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 | streaming         | async                  | 2023-03-15 12:56:38
2       | catvmtspg04c.domain.net | 5432 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 |                   |                        | 2023-03-15 12:56:38
(3 rows)

My process to get the cluster back to normal.  Don't understand why the recovery process is failing...  Eventually, it works...
pcp_recovery_node -h 10.78.121.3 -p 9898 -U postgres -n 2

ERROR:  recovery is checking if postmaster is started
DETAIL:  postmaster on hostname:"catvmtspg04c.domain.net" database:"template1" user:"postgres" failed to start in 0 second

[postgres at catvmtspg04a pgpool-II]$ pcp_recovery_node -h 10.78.121.3 -p 9898 -U postgres -n 2
Password:
ERROR:  recovery is checking if postmaster is started
DETAIL:  postmaster on hostname:"catvmtspg04c.domain.net" database:"template1" user:"postgres" failed to start in 0 second

[postgres at catvmtspg04a pgpool-II]$ pcp_recovery_node -h 10.78.121.3 -p 9898 -U postgres -n 2
Password:
pcp_recovery_node -- Command Successful
cluster back to normal.
[postgres at catvmtspg04a pgpool-II]$ psql -h 10.78.121.3 -p 9999 -U pgpool postgres -c "show pool_nodes"
Password for user pgpool:
node_id |           hostname           | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0       | catvmtspg04a.domain.net | 5432 | up     | up        | 0.333333  | primary | primary | 0          | true              | 0                 |                   |                        | 2023-03-15 12:56:31
1       | catvmtspg04b.domain.net | 5432 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 | streaming         | async                  | 2023-03-15 12:56:38
2       | catvmtspg04c.domain.net | 5432 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 | streaming         | async                  | 2023-03-15 13:46:29
(3 rows)

Regards,

Todd Stein
OpsBridge Technical Success
OpenText
(Cell) +1 (941) 248-8752
tstein2 at opentext.com<mailto:tstein2 at opentext.com>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20230315/e17dc323/attachment.htm>