[pgpool-general: 8642] online recovery causes node to loose replication state.
Todd Stein
todd.stein at microfocus.com
Thu Mar 16 03:33:13 JST 2023
Hi All,
I'm running a 3 node cluster based on PostgreSQL 14 and pgpool 4.4.1. I'm having an issue while testing failover scenarios.
At start of scenario:
node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0 | catvmtspg04a.domain.net | 5432 | up | up | 0.333333 | primary | primary | 0 | true | 0 | | | 2023-03-15 12:56:31
1 | catvmtspg04b.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 12:56:38
2 | catvmtspg04c.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 13:46:29
(3 rows)
After stopping postgres on node0
node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0 | catvmtspg04a.domain.net | 5432 | down | down | 0.333333 | standby | unknown | 0 | false | 0 | | | 2023-03-15 12:34:05
1 | catvmtspg04b.domain.net | 5432 | up | up | 0.333333 | primary | primary | 0 | true | 0 | | | 2023-03-15 12:34:05
2 | catvmtspg04c.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 12:34:09
(3 rows)
After running recovery on node 0
pcp_recovery_node -h 10.78.121.3 -p 9898 -U postgres -n 0
node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0 | catvmtspg04a.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 12:55:59
1 | catvmtspg04b.domain.net | 5432 | up | up | 0.333333 | primary | primary | 0 | true | 0 | | | 2023-03-15 12:34:05
2 | catvmtspg04c.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 12:34:09
(3 rows)
After promoting node1 so that node0 becomes primary again:
pcp_promote_node -s -n 0 -U pgpool
Note that Ioose the replication_state and replication_sync_state on node2, would like to know why this happens.
node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0 | catvmtspg04a.domain.net | 5432 | up | up | 0.333333 | primary | primary | 0 | true | 0 | | | 2023-03-15 12:56:31
1 | catvmtspg04b.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 12:56:38
2 | catvmtspg04c.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | | | 2023-03-15 12:56:38
(3 rows)
My process to get the cluster back to normal. Don't understand why the recovery process is failing... Eventually, it works...
pcp_recovery_node -h 10.78.121.3 -p 9898 -U postgres -n 2
ERROR: recovery is checking if postmaster is started
DETAIL: postmaster on hostname:"catvmtspg04c.domain.net" database:"template1" user:"postgres" failed to start in 0 second
[postgres at catvmtspg04a pgpool-II]$ pcp_recovery_node -h 10.78.121.3 -p 9898 -U postgres -n 2
Password:
ERROR: recovery is checking if postmaster is started
DETAIL: postmaster on hostname:"catvmtspg04c.domain.net" database:"template1" user:"postgres" failed to start in 0 second
[postgres at catvmtspg04a pgpool-II]$ pcp_recovery_node -h 10.78.121.3 -p 9898 -U postgres -n 2
Password:
pcp_recovery_node -- Command Successful
cluster back to normal.
[postgres at catvmtspg04a pgpool-II]$ psql -h 10.78.121.3 -p 9999 -U pgpool postgres -c "show pool_nodes"
Password for user pgpool:
node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0 | catvmtspg04a.domain.net | 5432 | up | up | 0.333333 | primary | primary | 0 | true | 0 | | | 2023-03-15 12:56:31
1 | catvmtspg04b.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 12:56:38
2 | catvmtspg04c.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 13:46:29
(3 rows)
Regards,
Todd Stein
OpsBridge Technical Success
OpenText
(Cell) +1 (941) 248-8752
tstein2 at opentext.com<mailto:tstein2 at opentext.com>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20230315/e17dc323/attachment.htm>
More information about the pgpool-general
mailing list