[pgpool-general: 8644] Re: online recovery causes node to lose replication state.
Tatsuo Ishii
ishii at sraoss.co.jp
Thu Mar 16 12:01:32 JST 2023
To investigate the issue we need your pgpool.conf, postgresql.conf and
the PostgreSQL logs. Can you share them? (Of course you can redact
passwords and other sensitive information.)
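For example, something along these lines should capture the effective
settings with passwords scrubbed (the paths are only examples; adjust
them to your installation):

psql -U postgres -c "SELECT name, setting FROM pg_settings WHERE source <> 'default';" > pg_settings.txt
grep -iv password /etc/pgpool-II/pgpool.conf > pgpool.conf.redacted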
Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp
> Hi All,
> I'm running a 3-node cluster based on PostgreSQL 14 and pgpool 4.4.1, and I'm having an issue while testing failover scenarios.
>
> At start of scenario:
>
> node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
> ---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
> 0 | catvmtspg04a.domain.net | 5432 | up | up | 0.333333 | primary | primary | 0 | true | 0 | | | 2023-03-15 12:56:31
> 1 | catvmtspg04b.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 12:56:38
> 2 | catvmtspg04c.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 13:46:29
> (3 rows)
>
> After stopping postgres on node0:
>
> node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
> ---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
> 0 | catvmtspg04a.domain.net | 5432 | down | down | 0.333333 | standby | unknown | 0 | false | 0 | | | 2023-03-15 12:34:05
> 1 | catvmtspg04b.domain.net | 5432 | up | up | 0.333333 | primary | primary | 0 | true | 0 | | | 2023-03-15 12:34:05
> 2 | catvmtspg04c.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 12:34:09
> (3 rows)
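>
> For reference, pgpool fills the replication_state and replication_sync_state columns from pg_stat_replication on the primary, so the equivalent manual check on the new primary would be something like this (hostname per my setup):
>
> -- run on catvmtspg04b.domain.net, the current primary
> SELECT application_name, state, sync_state FROM pg_stat_replication;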
>
> After running recovery on node 0:
> pcp_recovery_node -h 10.78.121.3 -p 9898 -U postgres -n 0
> node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
> ---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
> 0 | catvmtspg04a.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 12:55:59
> 1 | catvmtspg04b.domain.net | 5432 | up | up | 0.333333 | primary | primary | 0 | true | 0 | | | 2023-03-15 12:34:05
> 2 | catvmtspg04c.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 12:34:09
> (3 rows)
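>
> To double-check that node0 really came back as a streaming standby, I also run a quick sanity check on node0 itself (my own verification step, not part of the recovery script):
>
> -- run on catvmtspg04a.domain.net
> SELECT pg_is_in_recovery();                           -- expect 't' on a standby
> SELECT status, sender_host FROM pg_stat_wal_receiver; -- expect status 'streaming'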
>
> After promoting node0 (switchover with -s) so that it becomes primary again:
> pcp_promote_node -s -n 0 -U pgpool
> Note that I lose the replication_state and replication_sync_state on node2; I would like to know why this happens.
> node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
> ---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
> 0 | catvmtspg04a.domain.net | 5432 | up | up | 0.333333 | primary | primary | 0 | true | 0 | | | 2023-03-15 12:56:31
> 1 | catvmtspg04b.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 12:56:38
> 2 | catvmtspg04c.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | | | 2023-03-15 12:56:38
> (3 rows)
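>
> My working theory on the empty columns: if node2's walreceiver was not re-pointed at the new primary (e.g. by follow_primary_command), node2 will not show up in node0's pg_stat_replication, and pgpool then has nothing to display. A query like this on node2 should show where it is actually streaming from (just my guess, not confirmed):
>
> -- run on catvmtspg04c.domain.net
> SELECT status, sender_host, conninfo FROM pg_stat_wal_receiver;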
>
> Here is my process to get the cluster back to normal. I don't understand why the recovery process fails on the first attempts... eventually it works:
> pcp_recovery_node -h 10.78.121.3 -p 9898 -U postgres -n 2
>
> ERROR: recovery is checking if postmaster is started
> DETAIL: postmaster on hostname:"catvmtspg04c.domain.net" database:"template1" user:"postgres" failed to start in 0 second
>
> [postgres at catvmtspg04a pgpool-II]$ pcp_recovery_node -h 10.78.121.3 -p 9898 -U postgres -n 2
> Password:
> ERROR: recovery is checking if postmaster is started
> DETAIL: postmaster on hostname:"catvmtspg04c.domain.net" database:"template1" user:"postgres" failed to start in 0 second
>
> [postgres at catvmtspg04a pgpool-II]$ pcp_recovery_node -h 10.78.121.3 -p 9898 -U postgres -n 2
> Password:
> pcp_recovery_node -- Command Successful
> The cluster is back to normal.
> [postgres at catvmtspg04a pgpool-II]$ psql -h 10.78.121.3 -p 9999 -U pgpool postgres -c "show pool_nodes"
> Password for user pgpool:
> node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
> ---------+------------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
> 0 | catvmtspg04a.domain.net | 5432 | up | up | 0.333333 | primary | primary | 0 | true | 0 | | | 2023-03-15 12:56:31
> 1 | catvmtspg04b.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 12:56:38
> 2 | catvmtspg04c.domain.net | 5432 | up | up | 0.333333 | standby | standby | 0 | false | 0 | streaming | async | 2023-03-15 13:46:29
> (3 rows)
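>
> One guess on the transient errors above: the DETAIL says "failed to start in 0 second", and if that number echoes recovery_timeout from pgpool.conf, raising it should give the postmaster time to come up before pgpool gives up. This is only a hunch on my part; the value below is just an example:
>
> # pgpool.conf (example value, not my current setting)
> recovery_timeout = 90    # seconds pgpool waits for the remote postmaster to start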
>
>
> Regards,
>
> Todd Stein
> OpsBridge Technical Success
> OpenText
> (Cell) +1 (941) 248-8752
> tstein2 at opentext.com
>