[pgpool-general: 7508] Odd behaviour for failover and rejoin.
Joe Madden
Joe.Madden at mottmac.com
Wed Apr 14 22:00:18 JST 2021
Hi all,
We are seeing some odd behaviour when failing over on Pgpool 4.2.2, which I need some help with.
We have 3 Pgpool nodes and 2 database backends: node 0 and node 1.
At the start of the test, node 0 is the primary and node 1 is the standby.
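For reference, the relevant backend definitions in pgpool.conf look roughly like this (hosts and ports are as the cluster reports them; weights, data directories and flags omitted):

backend_hostname0 = '10.57.5.16'
backend_port0 = 5432
backend_hostname1 = '10.57.5.17'
backend_port1 = 5432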
When we shut down the database on node 0, Pgpool detects the failure and fails over to node 1, which becomes the new primary. This works fine.
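At that point we confirm the node state from Pgpool itself with something like the following (the host, port and user are placeholders for our setup):

# check backend status as seen by Pgpool
psql -h <pgpool-vip> -p 9999 -U postgres -c "SHOW pool_nodes;"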
We then go to the old primary (node 0, currently in a stopped state) and rejoin it to the new primary (node 1) with the following command:
[postgres at svd-db-01 ~]$ /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf node rejoin -d 'host=10.57.5.17 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/usr/pgsql-12/bin/pg_rewind -D '/var/lib/pgsql/12/data' --source-server='host=10.57.5.17 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 0 files copied to /var/lib/pgsql/12/data
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=10.57.5.16 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "sudo /usr/bin/systemctl start postgresql-12.service"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
This command works fine, and the replication is working as expected:
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------------+---------+-----------+------------+----------+----------+----------+-------------------------------------------------------------
1 | 10.57.5.16 | standby | running | 10.57.5.17 | default | 100 | 26 | host=10.57.5.16 user=repmgr dbname=repmgr connect_timeout=2
2 | 10.57.5.17 | primary | * running | | default | 100 | 26 | host=10.57.5.17 user=repmgr dbname=repmgr connect_timeout=2
[postgres at svd-db-01 ~]$
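The cluster overview above comes from repmgr's cluster view, i.e. something along these lines (same config file as the rejoin command):

/usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf cluster show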
Standby (10.57.5.16):
postgres=# SELECT * FROM pg_stat_wal_receiver;
-[ RECORD 1 ]---------+----------------------------------------------------------------------------
pid                   | 1099855
status                | streaming
receive_start_lsn     | 0/38000000
receive_start_tli     | 26
received_lsn          | 0/3801A420
received_tli          | 26
last_msg_send_time    | 2021-04-14 12:55:34.99552+00
last_msg_receipt_time | 2021-04-14 12:55:34.996361+00
latest_end_lsn        | 0/3801A420
latest_end_time       | 2021-04-14 12:55:34.99552+00
slot_name             |
sender_host           | 10.57.5.17
sender_port           | 5432
conninfo              | user=repmgr connect_timeout=2 host=10.57.5.17 application_name=10.57.5.16
postgres=#
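The conninfo reported by the wal receiver is the primary_conninfo that repmgr wrote to postgresql.auto.conf during the rejoin; on node 0 it looks roughly like this (parameter order may differ, and repmgr may add further options):

primary_conninfo = 'user=repmgr connect_timeout=2 host=10.57.5.17 application_name=10.57.5.16'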
Primary (10.57.5.17):
postgres=# SELECT * FROM pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 571268
usesysid         | 16384
usename          | repmgr
application_name | 10.57.5.16
client_addr      | 10.57.5.16
client_hostname  |
client_port      | 38126
backend_start    | 2021-04-14 12:41:18.634938+00
backend_xmin     |
state            | streaming
sent_lsn         | 0/3801A118
write_lsn        | 0/3801A118
flush_lsn        | 0/3801A118
replay_lsn       | 0/3801A118
write_lag        | 00:00:00.000713
flush_lag        | 00:00:00.001845
replay_lag       | 00:00:00.001996
sync_priority    | 0
sync_state       | async
reply_time       | 2021-04-14 12:55:19.988998+00
Pgpool detects the new primary, but it does not detect the standby correctly:
2021-04-14 12:48:36: pid 1208760: DEBUG: do_query: extended:0 query:"SELECT pg_last_wal_replay_lsn()"
2021-04-14 12:48:36: pid 1208760: CONTEXT: while checking replication time lag
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_process_query.c:1878
2021-04-14 12:48:36: pid 1208760: DEBUG: do_query: extended:0 query:"SELECT pg_current_wal_lsn()"
2021-04-14 12:48:36: pid 1208760: CONTEXT: while checking replication time lag
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_process_query.c:1878
2021-04-14 12:48:36: pid 1208760: DEBUG: do_query: extended:0 query:"SELECT application_name, state, sync_state FROM pg_stat_replication"
2021-04-14 12:48:36: pid 1208760: CONTEXT: while checking replication time lag
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_process_query.c:1878
2021-04-14 12:48:36: pid 1208760: DEBUG: do_query: extended:0 query:"SELECT pg_is_in_recovery()"
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_process_query.c:1878
2021-04-14 12:48:36: pid 1208760: DEBUG: do_query: extended:0 query:"SELECT pg_is_in_recovery()"
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_process_query.c:1878
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: multiple standbys: 1
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:2953
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: checking connectivity
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:2986
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: 1 is primary
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:2998
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: 0 is standby
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3005
2021-04-14 12:48:36: pid 1208760: DEBUG: do_query: extended:0 query:"SELECT status, conninfo FROM pg_stat_wal_receiver"
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_process_query.c:1878
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: conninfo for standby 0 is === user=repmgr connect_timeout=2 host=10.57.5.17 application_name=10.57.5.16 ===. host:10.57.5.17 port:
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3043
2021-04-14 12:48:36: pid 1208760: LOG: verify_backend_node_status: primary 1 does not connect to standby 0
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3063
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: primary 0 owns 0 standbys out of 1
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3079
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: true_primary -1
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3081
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: primary 1 owns 0 standbys out of 1
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3079
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: true_primary -1
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3081
2021-04-14 12:48:36: pid 1208760: LOG: verify_backend_node_status: primary 1 owns only 0 standbys out of 1
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3088
2021-04-14 12:48:36: pid 1208760: DEBUG: node status[0]: 2
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_worker_child.c:185
2021-04-14 12:48:36: pid 1208760: DEBUG: node status[1]: 1
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_worker_child.c:185
2021-04-14 12:48:37: pid 1206245: DEBUG: health check: clearing alarm
2021-04-14 12:48:37: pid 1206245: LOCATION: health_check.c:369
2021-04-14 12:48:37: pid 1206245: DEBUG: authenticate kind = 5
2021-04-14 12:48:37: pid 1206245: LOCATION: pool_auth.c:144
2021-04-14 12:48:37: pid 1206245: DEBUG: authenticate backend: key data received
The error is:
verify_backend_node_status: primary 1 owns only 0 standbys out of 1
This error remains until we restart node 0 (the new standby), at which point it goes away.
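In case it is relevant, these are the checks from the debug log above, run by hand (Pgpool itself runs them as sr_check_user; we use the repmgr user here). Note that the conninfo reported by the standby contains host=10.57.5.17 but no explicit port, which matches the empty "port:" in the debug line above.

# what Pgpool queries on the standby (node 0, 10.57.5.16) in verify_backend_node_status
psql -h 10.57.5.16 -U repmgr -d repmgr -c "SELECT status, conninfo FROM pg_stat_wal_receiver;"

# and what it queries on the primary (node 1, 10.57.5.17)
psql -h 10.57.5.17 -U repmgr -d repmgr -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"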
Does anyone have any suggestions as to what could be causing this?
Thanks.
Joe Madden
Senior Systems Engineer
D 01412224666
joe.madden at mottmac.com
-----Original Message-----
From: pgpool-general <pgpool-general-bounces at pgpool.net> On Behalf Of Emond Papegaaij
Sent: 13 April 2021 18:29
To: pgpool-general at pgpool.net
Subject: [pgpool-general: 7505] Re: Problems with pg_rewind
> One possible reason I can think of that may be causing this problem is
> that our testcases require the creation of snapshots for the vms
> running the cluster. These snapshots are restored at the beginning of
> the test. We do make sure the cluster is in a consistent state before
> we continue with the test, but maybe some part of the cluster is still
> not entirely in sync. We've changed the re-synchronization code
> yesterday to fully stop and restart the entire cluster in a reliable
> way rather than trying to fix a distorted cluster. We haven't seen the
> error since, but it will require some more runs to be sure. It still
> is a strange issue, because even if it's caused by restoring the vm
> snapshots, all pgpools and postgresql databases have already been
> restarted prior to this failure.
Updating on this possible cause: We've observed the same failure with our new setup. In this setup all pgpool and pg nodes are stopped and restarted in a predictable way. This works great and restores the cluster consistently in a healthy state. However, we still hit the rewind problem.
Also, we've updated pgpool to the latest version, 4.2.2.
Best regards,
Emond
_______________________________________________
pgpool-general mailing list
pgpool-general at pgpool.net
http://www.pgpool.net/mailman/listinfo/pgpool-general
More information about the pgpool-general mailing list