[pgpool-general: 7508] Odd behaviour for failover and rejoin.
Joe Madden
Joe.Madden at mottmac.com
Wed Apr 14 22:00:18 JST 2021
Hi all,
We are seeing some odd behaviour when failing over on Pgpool 4.2.2, which I need some help with.
We have 3 Pgpool nodes and 2 database backends: node 0 and node 1.
At the start of the test, node 0 is the primary and node 1 is the standby.
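For reference, the relevant backend definitions in pgpool.conf look roughly like this (hosts and ports are as the cluster reports them; weights, data directories and flags omitted):

backend_hostname0 = '10.57.5.16'
backend_port0 = 5432
backend_hostname1 = '10.57.5.17'
backend_port1 = 5432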
When we shut down the database on node 0, Pgpool detects the failure and fails over to node 1, which becomes the new primary. This works fine.
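At that point we confirm the node state from Pgpool itself with something like the following (the host, port and user are placeholders for our setup):

# check backend status as seen by Pgpool
psql -h <pgpool-vip> -p 9999 -U postgres -c "SHOW pool_nodes;"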
We then go to the old primary (node 0, currently in a stopped state) and rejoin it to the new primary (node 1) with the following command:
[postgres at svd-db-01 ~]$ /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf node rejoin -d 'host=10.57.5.17 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/usr/pgsql-12/bin/pg_rewind -D '/var/lib/pgsql/12/data' --source-server='host=10.57.5.17 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 0 files copied to /var/lib/pgsql/12/data
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=10.57.5.16 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "sudo /usr/bin/systemctl start postgresql-12.service"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
This command works fine, and the replication is working as expected:
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------------+---------+-----------+------------+----------+----------+----------+-------------------------------------------------------------
1 | 10.57.5.16 | standby | running | 10.57.5.17 | default | 100 | 26 | host=10.57.5.16 user=repmgr dbname=repmgr connect_timeout=2
2 | 10.57.5.17 | primary | * running | | default | 100 | 26 | host=10.57.5.17 user=repmgr dbname=repmgr connect_timeout=2
[postgres at svd-db-01 ~]$
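The cluster overview above comes from repmgr's cluster view, i.e. something along these lines (same config file as the rejoin command):

/usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf cluster show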
Standby (10.57.5.16):
postgres=# SELECT * FROM pg_stat_wal_receiver;
-[ RECORD 1 ]---------+----------------------------------------------------------------------------
pid                   | 1099855
status                | streaming
receive_start_lsn     | 0/38000000
receive_start_tli     | 26
received_lsn          | 0/3801A420
received_tli          | 26
last_msg_send_time    | 2021-04-14 12:55:34.99552+00
last_msg_receipt_time | 2021-04-14 12:55:34.996361+00
latest_end_lsn        | 0/3801A420
latest_end_time       | 2021-04-14 12:55:34.99552+00
slot_name             |
sender_host           | 10.57.5.17
sender_port           | 5432
conninfo              | user=repmgr connect_timeout=2 host=10.57.5.17 application_name=10.57.5.16
postgres=#
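The conninfo reported by the wal receiver is the primary_conninfo that repmgr wrote to postgresql.auto.conf during the rejoin; on node 0 it looks roughly like this (parameter order may differ, and repmgr may add further options):

primary_conninfo = 'user=repmgr connect_timeout=2 host=10.57.5.17 application_name=10.57.5.16'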
Primary (10.57.5.17):
postgres=# SELECT * FROM pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 571268
usesysid         | 16384
usename          | repmgr
application_name | 10.57.5.16
client_addr      | 10.57.5.16
client_hostname  |
client_port      | 38126
backend_start    | 2021-04-14 12:41:18.634938+00
backend_xmin     |
state            | streaming
sent_lsn         | 0/3801A118
write_lsn        | 0/3801A118
flush_lsn        | 0/3801A118
replay_lsn       | 0/3801A118
write_lag        | 00:00:00.000713
flush_lag        | 00:00:00.001845
replay_lag       | 00:00:00.001996
sync_priority    | 0
sync_state       | async
reply_time       | 2021-04-14 12:55:19.988998+00
Pgpool detects the new primary, but it does not detect the standby correctly:
2021-04-14 12:48:36: pid 1208760: DEBUG: do_query: extended:0 query:"SELECT pg_last_wal_replay_lsn()"
2021-04-14 12:48:36: pid 1208760: CONTEXT: while checking replication time lag
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_process_query.c:1878
2021-04-14 12:48:36: pid 1208760: DEBUG: do_query: extended:0 query:"SELECT pg_current_wal_lsn()"
2021-04-14 12:48:36: pid 1208760: CONTEXT: while checking replication time lag
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_process_query.c:1878
2021-04-14 12:48:36: pid 1208760: DEBUG: do_query: extended:0 query:"SELECT application_name, state, sync_state FROM pg_stat_replication"
2021-04-14 12:48:36: pid 1208760: CONTEXT: while checking replication time lag
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_process_query.c:1878
2021-04-14 12:48:36: pid 1208760: DEBUG: do_query: extended:0 query:"SELECT pg_is_in_recovery()"
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_process_query.c:1878
2021-04-14 12:48:36: pid 1208760: DEBUG: do_query: extended:0 query:"SELECT pg_is_in_recovery()"
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_process_query.c:1878
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: multiple standbys: 1
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:2953
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: checking connectivity
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:2986
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: 1 is primary
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:2998
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: 0 is standby
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3005
2021-04-14 12:48:36: pid 1208760: DEBUG: do_query: extended:0 query:"SELECT status, conninfo FROM pg_stat_wal_receiver"
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_process_query.c:1878
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: conninfo for standby 0 is === user=repmgr connect_timeout=2 host=10.57.5.17 application_name=10.57.5.16 ===. host:10.57.5.17 port:
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3043
2021-04-14 12:48:36: pid 1208760: LOG: verify_backend_node_status: primary 1 does not connect to standby 0
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3063
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: primary 0 owns 0 standbys out of 1
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3079
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: true_primary -1
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3081
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: primary 1 owns 0 standbys out of 1
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3079
2021-04-14 12:48:36: pid 1208760: DEBUG: verify_backend_node_status: true_primary -1
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3081
2021-04-14 12:48:36: pid 1208760: LOG: verify_backend_node_status: primary 1 owns only 0 standbys out of 1
2021-04-14 12:48:36: pid 1208760: LOCATION: pgpool_main.c:3088
2021-04-14 12:48:36: pid 1208760: DEBUG: node status[0]: 2
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_worker_child.c:185
2021-04-14 12:48:36: pid 1208760: DEBUG: node status[1]: 1
2021-04-14 12:48:36: pid 1208760: LOCATION: pool_worker_child.c:185
2021-04-14 12:48:37: pid 1206245: DEBUG: health check: clearing alarm
2021-04-14 12:48:37: pid 1206245: LOCATION: health_check.c:369
2021-04-14 12:48:37: pid 1206245: DEBUG: authenticate kind = 5
2021-04-14 12:48:37: pid 1206245: LOCATION: pool_auth.c:144
2021-04-14 12:48:37: pid 1206245: DEBUG: authenticate backend: key data received
The error is:
verify_backend_node_status: primary 1 owns only 0 standbys out of 1
This error remains until we restart node 0 (the new standby), at which point it goes away.
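In case it is relevant, these are the checks from the debug log above, run by hand (Pgpool itself runs them as sr_check_user; we use the repmgr user here). Note that the conninfo reported by the standby contains host=10.57.5.17 but no explicit port, which matches the empty "port:" in the debug line above.

# what Pgpool queries on the standby (node 0, 10.57.5.16) in verify_backend_node_status
psql -h 10.57.5.16 -U repmgr -d repmgr -c "SELECT status, conninfo FROM pg_stat_wal_receiver;"

# and what it queries on the primary (node 1, 10.57.5.17)
psql -h 10.57.5.17 -U repmgr -d repmgr -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"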
Does anyone have any suggestions as to what could be causing this?
Thanks.
Joe Madden
Senior Systems Engineer
D 01412224666
joe.madden at mottmac.com
-----Original Message-----
From: pgpool-general <pgpool-general-bounces at pgpool.net> On Behalf Of Emond Papegaaij
Sent: 13 April 2021 18:29
To: pgpool-general at pgpool.net
Subject: [pgpool-general: 7505] Re: Problems with pg_rewind
> One possible reason I can think of that may be causing this problem is
> that our testcases require the creation of snapshots for the vms
> running the cluster. These snapshots are restored at the beginning of
> the test. We do make sure the cluster is in a consistent state before
> we continue with the test, but maybe some part of the cluster is still
> not entirely in sync. We've changed the re-synchronization code
> yesterday to fully stop and restart the entire cluster in a reliable
> way rather than trying to fix a distorted cluster. We haven't seen the
> error since, but it will require some more runs to be sure. It still
> is a strange issue, because even if it's caused by restoring the vm
> snapshots, all pgpools and postgresql databases have already been
> restarted prior to this failure.
Updating on this possible cause: We've observed the same failure with our new setup. In this setup all pgpool and pg nodes are stopped and restarted in a predictable way. This works great and restores the cluster consistently in a healthy state. However, we still hit the rewind problem.
Also, we've updated pgpool to the latest version, 4.2.2.
Best regards,
Emond
_______________________________________________
pgpool-general mailing list
pgpool-general at pgpool.net
http://www.pgpool.net/mailman/listinfo/pgpool-general
More information about the pgpool-general mailing list