[pgpool-general: 8543] Re: Issues taking a node out of a cluster

Mon Jan 16 09:33:38 JST 2023

Sorry for the delay.

> Hi all,
> 
> We are seeing failures in our test suite on a specific set of tests related
> to taking a node out of a cluster. In short, it seems the following sequence
> of events occurs:
> * We start with a healthy cluster with 3 nodes (0, 1 and 2), each node
> running pgpool and postgresql. Node 0 runs the primary database.
> * node 1 is shut down
> * pgpool on node 0 and 2 correctly mark backend 1 down
> * pgpool on node 0 is reconfigured, removing node 1 from the configuration,
> backend 0 remains backend 0, backend 2 is now known as backend 1
> * pgpool on node 0 starts up again, and receives the cluster status from
> node 2, which includes backend 1 being down.
> * pgpool on node 0 now also marks backend 1 as being down, but because of
> the renumbering, it actually marks the backend on node 2 as down
> * pgpool on node 2 gets its new configuration, same as on node 0
> * pgpool on node 2 (which now runs backend 1) gets the cluster status
> from node 0, and marks backend 1 down
> * the cluster ends up with pgpool and postgresql running on both remaining
> nodes, but backend 1 is down. It never recovers from this state
> automatically, even though auto_failback is enabled and postgresql is up
> and streaming.
> 
> For node 2 (with backend 1), pcp_node_info returns the following
> information for backend 1:
> Hostname               : 172.29.30.3
> Port                   : 5432
> Status                 : 3
> Weight                 : 0.500000
> Status Name            : down
> Backend Status Name    : up
> Role                   : standby
> Backend Role           : standby
> Replication Delay      : 0
> Replication State      : streaming
> Replication Sync State : async
> Last Status Change     : 2023-01-09 22:28:41
> 
> My first question is: Can we somehow prevent the state of backend 1 being
> assigned to the wrong node during the configuration update?

Have you removed the pgpool_status file before restarting pgpool? The
file remembers the backend status along with the node id, so after
renumbering the backends you need to update (or remove) the file. If
the file does not exist upon pgpool startup, it will be automatically
created.
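
To illustrate why a stale status file matters after renumbering, here
is a minimal sketch in Python. It is only a toy model: the function
names, host names and the dictionary layout are made up for this
example and are not pgpool's actual code or file format. It assumes
only what is described above, namely that the saved status is keyed by
node id and is applied to whatever backend currently has that id.

```python
from pathlib import Path


def apply_saved_status(saved, backends):
    """Apply status saved per node id to the current backend list.

    `saved` maps node id -> "up"/"down" (toy model of a remembered status).
    `backends` is the current list of hosts; the list index is the node id.
    Ids without a saved entry default to "up".
    """
    return {host: saved.get(node_id, "up") for node_id, host in enumerate(backends)}


def discard_status_file(path):
    """Remove a status file if it exists; return True if something was removed."""
    status_file = Path(path)
    if status_file.exists():
        status_file.unlink()
        return True
    return False


# Status remembered before reconfiguration: node 1 had been shut down.
saved = {0: "up", 1: "down", 2: "up"}

# After removing node 1 from the configuration, the old backend 2
# becomes backend 1 (hypothetical host names).
new_backends = ["node0", "node2"]

# The stale "down" for id 1 now lands on the host that used to be id 2.
stale = apply_saved_status(saved, new_backends)

# With no remembered status, both remaining backends start as "up".
fresh = apply_saved_status({}, new_backends)

print(stale)
print(fresh)
```

In this model the remembered "down" for id 1 ends up on the host that
used to be backend 2, which matches what you observed. Removing the
file before startup corresponds to the `fresh` case. Starting pgpool
with the -D (--discard-status) option should have the same effect as
removing the file by hand.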

> My second question: Why does the auto_failback not reattach backend 1 when
> it detects the database is up and streaming?

Maybe because of this?

https://www.pgpool.net/docs/44/en/html/runtime-config-failover.html#RUNTIME-CONFIG-FAILOVER-SETTINGS

> Note: auto_failback may not work, when replication slot is used. There
> is possibility that the streaming replication is stopped, because
> failover_command is executed and replication slot is deleted by the
> command.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp