[pgpool-hackers: 4564] PGPool 4.2.15 shows incorrect pool_nodes
Igor Yurchenko
harry.urcen at gmail.com
Fri Jan 24 00:27:17 JST 2025
Hi guys,
My brain is fried; it looks like I cannot crack this issue without a hint from you.
In my setup Pgpool reports incorrect data in 'show pool_nodes':
[root@pg-mgrdb1 ~]# psql -U fabrix -w -h 10.65.188.59 -p 9999 postgres -c 'show pool_nodes'
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 10.65.188.55 | 5432 | up     | 0.500000  | primary | 5638095    | true              | 0                 | streaming         | async                  | 2025-01-23 16:02:22
 1       | 10.65.188.56 | 5432 | up     | 0.500000  | standby | 213106     | false             | 0                 |                   |                        | 2025-01-23 16:02:23
(2 rows)
[root@pg-mgrdb1 ~]#
In reality, neither backend is in recovery (both act as primary), so there is no replication between the nodes at all. Apparently my failover/failback scripts misbehave and have left the whole cluster in a broken state. But why doesn't the pool_nodes report match reality? The only visible sign that something is wrong is that the replication_state and replication_sync_state values (streaming, async) are printed on the wrong line: they should appear for the standby node, not the primary. The streaming replication check is enabled, with sr_check_period set to 1.
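To show what I mean by "reality": this is roughly how I check the actual role of each backend, bypassing pgpool entirely (a quick sketch, using the same hosts and user as in the pool_nodes call above):

#!/bin/bash
# Ask each backend directly whether it is in recovery (t = standby, f = primary).
for host in 10.65.188.55 10.65.188.56; do
    echo -n "$host: "
    psql -U fabrix -w -h "$host" -p 5432 postgres -Atc 'SELECT pg_is_in_recovery()'
done
# A healthy pair prints one 'f' and one 't'; in my case both nodes print 'f',
# i.e. both run as independent primaries, yet pool_nodes still says "standby".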
It looks like something has gone wrong with auto failback. In the pgpool logs I see this piece:
2025-01-23 16:02:23: pid 1135: LOG: watchdog is informed of failover start by the main process
2025-01-23 16:02:23: pid 1128: LOG: starting fail back. reconnect host 10.65.188.56(5432)
2025-01-23 16:02:23: pid 1128: LOG: Node 0 is not down (status: 2)
2025-01-23 16:02:23: pid 1128: LOG: execute command: /etc/pgpool-II/recovery/failback_node.sh 1 10.65.188.56 5432
+ NODE_ID=1
+ NODE_HOST=10.65.188.56
+ NODE_PORT=5432
+ LOG_FILE=/home/postgres/pg_logs/failback.log
++ date
+ echo 'Thu Jan 23 16:02:23 IST 2025: Failback triggered for node 1 at 10.65.188.56:5432'
+ true
+ '[' 0 -eq 0 ']'
++ date
+ echo 'Thu Jan 23 16:02:23 IST 2025: Node 1 successfully reattached.'
2025-01-23 16:02:23: pid 1128: LOG: Do not restart children because we are failing back node id 1 host: 10.65.188.56 port: 5432 and we are in streaming replication mode and not all backends were down
2025-01-23 16:02:23: pid 1128: LOG: find_primary_node_repeatedly: follow primary is ongoing. return current primary: 0
2025-01-23 16:02:23: pid 1128: LOG: failover: set new primary node: 0
2025-01-23 16:02:23: pid 1128: LOG: failover: set new main node: 0
2025-01-23 16:02:23: pid 1135: LOG: received the failover indication from Pgpool-II on IPC interface
2025-01-23 16:02:23: pid 1135: LOG: watchdog is informed of failover end by the main process
2025-01-23 16:02:23: pid 19776: LOG: worker process received restart request
2025-01-23 16:02:23: pid 1128: LOG: failback done. reconnect host 10.65.188.56(5432)
2025-01-23 16:02:23: pid 19780: LOG: selecting backend connection
2025-01-23 16:02:23: pid 19780: DETAIL: failover or failback event detected, discarding existing connections
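For readability, here is roughly what failback_node.sh does, as I reconstruct it from the xtrace output above (this is not the verbatim script; the real one is in the archive linked below, and the comments show the argument values from this run):

#!/bin/bash
# Reconstruction of failback_node.sh based on the trace above.
NODE_ID=$1      # 1
NODE_HOST=$2    # 10.65.188.56
NODE_PORT=$3    # 5432
LOG_FILE=/home/postgres/pg_logs/failback.log

echo "$(date): Failback triggered for node $NODE_ID at $NODE_HOST:$NODE_PORT" >> "$LOG_FILE"
true    # in the trace the reattach step reduces to a bare 'true'
if [ $? -eq 0 ]; then
    echo "$(date): Node $NODE_ID successfully reattached." >> "$LOG_FILE"
fi

If my reading of the trace is right, nothing in this path checks the node's actual role before it is reported as reattached.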
The log mentions that "follow primary is ongoing", but the last call to follow_primary.sh happened quite a long time ago, so this flag looks stale to me.
Actually, I have two questions here:
1) Why does pgpool report incorrect states for the backends?
2) What is wrong with the follow_primary procedure? Does it make sense to use
follow_primary at all with only two nodes?
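Regarding question 1: as far as I understand, pool_nodes takes replication_state and replication_sync_state from pg_stat_replication on the node pgpool believes is primary, so this is what I would compare against (a sketch run directly on node 0; I may be wrong about where these columns come from):

# Run on the node pgpool reports as primary (10.65.188.55). On a healthy
# pair this returns one row describing the standby; on a primary with no
# standbys attached it returns nothing.
psql -U fabrix -w -h 10.65.188.55 -p 5432 postgres -c \
  'SELECT application_name, client_addr, state, sync_state FROM pg_stat_replication'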
My pgpool.conf, failback/failover scripts, and logs are available for 5 days
here: https://filebin.net/076qbpqicx3rffik
I'd highly appreciate any advice.
BR
Igor Yurchenko