[pgpool-hackers: 4564] PGPool 4.2.15 shows incorrect pool_nodes
Igor Yurchenko
harry.urcen at gmail.com
Fri Jan 24 00:27:17 JST 2025
Hi guys,
My brain is fried; it looks like I cannot crack this issue without a hint from you.
In my setup Pgpool reports incorrect data in 'show pool_nodes':
[root@pg-mgrdb1 ~]# psql -U fabrix -w -h 10.65.188.59 -p 9999 postgres -c 'show pool_nodes'
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 10.65.188.55 | 5432 | up     | 0.500000  | primary | 5638095    | true              | 0                 | streaming         | async                  | 2025-01-23 16:02:22
 1       | 10.65.188.56 | 5432 | up     | 0.500000  | standby | 213106     | false             | 0                 |                   |                        | 2025-01-23 16:02:23
(2 rows)
[root@pg-mgrdb1 ~]#
In reality, neither backend is in recovery (both act as primary), so there is no replication between the nodes at all. Apparently my failover/failback scripts misbehave and have left the whole cluster in a broken state. But why doesn't the pool_nodes report match reality? The only visible sign that something is wrong is that the replication_state and replication_sync_state values (streaming, async) are printed on the wrong line: they should appear for the standby node, not the primary. The streaming replication check is enabled, with sr_check_period set to 1.
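To show what I mean by "reality": this is roughly how I check the actual role of each backend, bypassing pgpool entirely (a quick sketch, using the same hosts and user as in the pool_nodes call above):

#!/bin/bash
# Ask each backend directly whether it is in recovery (t = standby, f = primary).
for host in 10.65.188.55 10.65.188.56; do
    echo -n "$host: "
    psql -U fabrix -w -h "$host" -p 5432 postgres -Atc 'SELECT pg_is_in_recovery()'
done
# A healthy pair prints one 'f' and one 't'; in my case both nodes print 'f',
# i.e. both run as independent primaries, yet pool_nodes still says "standby".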
It looks like something has gone wrong with auto failback. In the pgpool logs I see this piece:
2025-01-23 16:02:23: pid 1135: LOG: watchdog is informed of failover start by the main process
2025-01-23 16:02:23: pid 1128: LOG: starting fail back. reconnect host 10.65.188.56(5432)
2025-01-23 16:02:23: pid 1128: LOG: Node 0 is not down (status: 2)
2025-01-23 16:02:23: pid 1128: LOG: execute command: /etc/pgpool-II/recovery/failback_node.sh 1 10.65.188.56 5432
+ NODE_ID=1
+ NODE_HOST=10.65.188.56
+ NODE_PORT=5432
+ LOG_FILE=/home/postgres/pg_logs/failback.log
++ date
+ echo 'Thu Jan 23 16:02:23 IST 2025: Failback triggered for node 1 at 10.65.188.56:5432'
+ true
+ '[' 0 -eq 0 ']'
++ date
+ echo 'Thu Jan 23 16:02:23 IST 2025: Node 1 successfully reattached.'
2025-01-23 16:02:23: pid 1128: LOG: Do not restart children because we are failing back node id 1 host: 10.65.188.56 port: 5432 and we are in streaming replication mode and not all backends were down
2025-01-23 16:02:23: pid 1128: LOG: find_primary_node_repeatedly: follow primary is ongoing. return current primary: 0
2025-01-23 16:02:23: pid 1128: LOG: failover: set new primary node: 0
2025-01-23 16:02:23: pid 1128: LOG: failover: set new main node: 0
2025-01-23 16:02:23: pid 1135: LOG: received the failover indication from Pgpool-II on IPC interface
2025-01-23 16:02:23: pid 1135: LOG: watchdog is informed of failover end by the main process
2025-01-23 16:02:23: pid 19776: LOG: worker process received restart request
2025-01-23 16:02:23: pid 1128: LOG: failback done. reconnect host 10.65.188.56(5432)
2025-01-23 16:02:23: pid 19780: LOG: selecting backend connection
2025-01-23 16:02:23: pid 19780: DETAIL: failover or failback event detected, discarding existing connections
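For readability, here is roughly what failback_node.sh does, as I reconstruct it from the xtrace output above (this is not the verbatim script; the real one is in the archive linked below, and the comments show the argument values from this run):

#!/bin/bash
# Reconstruction of failback_node.sh based on the trace above.
NODE_ID=$1      # 1
NODE_HOST=$2    # 10.65.188.56
NODE_PORT=$3    # 5432
LOG_FILE=/home/postgres/pg_logs/failback.log

echo "$(date): Failback triggered for node $NODE_ID at $NODE_HOST:$NODE_PORT" >> "$LOG_FILE"
true    # in the trace the reattach step reduces to a bare 'true'
if [ $? -eq 0 ]; then
    echo "$(date): Node $NODE_ID successfully reattached." >> "$LOG_FILE"
fi

If my reading of the trace is right, nothing in this path checks the node's actual role before it is reported as reattached.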
The log mentions that "follow primary is ongoing", but the last call to follow_primary.sh happened quite a long time ago, so this flag looks stale to me.
Actually, I have two questions here:
1) Why does pgpool report incorrect states for the backends?
2) What is wrong with the follow_primary procedure? Does it make sense to use
follow_primary at all with only two nodes?
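Regarding question 1: as far as I understand, pool_nodes takes replication_state and replication_sync_state from pg_stat_replication on the node pgpool believes is primary, so this is what I would compare against (a sketch run directly on node 0; I may be wrong about where these columns come from):

# Run on the node pgpool reports as primary (10.65.188.55). On a healthy
# pair this returns one row describing the standby; on a primary with no
# standbys attached it returns nothing.
psql -U fabrix -w -h 10.65.188.55 -p 5432 postgres -c \
  'SELECT application_name, client_addr, state, sync_state FROM pg_stat_replication'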
My pgpool.conf, failback/failover scripts, and logs are available for 5 days
here: https://filebin.net/076qbpqicx3rffik
I'd highly appreciate any advice.
BR
Igor Yurchenko