[pgpool-hackers: 4548] Re: PGPool4.2 changing leader role and healthcheck
Bo Peng
pengbo at sraoss.co.jp
Thu Dec 5 16:19:00 JST 2024
Hi,
> 1. I have 2 pgpool instances that watch each other and handle the pgpool VIP.
> I see that when the current pgpool leader goes down, the role switch and
> VIP move happen with a significant delay. In the logs I see this picture:
>
> It has significant delays at 14:40:12 and on acquiring the VIP at 14:40:16.
> The quorum settings in pgpool.conf are:
>
> failover_when_quorum_exists=off
> failover_require_consensus=on
> allow_multiple_failover_requests_from_node=off
These parameters are configured for PostgreSQL failover behavior, not for Pgpool-II leader node switchover.
If you want to reduce the time required for a Pgpool-II leader node switchover,
you can decrease the values of the parameters below:
wd_interval = 10
wd_heartbeat_deadtime = 30
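For example, something like the following could be tried (the values below are only illustrative and need to be tuned for your environment; smaller values make a dead leader be detected sooner, but raise the risk of false detections on a loaded network):

wd_interval = 3                # life check interval in seconds (default 10)
wd_heartbeat_deadtime = 10     # seconds without heartbeat before the remote node is considered dead (default 30)

Note that these are watchdog parameters, so Pgpool-II needs to be restarted on each node for the change to take effect.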
> 2. The second question is about the health check logic. Do I understand correctly
> that if a backend goes into the down state, its health check is stopped?
> If yes, how can I ensure that a failed backend that comes back (after a hardware
> issue, for example) gets recovered?
> Or is that impossible within pgpool, so I should use third-party tools for
> tracking backends and triggering the recovery?
Only a failed standby node can be reattached to pgpool automatically by setting "auto_failback = on" when it recovers.
A failed primary node cannot be reattached to pgpool automatically. You need to recover it manually.
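As a minimal sketch, assuming streaming replication mode with sr_check already configured (auto_failback relies on the streaming replication check to notice that the standby is healthy again):

auto_failback = on              # automatically reattach a recovered standby
auto_failback_interval = 60     # minimum interval in seconds between automatic failback attempts

For a failed former primary, the usual pattern is to re-sync it as a standby first (pg_rewind or a fresh base backup) and then reattach it by hand, for example:

pcp_attach_node -h <pgpool_host> -p 9898 -U <pcp_user> -n <node_id>

The host, port, user and node id above are placeholders for your environment; pcp_recovery_node can be used instead if online recovery is set up.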
On Mon, 2 Dec 2024 19:26:23 +0200
Igor Yurchenko <harry.urcen at gmail.com> wrote:
> Hi guys
>
> Need your hints on some weird behaviors of PGPool 4.2.
>
> 1. I have 2 pgpool instances that watch each other and handle the pgpool VIP.
> I see that when the current pgpool leader goes down, the role switch and
> VIP move happen with a significant delay. In the logs I see this picture:
>
> 2024-12-02 14:40:12: pid 1286: LOG: watchdog node state changed from
> [INITIALIZING] to [LEADER]
> 2024-12-02 14:40:12: pid 1286: LOG: Setting failover command timeout to 1
> 2024-12-02 14:40:12: pid 1286: LOG: I am announcing my self as
> leader/coordinator watchdog node
> 2024-12-02 14:40:16: pid 1286: LOG: I am the cluster leader node
> 2024-12-02 14:40:16: pid 1286: DETAIL: our declare coordinator message is
> accepted by all nodes
> 2024-12-02 14:40:16: pid 1286: LOG: setting the local node "
> 10.65.188.56:9999 Linux pg-mgrdb2" as watchdog cluster leader
> 2024-12-02 14:40:16: pid 1286: LOG: signal_user1_to_parent_with_reason(1)
> 2024-12-02 14:40:16: pid 1286: LOG: I am the cluster leader node. Starting
> escalation process
> 2024-12-02 14:40:16: pid 1281: LOG: Pgpool-II parent process received
> SIGUSR1
> 2024-12-02 14:40:16: pid 1281: LOG: Pgpool-II parent process received
> watchdog state change signal from watchdog
> 2024-12-02 14:40:16: pid 1286: LOG: escalation process started with
> PID:4855
> 2024-12-02 14:40:16: pid 4855: LOG: watchdog: escalation started
> 2024-12-02 14:40:20: pid 4855: LOG: successfully acquired the delegate
> IP:"10.65.188.59"
> 2024-12-02 14:40:20: pid 4855: DETAIL: 'if_up_cmd' returned with success
> 2024-12-02 14:40:20: pid 1286: LOG: watchdog escalation process with pid:
> 4855 exit with SUCCESS.
>
> It has significant delays at 14:40:12 and on acquiring the VIP at 14:40:16.
> The quorum settings in pgpool.conf are:
>
> failover_when_quorum_exists=off
> failover_require_consensus=on
> allow_multiple_failover_requests_from_node=off
>
> So I have no idea why it happens.
>
> 2. The second question is about the health check logic. Do I understand correctly
> that if a backend goes into the down state, its health check is stopped?
> If yes, how can I ensure that a failed backend that comes back (after a hardware
> issue, for example) gets recovered?
> Or is that impossible within pgpool, so I should use third-party tools for
> tracking backends and triggering the recovery?
>
> BR
> Igor Yurchenko
--
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS K.K.
TEL: 03-5979-2701 FAX: 03-5979-2702
URL: https://www.sraoss.co.jp/