[pgpool-hackers: 4439] Re: detach_false_primary could make all nodes go down

Sat Mar 16 22:31:16 JST 2024

>>>> One idea is to perform detach_false_primary only on the
>>>> watchdog leader node if watchdog is enabled. In the leader node, I
>>>> think there's no window between #3 and #4 and detach_false_primary
>>>> will skip node 0: pgpool will not regard node 1 as a false primary.
>>>> 
>>>> For additional protection, maybe detach_false_primary should only run
>>>> if quorum exists.
>>> 
>>> I have implemented this. Now detach_false_primary detaches a false
>>> primary only if one of the following is true:
>>> 
>>> - watchdog is not enabled.
>>> 
>>> - watchdog is enabled, quorum exists, and the leader watchdog node
>>>   detected the false primary.
>>> 
>>> See the attached patch for more details. However, with this patch,
>>> even if failover_require_consensus is on, detach_false_primary does
>>> not require consensus from other watchdog nodes.  Can we accept this?
> 
> I am thinking about some scenarios.
> 
> Scenario 1:
> 
> Suppose there are 3 pgpool nodes namely pgpool0 (leader), pgpool1,
> pgpool2 and two PostgreSQL backends node0 (primary), node1 (standby).
> 
> 1) pgpool0 detects node1 down. Actually node1 is alive but the network
>    between pgpool0 and node1 is down.
> 
> 2) Since pgpool1 and pgpool2 do not agree that node1 is in down state,
>    node1 becomes quarantined for pgpool0 (assuming
>    failover_require_consensus is on).
> 
> 3) node1 is accidentally promoted.
> 
> 4) pgpool0 happily skips false primary check against node1 since it is
>    in quarantine state.
> 
> 5) Now pgpool1 and pgpool2 can see two primary PostgreSQL nodes, node0
>    and node1.
> 
> I think this situation is harmless as long as frontends access pgpool
> via the VIP (attached to pgpool0). Pgpool0 disregards node1 (remember
> it's in quarantine state) anyway. If a frontend accesses pgpool1 or
> pgpool2, it's not good because the frontend may see outdated data on
> node1, which is no longer a standby connected to node0.
> 
> Scenario 2:
> 
> 1) pgpool0 detects node0 down. Actually node0 is alive but the network
>    between pgpool0 and node0 is down.
> 
> 2) Since pgpool1 and pgpool2 do not agree that node0 is in down state,
>    node0 becomes quarantined for pgpool0 (assuming
>    failover_require_consensus is on).
> 
> 3) pgpool1 is elected as a new leader.
> 
> 4) pgpool1 performs detach_false_primary because it's a watchdog
>    leader's task.
> 
> In summary, I think scenario #1 is harmless in common setups.
> Scenario #2 is harmless.
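
To make scenario 1 easier to follow, here is a minimal sketch in Python.
It is not pgpool code: the function name and the data layout are
hypothetical, and it only models "a pgpool node ignores backends it has
quarantined", which is the point of steps 4 and 5.

```python
def visible_primaries(roles, quarantined):
    """Return the backends a pgpool node regards as primary.

    roles:       dict mapping backend name -> "primary" or "standby"
    quarantined: set of backend names quarantined on this pgpool node
                 (hypothetical representation, for illustration only)
    """
    return sorted(
        name
        for name, role in roles.items()
        if role == "primary" and name not in quarantined
    )


# State after step 3 of scenario 1: node1 has been accidentally promoted.
roles = {"node0": "primary", "node1": "primary"}

# Per-pgpool quarantine sets: only pgpool0 lost its network to node1.
views = {
    "pgpool0": {"node1"},
    "pgpool1": set(),
    "pgpool2": set(),
}

for pgpool, quarantined in views.items():
    print(pgpool, visible_primaries(roles, quarantined))
```

In this model pgpool0 (the leader, holding the VIP) sees only node0,
while pgpool1 and pgpool2 see both node0 and node1.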

I have pushed the patch to the master branch along with some
documentation changes (adding a note that detach_false_primary ignores
watchdog consensus). If you find anything wrong with the commit, please
let me know. I will fix it.
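
For reference, the condition described earlier in this thread can be
written down as a small Python sketch. This is only an illustration of
the rule, not the actual C code in the commit; the function and
parameter names are assumptions made for this example.

```python
def may_detach_false_primary(watchdog_enabled, quorum_exists, is_leader):
    """Decide whether this pgpool node may detach a false primary.

    Rule modeled (from the thread):
      - watchdog is not enabled, or
      - watchdog is enabled, quorum exists, and this node is the
        watchdog leader.
    failover_require_consensus is deliberately not an input: the check
    does not ask other watchdog nodes for consensus.
    """
    if not watchdog_enabled:
        return True
    return bool(quorum_exists and is_leader)


# A standby watchdog node never detaches, even when quorum exists.
print(may_detach_false_primary(True, True, False))
# The leader detaches only while quorum exists.
print(may_detach_false_primary(True, False, True))
```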

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp