[pgpool-hackers: 4439] Re: detach_false_primary could make all nodes go down
Tatsuo Ishii
ishii at sraoss.co.jp
Sat Mar 16 22:31:16 JST 2024
>>>> One of the ideas is, performing detach_false_primary only on the
>>>> watchdog leader node if watchdog is enabled. In the leader node, I
>>>> think there's no window between #3 and #4 and detach_false_primary
>>>> will skip node 0: pgpool will not regard node 1 as a false primary.
>>>>
>>>> For additional protection, maybe detach_false_primary should only run
>>>> if quorum exists.
>>>
>>> I have implemented this. Now detach_false_primary detaches a false
>>> primary only if one of the following is true:
>>>
>>> - watchdog is not enabled.
>>>
>>> - watchdog is enabled, quorum exists, and the leader watchdog node
>>> detected the false primary.
>>>
>>> See attached patch for more details. However, with this patch even if
>>> failover_require_consensus is on, detach_false_primary does not
>>> require consensus from other watchdog nodes. Can we accept this?
>
> I am thinking about some scenarios.
>
> Scenario 1:
>
> Suppose there are 3 pgpool nodes namely pgpool0 (leader), pgpool1,
> pgpool2 and two PostgreSQL backends node0 (primary), node1 (standby).
>
> 1) pgpool0 detects node1 down. Actually node1 is alive but the network
> between pgpool0 and node1 is down.
>
> 2) Since pgpool1 and pgpool2 do not agree that node1 is in down state,
> node1 becomes quarantined for pgpool0 (assuming
> failover_require_consensus is on).
>
> 3) node1 is accidentally promoted.
>
> 4) pgpool0 happily skips false primary check against node1 since it is
> in quarantine state.
>
> 5) Now pgpool1 and pgpool2 can see two primary PostgreSQL nodes, node0
> and node1.
>
> I think this situation is harmless as long as frontends access pgpool
> via the VIP (attached to pgpool0). Pgpool0 disregards node1 (remember it's
> in quarantine state) anyway. If a frontend accesses pgpool1 or pgpool2,
> it's not good because the frontend may see outdated data on node1,
> which is no longer a standby connected to node0.
>
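Step 4 of Scenario 1 can be illustrated with a small sketch. This is
not pgpool code: the function name, the status strings and the role
strings are all hypothetical, chosen only to show how a quarantined
backend drops out of the false primary check on one pgpool node while
the other nodes still see two primaries.

```python
def false_primary_candidates(statuses, roles):
    """Return backends that report 'primary' and are eligible for the check.

    statuses and roles are hypothetical dicts keyed by backend name.
    A backend in 'quarantine' state is skipped, as in step 4 above.
    """
    return [
        node
        for node, role in roles.items()
        if role == "primary" and statuses[node] != "quarantine"
    ]


# After step 3 both backends claim to be primary.
roles = {"node0": "primary", "node1": "primary"}

# pgpool0 has node1 in quarantine; pgpool1 and pgpool2 see both as up.
pgpool0_view = {"node0": "up", "node1": "quarantine"}
other_view = {"node0": "up", "node1": "up"}

print(false_primary_candidates(pgpool0_view, roles))
print(false_primary_candidates(other_view, roles))
```

With pgpool0's view only node0 is considered, so no false primary is
found there. With the view of pgpool1 or pgpool2 both node0 and node1
are returned, which corresponds to step 5.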
> Scenario 2:
>
> 1) pgpool0 detects node0 down. Actually node0 is alive but the network
> between pgpool0 and node0 is down.
>
> 2) Since pgpool1 and pgpool2 do not agree that node0 is in down state,
> node0 becomes quarantined for pgpool0 (assuming
> failover_require_consensus is on).
>
> 3) pgpool1 is elected as a new leader.
>
> 4) pgpool1 performs detach_false_primary because it's a watchdog
> leader's task.
>
> In summary, I think scenario #1 is harmless in common setups, and
> scenario #2 is harmless.

I have pushed the patch to the master branch along with some documentation
changes (adding a note that detach_false_primary ignores watchdog
consensus). If you find anything wrong with the commit, please let me
know. I will fix it.
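
For readers skimming the thread, the rule quoted above (no watchdog,
or watchdog with quorum on the leader) can be written as a tiny
sketch. This is only an illustration in Python, not the C code in the
commit; WatchdogState and may_detach_false_primary are hypothetical
names, and the three fields are a simplified assumption about what the
real check looks at.

```python
from dataclasses import dataclass


@dataclass
class WatchdogState:
    # Hypothetical summary of the local pgpool node's watchdog view.
    enabled: bool
    is_leader: bool = False
    quorum_exists: bool = False


def may_detach_false_primary(wd: WatchdogState) -> bool:
    """Return True if this pgpool node may detach a false primary."""
    if not wd.enabled:
        # No watchdog: the single pgpool node decides on its own.
        return True
    # Watchdog enabled: only the leader acts, and only while quorum exists.
    # Consensus from the other watchdog nodes is not consulted here.
    return wd.is_leader and wd.quorum_exists


# A standby watchdog node never detaches, even with quorum.
print(may_detach_false_primary(WatchdogState(True, False, True)))
# The leader detaches only while quorum exists.
print(may_detach_false_primary(WatchdogState(True, True, True)))
```

Note that failover_require_consensus does not appear anywhere in the
sketch, which matches the documentation note mentioned above.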

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp