[pgpool-hackers: 3332] Re: [pgpool-committers: 5734] pgpool: Fix for duplicate failover request ...

Mon May 27 15:17:19 JST 2019

Hello, pgpool-hackers

I think this patch preserves the behavior "when failover_when_quorum_exists=true,
Pgpool-II will perform the degenerate/failover on backend node if the
quorum exists".  Even if the quarantined status has been released once,
the quorum check is still required for the next failover.
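
To illustrate the behavior I mean, here is a minimal sketch in Python. It is
only a simplified model, not Pgpool-II's actual code: the names NodeStatus,
Backend, on_health_check_failure and on_health_check_success are hypothetical,
and the watchdog voting is reduced to a single quorum_exists flag.

```python
from dataclasses import dataclass
from enum import Enum


class NodeStatus(Enum):
    UP = "up"
    QUARANTINED = "quarantined"
    DOWN = "down"


@dataclass
class Backend:
    # Hypothetical stand-in for a backend node as seen by one pgpool node.
    status: NodeStatus = NodeStatus.UP


def on_health_check_failure(backend, quorum_exists,
                            failover_when_quorum_exists=True):
    """Decide the node status after a failed health check.

    Assumption: without quorum the node is only quarantined (no failover
    command is run); with quorum a real failover marks it down.
    """
    if failover_when_quorum_exists and not quorum_exists:
        backend.status = NodeStatus.QUARANTINED
    else:
        backend.status = NodeStatus.DOWN
    return backend.status


def on_health_check_success(backend):
    """Release a quarantined node when it becomes reachable again.

    A node that was really failed over (DOWN) is not brought back here.
    """
    if backend.status is NodeStatus.QUARANTINED:
        backend.status = NodeStatus.UP
    return backend.status
```

In this model the quorum is evaluated again on every failure, so a node that
was quarantined and later released still goes through the same quorum check
the next time its health check fails.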

The document also says "... until the master node is quarantined ... ",
but it doesn't describe how or when the quarantined status is released.

What does the regression test that fails with the patch confirm?
Is there any case in which the change makes Pgpool-II 3.7.x worse?

with best regards,
Haruka Takatsuka

On Mon, 27 May 2019 14:25:53 +0900 (JST)
Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> > Hello, Usama and pgpool-hackers
> > 
> > Thank you very much for writing and committing patches to solve the issue
> > that I reported.
> > 
> > What do you think about back-patching them into 3.7.x or older versions?
> > Without the patch, a Pgpool-II watchdog cluster can easily fall into a
> > state in which it doesn't work.
> > 
> > The quarantine feature hasn't been documented in detail up to now, so
> > I think the behavior change of the first patch (*1) poses no problem.
> > My customer who has the trouble related to this mail thread uses a 3.7.x
> > version and strongly hopes for such a pgpool improvement.
> > 
> > *1
> > https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=3dd1cd3f1528
> 
> We know that this patch breaks an existing regression test. So I don't
> think the behavior change would go unnoticed by users if the patch is
> back-ported.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> 
> > Thanks,
> > Haruka Takatsuka
> > 
> > 
> > On Tue, 21 May 2019 12:00:03 +0500
> > Muhammad Usama <m.usama at gmail.com> wrote:
> > 
> >> Hi Ishii-San
> >> 
> >> The discussion on the thread [pgpool-hackers: 3318] yielded two patches:
> >> one to continue the health check on the quarantined node, and the other
> >> for the de-escalation and resignation of the master watchdog if the
> >> primary backend node goes into quarantine state on the master.
> >> So this commit only takes care of the first part, which is to continue
> >> the health check, and I still have to commit the second patch, which takes
> >> care of resigning from master status. The regression failure of test
> >> 013.watchdog_failover_require_consensus will also be dealt with in
> >> the second patch for this issue.
> >> 
> >> I am sorry, I think I am missing something about the consensus reached in
> >> the discussion.
> >> I think we agreed on the thread to commit both patches, but only in the
> >> master branch, since it was a change to the existing behaviour and we
> >> didn't want to back-port it to older branches.
> >> 
> >> Please see the snippet from our discussion on the thread, from which I
> >> infer that we are in agreement to commit the changes.
> >> 
> >> --quote--
> >> ...
> >> > Now if we look at the quarantined nodes, they are just as good as alive
> >> > nodes (but unreachable by pgpool at the moment),
> >> > because when a node was quarantined, Pgpool-II never executed any
> >> > failover and/or follow_master commands
> >> > and did not interfere with the PostgreSQL backend in any way to alter its
> >> > timeline or recovery states.
> >> > So when a quarantined node becomes reachable again, it is safe to
> >> > automatically connect it back to Pgpool-II.
> >> 
> >> Ok, that makes sense.
> >> 
> >> >> >> BTW,
> >> >> >>
> >> >> >> > > When the communication between master/coordinator pgpool and
> >> >> >> > > primary PostgreSQL node is down during a short period
> >> >> >> >
> >> >> >> > I wonder why you don't set appropriate health check retry parameters
> >> >> >> > to avoid such a temporary communication failure in the first place. A
> >> >> >> > brain surgery to ignore the error reports from Pgpool-II does not seem
> >> >> >> > to be a sane choice.
> >> >> >>
> >> >> >> The original reporter didn't answer my question. I think it is likely
> >> >> >> a problem of misconfiguration (should use longer health check retry).
> >> >> >>
> >> >> >> In summary I think for shorter period communication failure just
> >> >> >> increasing health check parameters is enough. However for longer
> >> >> >> period communication failure, the watchdog node should decline the
> >> >> >> role.
> >> >> >>
> >> >> >
> >> >> > I am sorry, I didn't quite get what you mean here.
> >> >> > Do you mean that the pgpool-II node that has the primary node in
> >> >> > quarantine state should resign from being the master/coordinator
> >> >> > pgpool-II node (if it was the master/coordinator) in that case?
> >> >>
> >> >> Yes, exactly. Note that if the PostgreSQL node is one of the standbys,
> >> >> keeping the quarantine state is fine because user queries can still
> >> >> be processed.
> >> >>
> >> >
> >> > Yes that makes total sense. I will make that change as separate patch.
> >> 
> >> Thanks. However this will change existing behavior. Probably we should
> >> make the change against master branch only?
> >> 
> >> --unquote--
> >> 
> >> 
> >> 
> >> On Tue, May 21, 2019 at 4:32 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> >> 
> >> > Usama,
> >> >
> >> > Since this commit, the regression test/buildfarm has been failing:
> >> >
> >> > testing 013.watchdog_failover_require_consensus... failed.
> >> >
> >> > Also I think this commit seems to be against the consensus made in the
> >> > discussion [pgpool-hackers: 3295] thread.
> >> >
> >> > I thought we agreed on [pgpool-hackers: 3318] so that:
> >> >
> >> > ---------------------------------------------------------------------
> >> > Hi Usama,
> >> >
> >> > > Hi
> >> > >
> >> > > I have drafted a patch to make the master watchdog node resign from
> >> > > master responsibilities if it fails to get the consensus for its
> >> > > primary backend node failover request. The patch is still a little short
> >> > > on testing, but I want to share the early version to get
> >> > > feedback on the behaviour.
> >> > > Also, with this implementation the master/coordinator node only resigns
> >> > > from being a master
> >> > > when it fails to get the consensus for the primary node failover, but in
> >> > > case of failed consensus for a standby node failover
> >> > > no action is taken by the watchdog master node. Do you think the master
> >> > > should resign in this case as well?
> >> >
> >> > I don't think so, because queries can still be routed to the primary (or
> >> > other standby servers if there are two or more standbys).
> >> >
> >> 
> >> My understanding from this part of the discussion was that we agreed to
> >> keep the master status of the watchdog node if one of the standby nodes
> >> on the pgpool watchdog-master goes into quarantine, and to resign only
> >> if the primary gets quarantined.
> >> 
> >> Have I misunderstood something?
> >> 
> >> Thanks
> >> Best regards
> >> Muhammad Usama
> >> 
> >> 
> >> 
> >> 
> >> > ---------------------------------------------------------------------
> >> >
> >> > From: Muhammad Usama <m.usama at gmail.com>
> >> > Subject: [pgpool-committers: 5734] pgpool: Fix for [pgpool-hackers: 3295]
> >> > duplicate failover request ...
> >> > Date: Wed, 15 May 2019 21:40:01 +0000
> >> > Message-ID: <E1hR1d3-0005o4-1k at gothos.postgresql.org>
> >> >
> >> > > Fix for [pgpool-hackers: 3295] duplicate failover request ...
> >> > >
> >> > > Pgpool should keep the backend health check running on quarantined nodes
> >> > > so that when the connectivity resumes, they automatically get removed
> >> > > from the quarantine. Otherwise a temporary network glitch could send
> >> > > the node into a permanent quarantine state.
> >> > >
> >> > > Branch
> >> > > ------
> >> > > master
> >> > >
> >> > > Details
> >> > > -------
> >> > >
> >> > https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=3dd1cd3f15287ee6bb8b09f0642f99db98e9776a
> >> > >
> >> > > Modified Files
> >> > > --------------
> >> > > src/main/health_check.c | 28 ++++++++++++++++++++++++----
> >> > > 1 file changed, 24 insertions(+), 4 deletions(-)
> >> > >
> > 
> 

______________________________________________________________________
 Haruka Takatsuka  harukat at sraoss.co.jp  SRA OSS, Inc. http://www.sraoss.co.jp
 2-32-8 Minami-Ikebukuro, Toshima-ku, Tokyo 171-0022, Japan
 TEL: 03-5979-2701  FAX: 03-5979-2702  CellPhone: 080-1292-3396