[pgpool-hackers: 4266] Re: Watchdog heartbeat issue
Muhammad Usama
muhammad.usama at percona.com
Thu Jan 19 01:28:14 JST 2023
On Wed, Jan 18, 2023 at 6:37 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> > On Tue, Jan 17, 2023 at 6:49 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> >
> >> Hi Usama,
> >>
> >> Thank you for investigating the issue.
> >>
> >> > Hi Ishii San
> >> >
> >> > Thanks for figuring out the issue.
> >> > I think removing the code in question altogether could mark the remote
> >> > node as dead too early at startup and could delay watchdog cluster
> >> > stabilization when there is a delay of a few seconds between node
> >> > startups. So IMHO the way to solve this is to wait for twice
> >> > wd_interval or wd_heartbeat_deadtime (depending on the configuration)
> >> > if is_wd_lifecheck_ready() reports a failure.
> >> >
> >> > What do you think of the attached patch?
> >>
> >> Probably I am missing something, but I wonder why the watchdog leader
> >> node's lifecheck does not notice that the node 1 watchdog will never
> >> send a heartbeat signal. In the pgpool0 log:
> >>
> >> 2023-01-14 00:27:15: watchdog pid 26708: LOG: read from socket failed, remote end closed the connection
> >> 2023-01-14 00:27:15: watchdog pid 26708: LOG: client socket of localhost:50004 Linux abf1b59af489 is closed
> >> 2023-01-14 00:27:15: watchdog pid 26708: LOG: remote node "localhost:50004 Linux abf1b59af489" is shutting down
> >> 2023-01-14 00:27:15: watchdog pid 26708: LOG: removing watchdog node "localhost:50004 Linux abf1b59af489" from the standby list
> >>
> >> It seems the leader watchdog already noticed that node 1 was down.
> >>
> >
> > When the watchdog fails to communicate with a remote node despite
> > retries, it marks that node's status as lost/down. The lifecheck
> > process, on the other hand, only reports a node-down status to the
> > watchdog process when the heartbeat breaks after at least one
> > successful heartbeat cycle has completed.
>
> I see.
>
> I would like to confirm whether my understanding is correct.
>
> There are 3 nodes configured. Node 0 and node 1 started, but node 2 did
> not. In this case I think lifecheck does not start on node 0 and node 1
> because the lifecheck process is waiting for node 2 to come up.
>
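The startup gating in question can be pictured with a minimal sketch like the one below: lifecheck holds off until every configured watchdog node has completed at least one successful heartbeat cycle, so a node that never starts keeps the check from ever running. This is only an illustration under that assumption, not the actual pgpool-II watchdog code; the type and function names are hypothetical.

/* Illustrative sketch only -- not the actual pgpool-II watchdog code.
 * The type and function names here are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_WD_NODES 3

typedef struct
{
    bool completed_first_heartbeat_cycle;
} wd_node;

/* Lifecheck considers itself ready only when every configured node has
 * finished at least one successful heartbeat cycle. */
static bool
lifecheck_is_ready(const wd_node *nodes, int n_nodes)
{
    for (int i = 0; i < n_nodes; i++)
        if (!nodes[i].completed_first_heartbeat_cycle)
            return false;
    return true;
}

int
main(void)
{
    /* Node 0 and node 1 started; node 2 never did. */
    wd_node nodes[NUM_WD_NODES] = {{true}, {true}, {false}};

    /* Lifecheck on node 0 and node 1 therefore never becomes ready,
     * so it never reports node 1's later shutdown to the watchdog. */
    printf("lifecheck ready: %s\n",
           lifecheck_is_ready(nodes, NUM_WD_NODES) ? "yes" : "no");
    return 0;
}
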
Apparently that is the behavior, and that is wrong.
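
For reference, the grace-period idea in the attached patch amounts to something like the following: when is_wd_lifecheck_ready() reports a failure at startup, tolerate it for twice wd_interval (or wd_heartbeat_deadtime, depending on the configuration) before giving up on the remote node. This is a rough sketch under those assumptions, not the patch itself, and the names are hypothetical.

/* Rough sketch of the proposed startup grace period -- not the attached
 * patch and not the actual pgpool-II code; names are hypothetical. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static int wd_interval = 10;           /* seconds, as in pgpool.conf */
static int wd_heartbeat_deadtime = 30; /* seconds, as in pgpool.conf */

/* When the readiness check fails, tolerate it for twice the configured
 * interval/deadtime before treating the remote node as lost. */
static bool
within_startup_grace(time_t started_at, bool uses_heartbeat)
{
    int grace = 2 * (uses_heartbeat ? wd_heartbeat_deadtime : wd_interval);
    return (time(NULL) - started_at) <= grace;
}

int
main(void)
{
    time_t started_at = time(NULL);
    bool   lifecheck_ready = false; /* e.g. what is_wd_lifecheck_ready() returned */

    if (!lifecheck_ready && !within_startup_grace(started_at, true))
        puts("grace period expired: mark the remote node as lost");
    else
        puts("still within the startup grace period: keep waiting");
    return 0;
}
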
Regards
Muhammad Usama
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese: http://www.sraoss.co.jp
>