[pgpool-hackers: 4270] Re: heartbeat behavior

Sat Jan 21 20:46:41 JST 2023

Oops. I attached the wrong patch.
Trying again...

From: Tatsuo Ishii <ishii at sraoss.co.jp>
Subject: [pgpool-hackers: 4269] heartbeat behavior
Date: Sat, 21 Jan 2023 20:38:58 +0900 (JST)
Message-ID: <20230121.203858.1235299725568995991.t-ishii at sranhm.sra.co.jp>

> Hi Usama,
> 
> I wanted to test the heartbeat of Pgpool-II and created a test support
> tool, which stops the heartbeat sender from sending heartbeat packets
> to a specified pgpool node. Attached is the patch to implement that.
> 
> While testing heartbeat with the tool, I found an interesting
> behavior of the watchdog.
> 
> 1) create a 3-node pgpool cluster
> 
> $ watchdog_setup -wn 3
> 
> 2) start the cluster and wait for the lifecheck process to start.
> 
> 3) create a "heartbeat_sender_control" file to prevent the heartbeat
> sender on pgpool1 from sending heartbeat packets to pgpool2.
> 
> $ echo 2 > pgpool1/log/heartbeat_sender_control
> 
> 4) create a "heartbeat_sender_control" file to prevent the heartbeat
> sender on pgpool2 from sending heartbeat packets to pgpool1.
> 
> $ echo 1 > pgpool2/log/heartbeat_sender_control
> 
> At this point pgpool1 sends heartbeat to pgpool0 but not to
> pgpool2.  Likewise, pgpool2 sends heartbeat to pgpool0 but not to
> pgpool1.
> 
> 5) wait until life_check reports that the node is "NODE DEAD".
> 
> Here is a pgpool log from pgpool1.
> 
> 2023-01-21 20:14:13.541: life_check pid 598177: LOG:  informing the node status change to watchdog
> 2023-01-21 20:14:13.541: life_check pid 598177: DETAIL:  node id :2 status = "NODE DEAD" message:"No heartbeat signal from node"
> 2023-01-21 20:14:13.541: watchdog pid 598125: LOG:  received node status change ipc message
> 2023-01-21 20:14:13.541: watchdog pid 598125: DETAIL:  No heartbeat signal from node
> 2023-01-21 20:14:13.541: watchdog pid 598125: LOG:  remote node "localhost:50008 Linux tishii-CFSV9-2" is lost
> 2023-01-21 20:14:13.542: watchdog pid 598125: LOG:  new watchdog node connection is received from "127.0.0.1:44185"
> 2023-01-21 20:14:13.542: watchdog pid 598125: LOG:  new outbound connection to localhost:50010 
> 2023-01-21 20:14:13.542: watchdog pid 598125: LOG:  new node joined the cluster hostname:"localhost" port:50010 pgpool_port:50008
> 2023-01-21 20:14:13.542: watchdog pid 598125: DETAIL:  Pgpool-II version:"4.5devel" watchdog messaging version: 1.2
> 2023-01-21 20:14:13.542: watchdog pid 598125: LOG:  The newly joined node:"localhost:50008 Linux tishii-CFSV9-2" had left the cluster because it was lost
> 2023-01-21 20:14:13.542: watchdog pid 598125: DETAIL:  lost reason was "REPORTED BY LIFECHECK" and startup time diff = 0
> 
> As I expected, "localhost:50008" (that is pgpool2) left the cluster. So far so good.
> 
> Then a strange thing happened:
> 
> 2023-01-21 20:14:13.542: watchdog pid 598125: LOG:  remote node "localhost:50008 Linux tishii-CFSV9-2" became reachable again
> 2023-01-21 20:14:13.542: watchdog pid 598125: DETAIL:  requesting the node info
> 2023-01-21 20:14:13.542: watchdog pid 598125: LOG:  remote node "localhost:50008 Linux tishii-CFSV9-2" is reporting that it has found us again
> 
> Why did pgpool2 come back even though life check continues to report
> that the node is dead? It seems the life check report has been ignored.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
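
For anyone repeating the test, steps 3) and 4) of the quoted message and
the log check can be sketched as two small shell helpers. This is only an
illustration, not part of the patch: the function names block_heartbeat
and lost_then_reachable are hypothetical, and the control file is assumed
to be honored only by a Pgpool-II build with hb_test.patch applied, as
described in the message.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and not part of Pgpool-II.
# The heartbeat_sender_control file is assumed to be read only by a
# build with hb_test.patch applied.

# block_heartbeat BASE SENDER TARGET
# Write TARGET's node id into pgpool<SENDER>/log/heartbeat_sender_control
# under BASE, mirroring "echo 2 > pgpool1/log/heartbeat_sender_control".
block_heartbeat() {
    base=$1
    sender=$2
    target=$3
    mkdir -p "$base/pgpool$sender/log"
    echo "$target" > "$base/pgpool$sender/log/heartbeat_sender_control"
}

# lost_then_reachable LOGFILE
# Print the name of each remote node that the watchdog logged as "is lost"
# and afterwards as "became reachable again".
lost_then_reachable() {
    awk '
        match($0, /remote node "[^"]*"/) {
            # Strip the leading "remote node " text and both quotes.
            node = substr($0, RSTART + 13, RLENGTH - 14)
            if ($0 ~ /" is lost/)
                lost[node] = 1
            else if ($0 ~ /" became reachable again/ && lost[node])
                print node
        }
    ' "$1"
}
```

From the directory created by watchdog_setup, "block_heartbeat . 1 2"
followed by "block_heartbeat . 2 1" corresponds to steps 3) and 4).
Running lost_then_reachable on the pgpool1 log (its path depends on the
setup) lists any node that was reported lost and then reachable again.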
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hb_test.patch
Type: text/x-patch
Size: 4039 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-hackers/attachments/20230121/a9d12975/attachment.bin>