[pgpool-general: 9167] Re: Query 2 Node HA test case result
Tatsuo Ishii
ishii at sraoss.co.jp
Thu Jul 4 15:50:03 JST 2024
[To: pgsql-general at lists.postgresql.org removed]
Please do not cross-post pgpool discussions to
pgsql-general at lists.postgresql.org.
> Hello everyone,
> We are doing a POC of a Postgres HA setup with streaming replication (async),
> using Pgpool-II for load balancing & connection pooling and repmgr for
> setting up HA & automatic failover.
> We are running a test case in which the VM1 node is completely isolated from
> the network for more than 2 minutes and then plugged back in, since we want
> to verify how the system behaves during network glitches and whether there is
> any chance of split-brain.
> Our current setup looks like this:
> 2 VMs on Azure cloud; each VM runs Postgres along with the Pgpool
> service.
> [image: architecture diagram, not included]
>
> We enabled watchdog and assigned a delegate IP.
> *NOTE: due to some limitations we are using a floating IP as the
> delegate IP.*
>
> During the test, here are our observations:
> 1. Client connections hung from the moment VM1 was lost from the
> network until VM1 came back to normal.
> 2. Once VM1 was lost, Pgpool promoted VM2 as the LEADER node and the
> Postgres standby on VM2 was promoted to primary as well, but client
> connections still did not reach the new primary. Why is this not
> happening?
> 3. Once VM1 was back on the network, there was a split-brain situation,
> where pgpool on VM1 took the lead to become the LEADER node (as pgpool.log
> shows), and from then on clients connect to the VM1 node via the VIP.
>
> *pgpool.conf*
The cause is that the watchdog cluster could not hold the quorum:
> 2024-07-03 14:31:08.899: watchdog pid 58151: LOG: failover command
> [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node
> "staging-ha0002:9999 Linux staging-ha0002" is rejected because the watchdog
> cluster does not hold the quorum
Have you enabled enable_consensus_with_half_votes? This is necessary
if you configure an even number of watchdog nodes and still want to be
able to reach quorum.
https://www.pgpool.net/docs/latest/en/html/runtime-watchdog-config.html#CONFIG-WATCHDOG-FAILOVER-BEHAVIOR
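To put numbers on it: with the default setting the watchdog considers
the quorum to exist only when more than half of the watchdog nodes are
alive, i.e. floor(2/2) + 1 = 2 votes for a 2-node cluster. After the
partition each surviving pgpool holds only 1 vote, so the quorum is
lost and the DEGENERATE_BACKEND_REQUEST is downgraded to a quarantine
request, exactly as your logs show. With
enable_consensus_with_half_votes = on, exactly half of the votes
(1 of 2) is considered enough.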
As documented on that page, a configuration with an even number of
watchdog nodes (including two) risks split-brain problems and is not
recommended.
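If you nevertheless keep the two-node watchdog cluster, a minimal
sketch of the relevant pgpool.conf settings could look like the
following (values are illustrative; failover_when_quorum_exists and
failover_require_consensus are already on by default):

    # watchdog failover behavior -- illustrative sketch for a 2-node cluster
    enable_consensus_with_half_votes = on   # treat exactly half of the votes (1 of 2) as quorum
    failover_when_quorum_exists = on        # perform failover only while the quorum is held
    failover_require_consensus = on         # require consensus of the voting nodes before failing over

Note that allowing quorum with half of the votes also weakens the
split-brain protection, which is another reason an odd number of
watchdog nodes (three or more) is preferable.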
Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp
> sr_check_period = 10
>
> health_check_period = 30
>
> health_check_timeout = 20
>
> health_check_max_retries = 3
>
> health_check_retry_delay = 1
>
> wd_lifecheck_method = 'heartbeat'
>
> wd_interval = 10
>
> wd_heartbeat_keepalive = 2
>
> wd_heartbeat_deadtime = 30
>
>
> *Log information:*
>
> From VM2:
>
> Pgpool.log
>
> 14:30:17: network disconnected
>
> After 10 seconds the streaming replication check failed with a timeout.
>
> 2024-07-03 14:30:26.176: sr_check_worker pid 58187: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
>
>
> Then pgpool's health check failed because it timed out, per
> health_check_timeout set to 20 sec.
>
> 2024-07-03 14:30:35.869: health_check0 pid 58188: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
>
>
> Retried health_check & sr_check, but they timed out again.
>
>
>
> 2024-07-03 14:30:46.187: sr_check_worker pid 58187: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
> 2024-07-03 14:30:46.880: health_check0 pid 58188: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
>
>
> The watchdog received a message saying the leader node was lost.
>
>
>
> 2024-07-03 14:30:47.192: watchdog pid 58151: WARNING: we have not received
> a beacon message from leader node "staging-ha0001:9999 Linux staging-ha0001"
>
> 2024-07-03 14:30:47.192: watchdog pid 58151: DETAIL: requesting info
> message from leader node
>
> 2024-07-03 14:30:54.312: watchdog pid 58151: LOG: read from socket failed,
> remote end closed the connection
>
> 2024-07-03 14:30:54.312: watchdog pid 58151: LOG: client socket of
> staging-ha0001:9999 Linux staging-ha0001 is closed
>
> 2024-07-03 14:30:54.313: watchdog pid 58151: LOG: remote node
> "staging-ha0001:9999 Linux staging-ha0001" is reporting that it has lost us
>
> 2024-07-03 14:30:54.313: watchdog pid 58151: LOG: we are lost on the
> leader node "staging-ha0001:9999 Linux staging-ha0001"
>
>
>
> Retried health_check & sr_check, but they timed out again.
>
>
>
> 2024-07-03 14:30:57.888: health_check0 pid 58188: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
> 2024-07-03 14:30:57.888: health_check0 pid 58188: LOG: health check
> retrying on DB node: 0 (round:3)
>
> 2024-07-03 14:31:06.201: sr_check_worker pid 58187: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
>
>
>
>
> About 10 seconds after the leader node was lost, the watchdog changed
> the current node to the LEADER node.
>
> 2024-07-03 14:31:04.199: watchdog pid 58151: LOG: watchdog node state
> changed from [STANDING FOR LEADER] to [LEADER]
>
>
>
>
>
> health_check failed on node 0, a degenerate request was received for
> node 0, and the pgpool main process started quarantining
> staging-ha0001(5432) (shutting down).
>
>
>
> 2024-07-03 14:31:08.202: watchdog pid 58151: LOG: setting the local node
> "staging-ha0002:9999 Linux staging-ha0002" as watchdog cluster leader
>
> 2024-07-03 14:31:08.202: watchdog pid 58151: LOG:
> signal_user1_to_parent_with_reason(1)
>
> 2024-07-03 14:31:08.202: watchdog pid 58151: LOG: I am the cluster leader
> node but we do not have enough nodes in cluster
>
> 2024-07-03 14:31:08.202: watchdog pid 58151: DETAIL: waiting for the
> quorum to start escalation process
>
> 2024-07-03 14:31:08.202: main pid 58147: LOG: Pgpool-II parent process
> received SIGUSR1
>
> 2024-07-03 14:31:08.202: main pid 58147: LOG: Pgpool-II parent process
> received watchdog state change signal from watchdog
>
> 2024-07-03 14:31:08.899: health_check0 pid 58188: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
> 2024-07-03 14:31:08.899: health_check0 pid 58188: LOG: health check failed
> on node 0 (timeout:0)
>
> 2024-07-03 14:31:08.899: health_check0 pid 58188: LOG: received degenerate
> backend request for node_id: 0 from pid [58188]
>
> 2024-07-03 14:31:08.899: watchdog pid 58151: LOG: watchdog received the
> failover command from local pgpool-II on IPC interface
>
> 2024-07-03 14:31:08.899: watchdog pid 58151: LOG: watchdog is processing
> the failover command [DEGENERATE_BACKEND_REQUEST] received from local
> pgpool-II on IPC interface
>
> 2024-07-03 14:31:08.899: watchdog pid 58151: LOG: failover requires the
> quorum to hold, which is not present at the moment
>
> 2024-07-03 14:31:08.899: watchdog pid 58151: DETAIL: Rejecting the
> failover request
>
> 2024-07-03 14:31:08.899: watchdog pid 58151: LOG: failover command
> [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node
> "staging-ha0002:9999 Linux staging-ha0002" is rejected because the watchdog
> cluster does not hold the quorum
>
> 2024-07-03 14:31:08.900: health_check0 pid 58188: LOG: degenerate backend
> request for 1 node(s) from pid [58188], is changed to quarantine node
> request by watchdog
>
> 2024-07-03 14:31:08.900: health_check0 pid 58188: DETAIL: watchdog does
> not holds the quorum
>
> 2024-07-03 14:31:08.900: health_check0 pid 58188: LOG:
> signal_user1_to_parent_with_reason(0)
>
> 2024-07-03 14:31:08.900: main pid 58147: LOG: Pgpool-II parent process
> received SIGUSR1
>
> 2024-07-03 14:31:08.900: main pid 58147: LOG: Pgpool-II parent process has
> received failover request
>
> 2024-07-03 14:31:08.900: watchdog pid 58151: LOG: received the failover
> indication from Pgpool-II on IPC interface
>
> 2024-07-03 14:31:08.900: watchdog pid 58151: LOG: watchdog is informed of
> failover start by the main process
>
> 2024-07-03 14:31:08.900: main pid 58147: LOG: === Starting quarantine.
> shutdown host staging-ha0001(5432) ===
>
> 2024-07-03 14:31:08.900: main pid 58147: LOG: Restart all children
>
> 2024-07-03 14:31:08.900: main pid 58147: LOG: failover: set new primary
> node: -1
>
> 2024-07-03 14:31:08.900: main pid 58147: LOG: failover: set new main node:
> 1
>
> 2024-07-03 14:31:08.906: sr_check_worker pid 58187: ERROR: Failed to check
> replication time lag
>
> 2024-07-03 14:31:08.906: sr_check_worker pid 58187: DETAIL: No persistent
> db connection for the node 0
>
> 2024-07-03 14:31:08.906: sr_check_worker pid 58187: HINT: check
> sr_check_user and sr_check_password
>
> 2024-07-03 14:31:08.906: sr_check_worker pid 58187: CONTEXT: while
> checking replication time lag
>
> 2024-07-03 14:31:08.906: sr_check_worker pid 58187: LOG: worker process
> received restart request
>
> 2024-07-03 14:31:08.906: watchdog pid 58151: LOG: received the failover
> indication from Pgpool-II on IPC interface
>
> 2024-07-03 14:31:08.906: watchdog pid 58151: LOG: watchdog is informed of
> failover end by the main process
>
> 2024-07-03 14:31:08.906: main pid 58147: LOG: === Quarantine done.
> shutdown host staging-ha0001(5432) ===
>
> 2024-07-03 14:31:09.906: pcp_main pid 58186: LOG: restart request received
> in pcp child process
>
> 2024-07-03 14:31:09.907: main pid 58147: LOG: PCP child 58186 exits with
> status 0 in failover()
>
> 2024-07-03 14:31:09.908: main pid 58147: LOG: fork a new PCP child pid
> 58578 in failover()
>
> 2024-07-03 14:31:09.908: main pid 58147: LOG: reaper handler
>
> 2024-07-03 14:31:09.908: pcp_main pid 58578: LOG: PCP process: 58578
> started
>
> 2024-07-03 14:31:09.909: main pid 58147: LOG: reaper handler: exiting
> normally
>
> 2024-07-03 14:31:09.909: sr_check_worker pid 58579: LOG: process started
>
> 2024-07-03 14:31:19.915: watchdog pid 58151: LOG: not able to send
> messages to remote node "staging-ha0001:9999 Linux staging-ha0001"
>
> 2024-07-03 14:31:19.915: watchdog pid 58151: DETAIL: marking the node as
> lost
>
> 2024-07-03 14:31:19.915: watchdog pid 58151: LOG: remote node
> "staging-ha0001:9999 Linux staging-ha0001" is lost
>
>
>
>
>
>
>
> From VM1:
>
> *pgpool.log*
>
> 2024-07-03 14:30:36.444: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons
>
> 2024-07-03 14:30:36.444: watchdog pid 8620: DETAIL: missed beacon reply
> count:2
>
> 2024-07-03 14:30:37.448: sr_check_worker pid 65605: LOG: failed to connect
> to PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:30:46.067: health_check1 pid 8676: LOG: failed to connect to
> PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:30:46.068: health_check1 pid 8676: LOG: health check
> retrying on DB node: 1 (round:1)
>
> 2024-07-03 14:30:46.455: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons
>
> 2024-07-03 14:30:46.455: watchdog pid 8620: DETAIL: missed beacon reply
> count:3
>
> 2024-07-03 14:30:47.449: sr_check_worker pid 65605: ERROR: Failed to check
> replication time lag
>
> 2024-07-03 14:30:47.449: sr_check_worker pid 65605: DETAIL: No persistent
> db connection for the node 1
>
> 2024-07-03 14:30:47.449: sr_check_worker pid 65605: HINT: check
> sr_check_user and sr_check_password
>
> 2024-07-03 14:30:47.449: sr_check_worker pid 65605: CONTEXT: while
> checking replication time lag
>
> 2024-07-03 14:30:55.104: child pid 65509: LOG: failover or failback event
> detected
>
> 2024-07-03 14:30:55.104: child pid 65509: DETAIL: restarting myself
>
> 2024-07-03 14:30:55.104: main pid 8617: LOG: reaper handler
>
> 2024-07-03 14:30:55.105: main pid 8617: LOG: reaper handler: exiting
> normally
>
> 2024-07-03 14:30:56.459: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons
>
> 2024-07-03 14:30:56.459: watchdog pid 8620: DETAIL: missed beacon reply
> count:4
>
> 2024-07-03 14:30:56.459: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is not responding to our beacon
> messages
>
> 2024-07-03 14:30:56.459: watchdog pid 8620: DETAIL: marking the node as
> lost
>
> 2024-07-03 14:30:56.459: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is lost
>
> 2024-07-03 14:30:56.460: watchdog pid 8620: LOG: removing watchdog node
> "staging-ha0002:9999 Linux staging-ha0002" from the standby list
>
> 2024-07-03 14:30:56.460: watchdog pid 8620: LOG: We have lost the quorum
>
> 2024-07-03 14:30:56.460: watchdog pid 8620: LOG:
> signal_user1_to_parent_with_reason(3)
>
> 2024-07-03 14:30:56.460: main pid 8617: LOG: Pgpool-II parent process
> received SIGUSR1
>
> 2024-07-03 14:30:56.460: main pid 8617: LOG: Pgpool-II parent process
> received watchdog quorum change signal from watchdog
>
> 2024-07-03 14:30:56.461: watchdog_utility pid 66197: LOG: watchdog:
> de-escalation started
>
> sudo: a terminal is required to read the password; either use the -S option
> to read from standard input or configure an askpass helper
>
> 2024-07-03 14:30:57.078: health_check1 pid 8676: LOG: failed to connect to
> PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:30:57.078: health_check1 pid 8676: LOG: health check
> retrying on DB node: 1 (round:2)
>
> 2024-07-03 14:30:57.418: life_check pid 8639: LOG: informing the node
> status change to watchdog
>
> 2024-07-03 14:30:57.418: life_check pid 8639: DETAIL: node id :1 status =
> "NODE DEAD" message:"No heartbeat signal from node"
>
> 2024-07-03 14:30:57.418: watchdog pid 8620: LOG: received node status
> change ipc message
>
> 2024-07-03 14:30:57.418: watchdog pid 8620: DETAIL: No heartbeat signal
> from node
>
> 2024-07-03 14:30:57.418: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is lost
>
> 2024-07-03 14:30:57.464: sr_check_worker pid 65605: LOG: failed to connect
> to PostgreSQL server on "staging-ha0002:5432", timed out
>
> sudo: a password is required
>
> 2024-07-03 14:30:59.301: watchdog_utility pid 66197: LOG: failed to
> release the delegate IP:"10.127.1.20"
>
> 2024-07-03 14:30:59.301: watchdog_utility pid 66197: DETAIL: 'if_down_cmd'
> failed
>
> 2024-07-03 14:30:59.301: watchdog_utility pid 66197: WARNING: watchdog
> de-escalation failed to bring down delegate IP
>
> 2024-07-03 14:30:59.301: watchdog pid 8620: LOG: watchdog de-escalation
> process with pid: 66197 exit with SUCCESS.
>
>
>
> 2024-07-03 14:31:07.465: sr_check_worker pid 65605: ERROR: Failed to check
> replication time lag
>
> 2024-07-03 14:31:07.465: sr_check_worker pid 65605: DETAIL: No persistent
> db connection for the node 1
>
> 2024-07-03 14:31:07.465: sr_check_worker pid 65605: HINT: check
> sr_check_user and sr_check_password
>
> 2024-07-03 14:31:07.465: sr_check_worker pid 65605: CONTEXT: while
> checking replication time lag
>
> 2024-07-03 14:31:08.089: health_check1 pid 8676: LOG: failed to connect to
> PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:31:08.089: health_check1 pid 8676: LOG: health check
> retrying on DB node: 1 (round:3)
>
> 2024-07-03 14:31:17.480: sr_check_worker pid 65605: LOG: failed to connect
> to PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:31:19.097: health_check1 pid 8676: LOG: failed to connect to
> PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:31:19.097: health_check1 pid 8676: LOG: health check failed
> on node 1 (timeout:0)
>
> 2024-07-03 14:31:19.097: health_check1 pid 8676: LOG: received degenerate
> backend request for node_id: 1 from pid [8676]
>
> 2024-07-03 14:31:19.097: watchdog pid 8620: LOG: watchdog received the
> failover command from local pgpool-II on IPC interface
>
> 2024-07-03 14:31:19.097: watchdog pid 8620: LOG: watchdog is processing
> the failover command [DEGENERATE_BACKEND_REQUEST] received from local
> pgpool-II on IPC interface
>
> 2024-07-03 14:31:19.097: watchdog pid 8620: LOG: failover requires the
> quorum to hold, which is not present at the moment
>
> 2024-07-03 14:31:19.097: watchdog pid 8620: DETAIL: Rejecting the failover
> request
>
> 2024-07-03 14:31:19.097: watchdog pid 8620: LOG: failover command
> [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node
> "staging-ha0001:9999 Linux staging-ha0001" is rejected because the watchdog
> cluster does not hold the quorum
>
> 2024-07-03 14:31:19.097: health_check1 pid 8676: LOG: degenerate backend
> request for 1 node(s) from pid [8676], is changed to quarantine node
> request by watchdog
>
> 2024-07-03 14:31:19.097: health_check1 pid 8676: DETAIL: watchdog does not
> holds the quorum
>
> 2024-07-03 14:31:19.097: health_check1 pid 8676: LOG:
> signal_user1_to_parent_with_reason(0)
>
> 2024-07-03 14:31:19.097: main pid 8617: LOG: Pgpool-II parent process
> received SIGUSR1
>
> 2024-07-03 14:31:19.097: main pid 8617: LOG: Pgpool-II parent process has
> received failover request
>
> 2024-07-03 14:31:19.098: watchdog pid 8620: LOG: received the failover
> indication from Pgpool-II on IPC interface
>
> 2024-07-03 14:31:19.098: watchdog pid 8620: LOG: watchdog is informed of
> failover start by the main process
>
> 2024-07-03 14:31:19.098: main pid 8617: LOG: === Starting quarantine.
> shutdown host staging-ha0002(5432) ===
>
> 2024-07-03 14:31:19.098: main pid 8617: LOG: Do not restart children
> because we are switching over node id 1 host: staging-ha0002 port: 5432 and
> we are in streaming replication mode
>
> 2024-07-03 14:31:19.098: main pid 8617: LOG: failover: set new primary
> node: 0
>
> 2024-07-03 14:31:19.098: main pid 8617: LOG: failover: set new main node: 0
>
> 2024-07-03 14:31:19.098: sr_check_worker pid 65605: ERROR: Failed to check
> replication time lag
>
> 2024-07-03 14:31:19.098: sr_check_worker pid 65605: DETAIL: No persistent
> db connection for the node 1
>
> 2024-07-03 14:31:19.098: sr_check_worker pid 65605: HINT: check
> sr_check_user and sr_check_password
>
> 2024-07-03 14:31:19.098: sr_check_worker pid 65605: CONTEXT: while
> checking replication time lag
>
> 2024-07-03 14:31:19.098: sr_check_worker pid 65605: LOG: worker process
> received restart request
>
> 2024-07-03 14:31:19.098: watchdog pid 8620: LOG: received the failover
> indication from Pgpool-II on IPC interface
>
> 2024-07-03 14:31:19.098: watchdog pid 8620: LOG: watchdog is informed of
> failover end by the main process
>
> 2024-07-03 14:31:19.098: main pid 8617: LOG: === Quarantine done. shutdown
> host staging-ha0002(5432) ==
>
>
>
>
>
> 2024-07-03 14:35:59.420: watchdog pid 8620: LOG: new outbound connection
> to staging-ha0002:9000
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: "staging-ha0001:9999
> Linux staging-ha0001" is the coordinator as per our record but
> "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a
> coordinator
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: cluster is in the
> split-brain
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: I am the coordinator but
> "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a
> coordinator
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: trying to figure out
> the best contender for the leader/coordinator node
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: remote
> node:"staging-ha0002:9999 Linux staging-ha0002" should step down from
> leader because we are the older leader
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: We are in split brain,
> and I am the best candidate for leader/coordinator
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: asking the remote node
> "staging-ha0002:9999 Linux staging-ha0002" to step down
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: we have received the NODE
> INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
> was lost
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: we had lost this node
> because of "REPORTED BY LIFECHECK"
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: node:"staging-ha0002:9999
> Linux staging-ha0002" was reported lost by the life-check process
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: node will be added to
> cluster once life-check mark it as reachable again
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: "staging-ha0001:9999
> Linux staging-ha0001" is the coordinator as per our record but
> "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a
> coordinator
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: cluster is in the
> split-brain
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: LOG: I am the coordinator but
> "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a
> coordinator
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL: trying to figure out
> the best contender for the leader/coordinator node
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: LOG: remote
> node:"staging-ha0002:9999 Linux staging-ha0002" should step down from
> leader because we are the older leader
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: LOG: We are in split brain,
> and I am the best candidate for leader/coordinator
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL: asking the remote node
> "staging-ha0002:9999 Linux staging-ha0002" to step down
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: LOG: we have received the NODE
> INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
> was lost
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL: we had lost this node
> because of "REPORTED BY LIFECHECK"
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: LOG: node:"staging-ha0002:9999
> Linux staging-ha0002" was reported lost by the life-check process
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL: node will be added to
> cluster once life-check mark it as reachable again
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is reporting that it has found
> us again
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: LOG: leader/coordinator node
> "staging-ha0002:9999 Linux staging-ha0002" decided to resign from leader,
> probably because of split-brain
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL: It was not our
> coordinator/leader anyway. ignoring the message
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: LOG: we have received the NODE
> INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
> was lost
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL: we had lost this node
> because of "REPORTED BY LIFECHECK"
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: LOG: node:"staging-ha0002:9999
> Linux staging-ha0002" was reported lost by the life-check process
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL: node will be added to
> cluster once life-check mark it as reachable again
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: LOG: we have received the NODE
> INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
> was lost
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL: we had lost this node
> because of "REPORTED BY LIFECHECK"
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: LOG: node:"staging-ha0002:9999
> Linux staging-ha0002" was reported lost by the life-check process
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL: node will be added to
> cluster once life-check mark it as reachable again
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: LOG: we have received the NODE
> INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
> was lost
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL: we had lost this node
> because of "REPORTED BY LIFECHECK"
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: LOG: node:"staging-ha0002:9999
> Linux staging-ha0002" was reported lost by the life-check process
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL: node will be added to
> cluster once life-check mark it as reachable again
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: LOG: we have received the NODE
> INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
> was lost
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL: we had lost this node
> because of "REPORTED BY LIFECHECK"
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: LOG: node:"staging-ha0002:9999
> Linux staging-ha0002" was reported lost by the life-check process
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL: node will be added to
> cluster once life-check mark it as reachable again
>
> 2024-07-03 14:36:00.213: health_check1 pid 8676: LOG: failed to connect to
> PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:36:00.213: health_check1 pid 8676: LOG: health check
> retrying on DB node: 1 (round:3)
>
> 2024-07-03 14:36:01.221: health_check1 pid 8676: LOG: health check
> retrying on DB node: 1 succeeded
>
> 2024-07-03 14:36:01.221: health_check1 pid 8676: LOG: received failback
> request for node_id: 1 from pid [8676]
>
> 2024-07-03 14:36:01.221: health_check1 pid 8676: LOG: failback request
> from pid [8676] is changed to update status request because node_id: 1 was
> quarantined
>
> 2024-07-03 14:36:01.221: health_check1 pid 8676: LOG:
> signal_user1_to_parent_with_reason(0)
>
> 2024-07-03 14:36:01.221: main pid 8617: LOG: Pgpool-II parent process
> received SIGUSR1
>
> 2024-07-03 14:36:01.221: main pid 8617: LOG: Pgpool-II parent process has
> received failover request
>
> 2024-07-03 14:36:01.221: watchdog pid 8620: LOG: received the failover
> indication from Pgpool-II on IPC interface
>
> 2024-07-03 14:36:01.221: watchdog pid 8620: LOG: watchdog is informed of
> failover start by the main process
>
> 2024-07-03 14:36:01.221: watchdog pid 8620: LOG: watchdog is informed of
> failover start by the main process
>
> 2024-07-03 14:36:01.222: main pid 8617: LOG: === Starting fail back.
> reconnect host staging-ha0002(5432) ===
>
> 2024-07-03 14:36:01.222: main pid 8617: LOG: Node 0 is not down (status: 2)
>
> 2024-07-03 14:36:01.222: main pid 8617: LOG: Do not restart children
> because we are failing back node id 1 host: staging-ha0002 port: 5432 and
> we are in streaming replication mode and not all backends were down
>
> 2024-07-03 14:36:01.222: main pid 8617: LOG: failover: set new primary
> node: 0
>
> 2024-07-03 14:36:01.222: main pid 8617: LOG: failover: set new main node: 0
>
> 2024-07-03 14:36:01.222: sr_check_worker pid 66222: LOG: worker process
> received restart request
>
> 2024-07-03 14:36:01.222: watchdog pid 8620: LOG: received the failover
> indication from Pgpool-II on IPC interface
>
> 2024-07-03 14:36:01.222: watchdog pid 8620: LOG: watchdog is informed of
> failover end by the main process
>
> 2024-07-03 14:36:01.222: main pid 8617: LOG: === Failback done. reconnect
> host staging-ha0002(5432) ===
>
>
> *Questions:*
> 1. Regarding point 2 in the observations, why are the connections not going
> to the new primary?
> 2. In this kind of setup, will a transaction split happen when there is a
> network glitch?
>
> If anyone has worked on a similar kind of setup, please share your insights.
> Thank you
>
> Regards
> Mukesh