[pgpool-general: 9167] Re: Query 2 Node HA test case result
Tatsuo Ishii
ishii at sraoss.co.jp
Thu Jul 4 15:50:03 JST 2024
[To: pgsql-general at lists.postgresql.org removed]
Please do not cross-post pgpool discussions to
pgsql-general at lists.postgresql.org.
> Hello everyone,
> We are doing a POC of a Postgres HA setup with streaming replication (async),
> using Pgpool-II for load balancing & connection pooling and repmgr for
> setting up HA & automatic failover.
> We are running a test case in which the VM1 node is completely isolated from
> the network for more than 2 minutes and then plugged back in, since we want
> to verify how the system behaves during network glitches and whether there is
> any chance of split-brain.
> Our current setup looks like this:
> 2 VMs on Azure cloud; each VM runs Postgres along with the Pgpool
> service.
> [image: architecture diagram, not included]
>
> We enabled watchdog and assigned a delegate IP.
> *NOTE: due to some limitations we are using a floating IP as the
> delegate IP.*
>
> During the test, here are our observations:
> 1. Client connections hung from the moment VM1 was lost from the
> network until VM1 came back to normal.
> 2. Once VM1 was lost, Pgpool promoted VM2 as the LEADER node and the
> Postgres standby on VM2 was promoted to primary as well, but client
> connections still did not reach the new primary. Why is this not
> happening?
> 3. Once VM1 was back on the network, there was a split-brain situation,
> where pgpool on VM1 took the lead to become the LEADER node (as pgpool.log
> shows), and from then on clients connect to the VM1 node via the VIP.
>
> *pgpool.conf*
The cause is that the watchdog cluster could not hold the quorum:
> 2024-07-03 14:31:08.899: watchdog pid 58151: LOG: failover command
> [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node
> "staging-ha0002:9999 Linux staging-ha0002" is rejected because the watchdog
> cluster does not hold the quorum
Have you enabled enable_consensus_with_half_votes? This is necessary
if you configure an even number of watchdog nodes and still want to be
able to reach quorum.
https://www.pgpool.net/docs/latest/en/html/runtime-watchdog-config.html#CONFIG-WATCHDOG-FAILOVER-BEHAVIOR
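To put numbers on it: with the default setting the watchdog considers
the quorum to exist only when more than half of the watchdog nodes are
alive, i.e. floor(2/2) + 1 = 2 votes for a 2-node cluster. After the
partition each surviving pgpool holds only 1 vote, so the quorum is
lost and the DEGENERATE_BACKEND_REQUEST is downgraded to a quarantine
request, exactly as your logs show. With
enable_consensus_with_half_votes = on, exactly half of the votes
(1 of 2) is considered enough.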
As documented on that page, a configuration with an even number of
watchdog nodes (including two) risks split-brain problems and is not
recommended.
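If you nevertheless keep the two-node watchdog cluster, a minimal
sketch of the relevant pgpool.conf settings could look like the
following (values are illustrative; failover_when_quorum_exists and
failover_require_consensus are already on by default):

    # watchdog failover behavior -- illustrative sketch for a 2-node cluster
    enable_consensus_with_half_votes = on   # treat exactly half of the votes (1 of 2) as quorum
    failover_when_quorum_exists = on        # perform failover only while the quorum is held
    failover_require_consensus = on         # require consensus of the voting nodes before failing over

Note that allowing quorum with half of the votes also weakens the
split-brain protection, which is another reason an odd number of
watchdog nodes (three or more) is preferable.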
Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp
> sr_check_period = 10
>
> health_check_period = 30
>
> health_check_timeout = 20
>
> health_check_max_retries = 3
>
> health_check_retry_delay = 1
>
> wd_lifecheck_method = 'heartbeat'
>
> wd_interval = 10
>
> wd_heartbeat_keepalive = 2
>
> wd_heartbeat_deadtime = 30
>
>
> *Log information:*
>
> From VM2:
>
> Pgpool.log
>
> 14:30:17: network disconnected
>
> After 10 seconds the streaming replication check failed with a timeout.
>
> 2024-07-03 14:30:26.176: sr_check_worker pid 58187: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
>
>
> Then pgpool's health check failed because it timed out, per
> health_check_timeout set to 20 sec.
>
> 2024-07-03 14:30:35.869: health_check0 pid 58188: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
>
>
> Retried health_check & sr_check, but they timed out again.
>
>
>
> 2024-07-03 14:30:46.187: sr_check_worker pid 58187: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
> 2024-07-03 14:30:46.880: health_check0 pid 58188: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
>
>
> The watchdog received a message saying the leader node was lost.
>
>
>
> 2024-07-03 14:30:47.192: watchdog pid 58151: WARNING: we have not received
> a beacon message from leader node "staging-ha0001:9999 Linux staging-ha0001"
>
> 2024-07-03 14:30:47.192: watchdog pid 58151: DETAIL: requesting info
> message from leader node
>
> 2024-07-03 14:30:54.312: watchdog pid 58151: LOG: read from socket failed,
> remote end closed the connection
>
> 2024-07-03 14:30:54.312: watchdog pid 58151: LOG: client socket of
> staging-ha0001:9999 Linux staging-ha0001 is closed
>
> 2024-07-03 14:30:54.313: watchdog pid 58151: LOG: remote node
> "staging-ha0001:9999 Linux staging-ha0001" is reporting that it has lost us
>
> 2024-07-03 14:30:54.313: watchdog pid 58151: LOG: we are lost on the
> leader node "staging-ha0001:9999 Linux staging-ha0001"
>
>
>
> Retried health_check & sr_check, but they timed out again.
>
>
>
> 2024-07-03 14:30:57.888: health_check0 pid 58188: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
> 2024-07-03 14:30:57.888: health_check0 pid 58188: LOG: health check
> retrying on DB node: 0 (round:3)
>
> 2024-07-03 14:31:06.201: sr_check_worker pid 58187: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
>
>
>
>
> About 10 seconds after the leader node was lost, the watchdog changed
> the current node to the LEADER node.
>
> 2024-07-03 14:31:04.199: watchdog pid 58151: LOG: watchdog node state
> changed from [STANDING FOR LEADER] to [LEADER]
>
>
>
>
>
> health_check failed on node 0, a degenerate request was received for
> node 0, and the pgpool main process started quarantining
> staging-ha0001(5432) (shutting down).
>
>
>
> 2024-07-03 14:31:08.202: watchdog pid 58151: LOG: setting the local node
> "staging-ha0002:9999 Linux staging-ha0002" as watchdog cluster leader
>
> 2024-07-03 14:31:08.202: watchdog pid 58151: LOG:
> signal_user1_to_parent_with_reason(1)
>
> 2024-07-03 14:31:08.202: watchdog pid 58151: LOG: I am the cluster leader
> node but we do not have enough nodes in cluster
>
> 2024-07-03 14:31:08.202: watchdog pid 58151: DETAIL: waiting for the
> quorum to start escalation process
>
> 2024-07-03 14:31:08.202: main pid 58147: LOG: Pgpool-II parent process
> received SIGUSR1
>
> 2024-07-03 14:31:08.202: main pid 58147: LOG: Pgpool-II parent process
> received watchdog state change signal from watchdog
>
> 2024-07-03 14:31:08.899: health_check0 pid 58188: LOG: failed to connect
> to PostgreSQL server on "staging-ha0001:5432", timed out
>
> 2024-07-03 14:31:08.899: health_check0 pid 58188: LOG: health check failed
> on node 0 (timeout:0)
>
> 2024-07-03 14:31:08.899: health_check0 pid 58188: LOG: received degenerate
> backend request for node_id: 0 from pid [58188]
>
> 2024-07-03 14:31:08.899: watchdog pid 58151: LOG: watchdog received the
> failover command from local pgpool-II on IPC interface
>
> 2024-07-03 14:31:08.899: watchdog pid 58151: LOG: watchdog is processing
> the failover command [DEGENERATE_BACKEND_REQUEST] received from local
> pgpool-II on IPC interface
>
> 2024-07-03 14:31:08.899: watchdog pid 58151: LOG: failover requires the
> quorum to hold, which is not present at the moment
>
> 2024-07-03 14:31:08.899: watchdog pid 58151: DETAIL: Rejecting the
> failover request
>
> 2024-07-03 14:31:08.899: watchdog pid 58151: LOG: failover command
> [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node
> "staging-ha0002:9999 Linux staging-ha0002" is rejected because the watchdog
> cluster does not hold the quorum
>
> 2024-07-03 14:31:08.900: health_check0 pid 58188: LOG: degenerate backend
> request for 1 node(s) from pid [58188], is changed to quarantine node
> request by watchdog
>
> 2024-07-03 14:31:08.900: health_check0 pid 58188: DETAIL: watchdog does
> not holds the quorum
>
> 2024-07-03 14:31:08.900: health_check0 pid 58188: LOG:
> signal_user1_to_parent_with_reason(0)
>
> 2024-07-03 14:31:08.900: main pid 58147: LOG: Pgpool-II parent process
> received SIGUSR1
>
> 2024-07-03 14:31:08.900: main pid 58147: LOG: Pgpool-II parent process has
> received failover request
>
> 2024-07-03 14:31:08.900: watchdog pid 58151: LOG: received the failover
> indication from Pgpool-II on IPC interface
>
> 2024-07-03 14:31:08.900: watchdog pid 58151: LOG: watchdog is informed of
> failover start by the main process
>
> 2024-07-03 14:31:08.900: main pid 58147: LOG: === Starting quarantine.
> shutdown host staging-ha0001(5432) ===
>
> 2024-07-03 14:31:08.900: main pid 58147: LOG: Restart all children
>
> 2024-07-03 14:31:08.900: main pid 58147: LOG: failover: set new primary
> node: -1
>
> 2024-07-03 14:31:08.900: main pid 58147: LOG: failover: set new main node:
> 1
>
> 2024-07-03 14:31:08.906: sr_check_worker pid 58187: ERROR: Failed to check
> replication time lag
>
> 2024-07-03 14:31:08.906: sr_check_worker pid 58187: DETAIL: No persistent
> db connection for the node 0
>
> 2024-07-03 14:31:08.906: sr_check_worker pid 58187: HINT: check
> sr_check_user and sr_check_password
>
> 2024-07-03 14:31:08.906: sr_check_worker pid 58187: CONTEXT: while
> checking replication time lag
>
> 2024-07-03 14:31:08.906: sr_check_worker pid 58187: LOG: worker process
> received restart request
>
> 2024-07-03 14:31:08.906: watchdog pid 58151: LOG: received the failover
> indication from Pgpool-II on IPC interface
>
> 2024-07-03 14:31:08.906: watchdog pid 58151: LOG: watchdog is informed of
> failover end by the main process
>
> 2024-07-03 14:31:08.906: main pid 58147: LOG: === Quarantine done.
> shutdown host staging-ha0001(5432) ===
>
> 2024-07-03 14:31:09.906: pcp_main pid 58186: LOG: restart request received
> in pcp child process
>
> 2024-07-03 14:31:09.907: main pid 58147: LOG: PCP child 58186 exits with
> status 0 in failover()
>
> 2024-07-03 14:31:09.908: main pid 58147: LOG: fork a new PCP child pid
> 58578 in failover()
>
> 2024-07-03 14:31:09.908: main pid 58147: LOG: reaper handler
>
> 2024-07-03 14:31:09.908: pcp_main pid 58578: LOG: PCP process: 58578
> started
>
> 2024-07-03 14:31:09.909: main pid 58147: LOG: reaper handler: exiting
> normally
>
> 2024-07-03 14:31:09.909: sr_check_worker pid 58579: LOG: process started
>
> 2024-07-03 14:31:19.915: watchdog pid 58151: LOG: not able to send
> messages to remote node "staging-ha0001:9999 Linux staging-ha0001"
>
> 2024-07-03 14:31:19.915: watchdog pid 58151: DETAIL: marking the node as
> lost
>
> 2024-07-03 14:31:19.915: watchdog pid 58151: LOG: remote node
> "staging-ha0001:9999 Linux staging-ha0001" is lost
>
>
>
>
>
>
>
> From VM1:
>
> *pgpool.log*
>
> 2024-07-03 14:30:36.444: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons
>
> 2024-07-03 14:30:36.444: watchdog pid 8620: DETAIL: missed beacon reply
> count:2
>
> 2024-07-03 14:30:37.448: sr_check_worker pid 65605: LOG: failed to connect
> to PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:30:46.067: health_check1 pid 8676: LOG: failed to connect to
> PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:30:46.068: health_check1 pid 8676: LOG: health check
> retrying on DB node: 1 (round:1)
>
> 2024-07-03 14:30:46.455: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons
>
> 2024-07-03 14:30:46.455: watchdog pid 8620: DETAIL: missed beacon reply
> count:3
>
> 2024-07-03 14:30:47.449: sr_check_worker pid 65605: ERROR: Failed to check
> replication time lag
>
> 2024-07-03 14:30:47.449: sr_check_worker pid 65605: DETAIL: No persistent
> db connection for the node 1
>
> 2024-07-03 14:30:47.449: sr_check_worker pid 65605: HINT: check
> sr_check_user and sr_check_password
>
> 2024-07-03 14:30:47.449: sr_check_worker pid 65605: CONTEXT: while
> checking replication time lag
>
> 2024-07-03 14:30:55.104: child pid 65509: LOG: failover or failback event
> detected
>
> 2024-07-03 14:30:55.104: child pid 65509: DETAIL: restarting myself
>
> 2024-07-03 14:30:55.104: main pid 8617: LOG: reaper handler
>
> 2024-07-03 14:30:55.105: main pid 8617: LOG: reaper handler: exiting
> normally
>
> 2024-07-03 14:30:56.459: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons
>
> 2024-07-03 14:30:56.459: watchdog pid 8620: DETAIL: missed beacon reply
> count:4
>
> 2024-07-03 14:30:56.459: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is not responding to our beacon
> messages
>
> 2024-07-03 14:30:56.459: watchdog pid 8620: DETAIL: marking the node as
> lost
>
> 2024-07-03 14:30:56.459: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is lost
>
> 2024-07-03 14:30:56.460: watchdog pid 8620: LOG: removing watchdog node
> "staging-ha0002:9999 Linux staging-ha0002" from the standby list
>
> 2024-07-03 14:30:56.460: watchdog pid 8620: LOG: We have lost the quorum
>
> 2024-07-03 14:30:56.460: watchdog pid 8620: LOG:
> signal_user1_to_parent_with_reason(3)
>
> 2024-07-03 14:30:56.460: main pid 8617: LOG: Pgpool-II parent process
> received SIGUSR1
>
> 2024-07-03 14:30:56.460: main pid 8617: LOG: Pgpool-II parent process
> received watchdog quorum change signal from watchdog
>
> 2024-07-03 14:30:56.461: watchdog_utility pid 66197: LOG: watchdog:
> de-escalation started
>
> sudo: a terminal is required to read the password; either use the -S option
> to read from standard input or configure an askpass helper
>
> 2024-07-03 14:30:57.078: health_check1 pid 8676: LOG: failed to connect to
> PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:30:57.078: health_check1 pid 8676: LOG: health check
> retrying on DB node: 1 (round:2)
>
> 2024-07-03 14:30:57.418: life_check pid 8639: LOG: informing the node
> status change to watchdog
>
> 2024-07-03 14:30:57.418: life_check pid 8639: DETAIL: node id :1 status =
> "NODE DEAD" message:"No heartbeat signal from node"
>
> 2024-07-03 14:30:57.418: watchdog pid 8620: LOG: received node status
> change ipc message
>
> 2024-07-03 14:30:57.418: watchdog pid 8620: DETAIL: No heartbeat signal
> from node
>
> 2024-07-03 14:30:57.418: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is lost
>
> 2024-07-03 14:30:57.464: sr_check_worker pid 65605: LOG: failed to connect
> to PostgreSQL server on "staging-ha0002:5432", timed out
>
> sudo: a password is required
>
> 2024-07-03 14:30:59.301: watchdog_utility pid 66197: LOG: failed to
> release the delegate IP:"10.127.1.20"
>
> 2024-07-03 14:30:59.301: watchdog_utility pid 66197: DETAIL: 'if_down_cmd'
> failed
>
> 2024-07-03 14:30:59.301: watchdog_utility pid 66197: WARNING: watchdog
> de-escalation failed to bring down delegate IP
>
> 2024-07-03 14:30:59.301: watchdog pid 8620: LOG: watchdog de-escalation
> process with pid: 66197 exit with SUCCESS.
>
>
>
> 2024-07-03 14:31:07.465: sr_check_worker pid 65605: ERROR: Failed to check
> replication time lag
>
> 2024-07-03 14:31:07.465: sr_check_worker pid 65605: DETAIL: No persistent
> db connection for the node 1
>
> 2024-07-03 14:31:07.465: sr_check_worker pid 65605: HINT: check
> sr_check_user and sr_check_password
>
> 2024-07-03 14:31:07.465: sr_check_worker pid 65605: CONTEXT: while
> checking replication time lag
>
> 2024-07-03 14:31:08.089: health_check1 pid 8676: LOG: failed to connect to
> PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:31:08.089: health_check1 pid 8676: LOG: health check
> retrying on DB node: 1 (round:3)
>
> 2024-07-03 14:31:17.480: sr_check_worker pid 65605: LOG: failed to connect
> to PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:31:19.097: health_check1 pid 8676: LOG: failed to connect to
> PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:31:19.097: health_check1 pid 8676: LOG: health check failed
> on node 1 (timeout:0)
>
> 2024-07-03 14:31:19.097: health_check1 pid 8676: LOG: received degenerate
> backend request for node_id: 1 from pid [8676]
>
> 2024-07-03 14:31:19.097: watchdog pid 8620: LOG: watchdog received the
> failover command from local pgpool-II on IPC interface
>
> 2024-07-03 14:31:19.097: watchdog pid 8620: LOG: watchdog is processing
> the failover command [DEGENERATE_BACKEND_REQUEST] received from local
> pgpool-II on IPC interface
>
> 2024-07-03 14:31:19.097: watchdog pid 8620: LOG: failover requires the
> quorum to hold, which is not present at the moment
>
> 2024-07-03 14:31:19.097: watchdog pid 8620: DETAIL: Rejecting the failover
> request
>
> 2024-07-03 14:31:19.097: watchdog pid 8620: LOG: failover command
> [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node
> "staging-ha0001:9999 Linux staging-ha0001" is rejected because the watchdog
> cluster does not hold the quorum
>
> 2024-07-03 14:31:19.097: health_check1 pid 8676: LOG: degenerate backend
> request for 1 node(s) from pid [8676], is changed to quarantine node
> request by watchdog
>
> 2024-07-03 14:31:19.097: health_check1 pid 8676: DETAIL: watchdog does not
> holds the quorum
>
> 2024-07-03 14:31:19.097: health_check1 pid 8676: LOG:
> signal_user1_to_parent_with_reason(0)
>
> 2024-07-03 14:31:19.097: main pid 8617: LOG: Pgpool-II parent process
> received SIGUSR1
>
> 2024-07-03 14:31:19.097: main pid 8617: LOG: Pgpool-II parent process has
> received failover request
>
> 2024-07-03 14:31:19.098: watchdog pid 8620: LOG: received the failover
> indication from Pgpool-II on IPC interface
>
> 2024-07-03 14:31:19.098: watchdog pid 8620: LOG: watchdog is informed of
> failover start by the main process
>
> 2024-07-03 14:31:19.098: main pid 8617: LOG: === Starting quarantine.
> shutdown host staging-ha0002(5432) ===
>
> 2024-07-03 14:31:19.098: main pid 8617: LOG: Do not restart children
> because we are switching over node id 1 host: staging-ha0002 port: 5432 and
> we are in streaming replication mode
>
> 2024-07-03 14:31:19.098: main pid 8617: LOG: failover: set new primary
> node: 0
>
> 2024-07-03 14:31:19.098: main pid 8617: LOG: failover: set new main node: 0
>
> 2024-07-03 14:31:19.098: sr_check_worker pid 65605: ERROR: Failed to check
> replication time lag
>
> 2024-07-03 14:31:19.098: sr_check_worker pid 65605: DETAIL: No persistent
> db connection for the node 1
>
> 2024-07-03 14:31:19.098: sr_check_worker pid 65605: HINT: check
> sr_check_user and sr_check_password
>
> 2024-07-03 14:31:19.098: sr_check_worker pid 65605: CONTEXT: while
> checking replication time lag
>
> 2024-07-03 14:31:19.098: sr_check_worker pid 65605: LOG: worker process
> received restart request
>
> 2024-07-03 14:31:19.098: watchdog pid 8620: LOG: received the failover
> indication from Pgpool-II on IPC interface
>
> 2024-07-03 14:31:19.098: watchdog pid 8620: LOG: watchdog is informed of
> failover end by the main process
>
> 2024-07-03 14:31:19.098: main pid 8617: LOG: === Quarantine done. shutdown
> host staging-ha0002(5432) ==
>
>
>
>
>
> 2024-07-03 14:35:59.420: watchdog pid 8620: LOG: new outbound connection
> to staging-ha0002:9000
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: "staging-ha0001:9999
> Linux staging-ha0001" is the coordinator as per our record but
> "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a
> coordinator
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: cluster is in the
> split-brain
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: I am the coordinator but
> "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a
> coordinator
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: trying to figure out
> the best contender for the leader/coordinator node
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: remote
> node:"staging-ha0002:9999 Linux staging-ha0002" should step down from
> leader because we are the older leader
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: We are in split brain,
> and I am the best candidate for leader/coordinator
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: asking the remote node
> "staging-ha0002:9999 Linux staging-ha0002" to step down
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: we have received the NODE
> INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
> was lost
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: we had lost this node
> because of "REPORTED BY LIFECHECK"
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: node:"staging-ha0002:9999
> Linux staging-ha0002" was reported lost by the life-check process
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: node will be added to
> cluster once life-check mark it as reachable again
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: LOG: "staging-ha0001:9999
> Linux staging-ha0001" is the coordinator as per our record but
> "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a
> coordinator
>
> 2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: cluster is in the
> split-brain
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: LOG: I am the coordinator but
> "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a
> coordinator
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL: trying to figure out
> the best contender for the leader/coordinator node
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: LOG: remote
> node:"staging-ha0002:9999 Linux staging-ha0002" should step down from
> leader because we are the older leader
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: LOG: We are in split brain,
> and I am the best candidate for leader/coordinator
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL: asking the remote node
> "staging-ha0002:9999 Linux staging-ha0002" to step down
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: LOG: we have received the NODE
> INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
> was lost
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL: we had lost this node
> because of "REPORTED BY LIFECHECK"
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: LOG: node:"staging-ha0002:9999
> Linux staging-ha0002" was reported lost by the life-check process
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL: node will be added to
> cluster once life-check mark it as reachable again
>
> 2024-07-03 14:35:59.424: watchdog pid 8620: LOG: remote node
> "staging-ha0002:9999 Linux staging-ha0002" is reporting that it has found
> us again
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: LOG: leader/coordinator node
> "staging-ha0002:9999 Linux staging-ha0002" decided to resign from leader,
> probably because of split-brain
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL: It was not our
> coordinator/leader anyway. ignoring the message
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: LOG: we have received the NODE
> INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
> was lost
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL: we had lost this node
> because of "REPORTED BY LIFECHECK"
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: LOG: node:"staging-ha0002:9999
> Linux staging-ha0002" was reported lost by the life-check process
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL: node will be added to
> cluster once life-check mark it as reachable again
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: LOG: we have received the NODE
> INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
> was lost
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL: we had lost this node
> because of "REPORTED BY LIFECHECK"
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: LOG: node:"staging-ha0002:9999
> Linux staging-ha0002" was reported lost by the life-check process
>
> 2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL: node will be added to
> cluster once life-check mark it as reachable again
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: LOG: we have received the NODE
> INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
> was lost
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL: we had lost this node
> because of "REPORTED BY LIFECHECK"
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: LOG: node:"staging-ha0002:9999
> Linux staging-ha0002" was reported lost by the life-check process
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL: node will be added to
> cluster once life-check mark it as reachable again
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: LOG: we have received the NODE
> INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
> was lost
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL: we had lost this node
> because of "REPORTED BY LIFECHECK"
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: LOG: node:"staging-ha0002:9999
> Linux staging-ha0002" was reported lost by the life-check process
>
> 2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL: node will be added to
> cluster once life-check mark it as reachable again
>
> 2024-07-03 14:36:00.213: health_check1 pid 8676: LOG: failed to connect to
> PostgreSQL server on "staging-ha0002:5432", timed out
>
> 2024-07-03 14:36:00.213: health_check1 pid 8676: LOG: health check
> retrying on DB node: 1 (round:3)
>
> 2024-07-03 14:36:01.221: health_check1 pid 8676: LOG: health check
> retrying on DB node: 1 succeeded
>
> 2024-07-03 14:36:01.221: health_check1 pid 8676: LOG: received failback
> request for node_id: 1 from pid [8676]
>
> 2024-07-03 14:36:01.221: health_check1 pid 8676: LOG: failback request
> from pid [8676] is changed to update status request because node_id: 1 was
> quarantined
>
> 2024-07-03 14:36:01.221: health_check1 pid 8676: LOG:
> signal_user1_to_parent_with_reason(0)
>
> 2024-07-03 14:36:01.221: main pid 8617: LOG: Pgpool-II parent process
> received SIGUSR1
>
> 2024-07-03 14:36:01.221: main pid 8617: LOG: Pgpool-II parent process has
> received failover request
>
> 2024-07-03 14:36:01.221: watchdog pid 8620: LOG: received the failover
> indication from Pgpool-II on IPC interface
>
> 2024-07-03 14:36:01.221: watchdog pid 8620: LOG: watchdog is informed of
> failover start by the main process
>
> 2024-07-03 14:36:01.221: watchdog pid 8620: LOG: watchdog is informed of
> failover start by the main process
>
> 2024-07-03 14:36:01.222: main pid 8617: LOG: === Starting fail back.
> reconnect host staging-ha0002(5432) ===
>
> 2024-07-03 14:36:01.222: main pid 8617: LOG: Node 0 is not down (status: 2)
>
> 2024-07-03 14:36:01.222: main pid 8617: LOG: Do not restart children
> because we are failing back node id 1 host: staging-ha0002 port: 5432 and
> we are in streaming replication mode and not all backends were down
>
> 2024-07-03 14:36:01.222: main pid 8617: LOG: failover: set new primary
> node: 0
>
> 2024-07-03 14:36:01.222: main pid 8617: LOG: failover: set new main node: 0
>
> 2024-07-03 14:36:01.222: sr_check_worker pid 66222: LOG: worker process
> received restart request
>
> 2024-07-03 14:36:01.222: watchdog pid 8620: LOG: received the failover
> indication from Pgpool-II on IPC interface
>
> 2024-07-03 14:36:01.222: watchdog pid 8620: LOG: watchdog is informed of
> failover end by the main process
>
> 2024-07-03 14:36:01.222: main pid 8617: LOG: === Failback done. reconnect
> host staging-ha0002(5432) ===
>
>
> *Questions:*
> 1. Regarding point 2 in the observations, why are the connections not going
> to the new primary?
> 2. In this kind of setup, will a transaction split happen when there is a
> network glitch?
>
> If anyone has worked on a similar kind of setup, please share your insights.
> Thank you
>
> Regards
> Mukesh