<div dir="ltr"><div dir="ltr">On Tue, 30 Apr 2024 at 07:43, Bo Peng &lt;<a href="mailto:pengbo@sraoss.co.jp">pengbo@sraoss.co.jp</a>&gt; wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

<br>

> We've noticed a failure in one of our test runs tonight which I can't<br>

> explain. During a reboot test of the nodes in the cluster, one of the<br>

> pgpool instances (the one with ip 172.29.30.2) starts returning the<br>

> following error:<br>

> pid 194: ERROR:  unable to read message kind<br>

> pid 194: DETAIL:  kind does not match between main(0) slot[0] (52)<br>

<br>

The error means the responses from main node and node0 do not match.<br>

<br>

I checked the logs and the logs show that node0 is down, but pcp_node_info shows "up".<br>

<br>

Could you share your pgpool.conf of all pgpool nodes and<br>

the test scenario?<br></blockquote><div><br></div><div>This is our reboot test. It reboots all 3 nodes in the cluster in a controlled way:</div><div>* The test starts by restoring a fixed state in a cluster with 3 VMs. Node 172.29.30.1 is the watchdog leader and runs the primary database. The other nodes are healthy standbys.</div><div>* At timestamp 04:17:32: Node 172.29.30.2 is rebooted first.</div><div>* Wait until node 2 is fully up (this takes about 2 minutes 30 seconds after it has booted).</div><div>* At timestamp 04:21:55: Node 172.29.30.3 is rebooted next.<br></div><div>* Wait until node 3 is fully up (this again takes about 2 minutes 30 seconds after it has booted).<br></div><div>* At timestamp 04:25:54: Fail over all tasks from node 172.29.30.1 to another node (most likely node 2). This is done by first restarting pgpool to force it to drop its leadership status. Once pgpool is up again and in sync with the cluster, the database is stopped and detached to force a failover.</div><div>* At timestamp 04:26:17: Node 1 is rebooted.</div><div>* Wait until all nodes report a fully healthy state.</div><div><br></div><div>As you can see in the log, node 2 starts reporting 'kind does not match' while node 1 is in its reboot cycle. The first error is at 04:27:46, which coincides exactly with the moment pgpool starts back up on node 1. The logs from node 1 show pgpool starting, and the logs from node 2 show 'new watchdog connection' just before the first 'kind does not match'.</div><div><br></div><div>I've attached an example pgpool.conf. It is not the exact version used in this test, because the test does not export the configuration. All relevant settings will be the same, but some names (such as backend_application_nameX) will be different. The configuration is identical on all nodes, because it is fully managed by configuration management.</div><div><br></div><div>Best regards,</div><div>Emond</div></div></div>