[pgpool-general: 9094] Re: kind does not match between main(0) slot[0] (52)

Bo Peng pengbo at sraoss.co.jp
Thu May 2 17:23:03 JST 2024


Hi,

Thank you for explaining the test scenario and sharing the configuration file.

I tried your test scenario, but I could not reproduce this issue.

> * At timestamp 04:25:54: Failover all tasks from node 172.29.30.1 to
> another node (node 2 is the most likely). This consists of first restarting
> pgpool to force it to drop its leadership status. When pgpool is up and in
> sync in the cluster, stop and detach the database to force a failover.

At this time, you stopped node0, then node1 became primary.
It seems that after the failover node0 joined the cluster again as a standby.
Did you manually restore it as a standby?
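
For reference, a detached node is usually brought back as a standby with
pcp_recovery_node. A minimal example, assuming the default pcp port 9898,
a pcp user named "pgpool" and node id 0 (host and user are placeholders):

  pcp_recovery_node -h localhost -p 9898 -U pgpool -n 0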

I also noticed that PostgreSQL and Pgpool-II are using the same port, 5432.
(I assume PostgreSQL and Pgpool-II are running on the same server.)

It may not be the cause, but if possible, could you try setting Pgpool-II to use a different port (for example, as in the snippet below)?
I would also like to check the failover and follow_primary scripts.
Could you share your pgpool_failover.sh and pgpool_follow_primary.sh?
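
For example, a pgpool.conf fragment like the following (the values are only
placeholders to illustrate separating the ports and pointing at the scripts;
adjust them to your environment):

  port = 9999                       # Pgpool-II listen port (instead of 5432)
  backend_hostname0 = '172.29.30.1'
  backend_port0 = 5432              # PostgreSQL on node0
  failover_command = '/etc/pgpool-II/pgpool_failover.sh %d %h %p %D %m %H %M %P %r %R'
  follow_primary_command = '/etc/pgpool-II/pgpool_follow_primary.sh %d %h %p %D %m %H %M %P %r %R'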

On Tue, 30 Apr 2024 16:39:34 +0200
Emond Papegaaij <emond.papegaaij at gmail.com> wrote:

> Op di 30 apr 2024 om 07:43 schreef Bo Peng <pengbo at sraoss.co.jp>:
> 
> > Hi,
> >
> > > We've noticed a failure in one of our test runs tonight which I can't
> > > explain. During a reboot test of the nodes in the cluster, one of the
> > > pgpool instances (the one with ip 172.29.30.2) starts returning the
> > > following error:
> > > pid 194: ERROR:  unable to read message kind
> > > pid 194: DETAIL:  kind does not match between main(0) slot[0] (52)
> >
> > The error means the responses from the main node and node0 do not match.
> >
> > I checked the logs; they show that node0 is down, but pcp_node_info
> > shows "up".
> >
> > Could you share your pgpool.conf of all pgpool nodes and
> > the test scenario?
> >
> 
> This is our reboot test. It reboots all 3 nodes in the cluster in a
> controlled way:
> * The test starts by restoring a fixed state in a cluster with 3 VMs.
> Node 172.29.30.1 will be the watchdog leader and run the primary database. The
> other nodes are healthy standbys.
> * At timestamp 04:17:32: Node 172.29.30.2 is rebooted first.
> * Wait until node 2 is fully up (this takes about 2:30 minutes after it has
> booted).
> * At timestamp 04:21:55: Node 172.29.30.3 is rebooted next
> * Wait until node 3 is fully up (this again takes about 2:30 minutes after
> it has booted).
> * At timestamp 04:25:54: Failover all tasks from node 172.29.30.1 to
> another node (node 2 is the most likely). This consists of first restarting
> pgpool to force it to drop its leadership status. When pgpool is up and in
> sync in the cluster, stop and detach the database to force a failover.
> * At timestamp 04:26:17: Reboot node 1
> * Wait until all nodes report a fully healthy state.
> 
> As you can see in the log, node 2 starts reporting 'kind does not match' at
> the moment node 1 is in its reboot cycle. The first error is at 04:27:46,
> which matches exactly with the moment pgpool starts back up on node 1. The
> logs from node 1 show pgpool starting and the logs from node 2 show 'new
> watchdog connection' just prior to the first 'kind does not match'.
> 
> I've attached an example pgpool.conf. It's not the exact same version from
> this test, because the test does not export the configuration. All relevant
> settings will be the same, but some names (such as
> backend_application_nameX) will be different. The configuration is
> identical on all nodes, because it is fully managed by configuration
> management.
> 
> Best regards,
> Emond


-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS LLC
TEL: 03-5979-2701 FAX: 03-5979-2702
URL: https://www.sraoss.co.jp/


