[pgpool-general: 9095] Re: kind does not match between main(0) slot[0] (52)

Thu May 2 23:02:35 JST 2024

On Thu, 2 May 2024 at 10:23, Bo Peng <pengbo at sraoss.co.jp> wrote:

> Thank you for explaining the test scenario and sharing the configuration
> file.
>
> I tried your test scenario, but I could not reproduce this issue.
>

The failure is not easy to reproduce. So far, I've only seen it once, and
we run these tests many times a week. I've got the feeling it's a race
condition that only manifests itself under very specific circumstances.

> > * At timestamp 04:25:54: Failover all tasks from node 172.29.30.1 to
> > another node (node 2 is the most likely). This consists of first
> > restarting pgpool to force it to drop its leadership status. When pgpool
> > is up and in sync in the cluster, stop and detach the database to force
> > a failover.
>
> At this time, you stopped node0, then node1 became primary.
> It seems that after failover node0 joined cluster again as a standby.
> Did you manually restore it as a standby?
>

By node0, you mean the backend? We restart the VM. This means that
PostgreSQL restarts in the configuration it was in when it was shut down. So
in this case, node0 is a standby for node1 and it will reconnect to node1
and resume streaming replication. Looking at the state it reported, this
seems to work fine.

For Pgpool, we restart in a clean state. We remove the state file and let
pgpool get the information from the other nodes in the cluster. As node1 is
the leader, its status should be what's sent to node0. It could very well
be that this is where the problem lies. The database and pgpool start at
about the same time. So this state transfer happens at the same moment when
the database on node0 resumes its streaming replication and is
automatically re-attached.

> I also noticed that PostgreSQL and Pgpool-II are using same port 5432.
> (I assume PostgreSQL and Pgpool-II are running on the same server.)
>

We run these services in Docker containers, so the port numbers are on the
Docker interface. Each service has its own IP address that it binds to.

> It may not be the cause, but if it's possible, could you try to set pgpool
> to use a different port?
> I also want to check the failover and follow_primary script.
> Could you share your pgpool_failover.sh and pgpool_follow_primary.sh?
>

These scripts are largely based on the examples, but highly modified for
our specific needs. We do not allow direct communication over SSH from
inside the pgpool Docker container, so we instead use a special webhook
service that pgpool can send its commands to. As a result, the scripts are
split up into many smaller scripts, which get executed via this webhook
service.
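
To illustrate the general idea of forwarding a failover to a webhook instead
of using SSH, here is a minimal sketch. It is not a copy of the attached
scripts: the webhook URL, the JSON field names and the function names are
hypothetical, and it assumes failover_command is configured to pass
%d %h %m %H in that order.

```shell
#!/bin/sh
# Sketch only: forward pgpool failover arguments to a webhook service.
# WEBHOOK_URL and the JSON field names are made up for this example.
WEBHOOK_URL="${WEBHOOK_URL:-http://webhook.example.internal:9000/hooks/failover}"

# Build the JSON body from the positional arguments:
#   $1 = failed node id (%d)    $2 = failed host name (%h)
#   $3 = new main node id (%m)  $4 = new main host name (%H)
build_payload() {
    printf '{"failed_node_id":%s,"failed_host":"%s","new_main_node_id":%s,"new_main_host":"%s"}\n' \
        "$1" "$2" "$3" "$4"
}

# POST the payload; --fail makes curl exit non-zero on an HTTP error,
# so pgpool sees the failure in the script's exit status.
send_failover() {
    build_payload "$@" | curl --fail --silent --show-error \
        --header 'Content-Type: application/json' \
        --data @- "$WEBHOOK_URL"
}

# Only contact the webhook when pgpool actually passed the arguments.
if [ "$#" -ge 4 ]; then
    send_failover "$@"
fi
```

The webhook service on the other end is then responsible for running the
actual promote or follow steps on the database host.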

Looking at the status reported by pgpool (and the status reported by our
own tooling), it seems PostgreSQL is running correctly on all three nodes.
node0 and node2 are standby for node1 using streaming replication and in
sync. This is also what's reported by the queries on the database.
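
For reference, the kind of check meant here can be sketched as follows. This
is an illustration rather than our actual tooling: the function names, the
postgres user and passing the hosts as arguments are assumptions. It relies
only on pg_is_in_recovery(), which returns t on a standby and f on a primary.

```shell
#!/bin/sh
# Sketch only: report the role of each backend via pg_is_in_recovery().

# Map the t/f output of pg_is_in_recovery() to a role name.
role_from_recovery_flag() {
    case "$1" in
        t) echo standby ;;
        f) echo primary ;;
        *) echo unknown ;;
    esac
}

# Query one host; -A (unaligned) and -t (tuples only) make psql print a bare t or f.
check_node() {
    flag=$(psql -h "$1" -U postgres -Atc 'SELECT pg_is_in_recovery()' 2>/dev/null)
    echo "$1: $(role_from_recovery_flag "$flag")"
}

# Usage: ./check_roles.sh host1 host2 host3
for host in "$@"; do
    check_node "$host"
done
```

On the primary, SELECT client_addr, state, sync_state FROM pg_stat_replication
additionally lists the connected standbys.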

Best regards,
Emond
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool_follow_primary.sh
Type: application/x-shellscript
Size: 2407 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240502/61efc548/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool_failover.sh
Type: application/x-shellscript
Size: 1146 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240502/61efc548/attachment-0001.bin>