[pgpool-hackers: 3510] Quarantine state in native replication mode is dangerous
Tatsuo Ishii
ishii at sraoss.co.jp
Thu Feb 13 08:15:34 JST 2020
Usama,
I think the quarantine state in native replication mode could cause data
inconsistency. Below are the steps to reproduce the problem.
# create a cluster of 3 Pgpool-II nodes + 2 PostgreSQL nodes in native
# replication mode. Note that Pgpool-II is compiled with
# HEALTHCHECK_DEBUG=1 and WATCHDOG_DEBUG=1 (make HEALTHCHECK_DEBUG=1 WATCHDOG_DEBUG=1)
$ watchdog_setup -wn 3 -n 2 -m r
# start the cluster
./startall
./pcp_watchdog_info -v -p 50001
Watchdog Cluster Information
Total Nodes : 3
Remote Nodes : 2
Quorum state : QUORUM EXIST
Alive Remote Nodes : 2
VIP up on local node : YES
Master Node Name : localhost:50000 Linux tishii-CFSV7-1
Master Host Name : localhost
Watchdog Node Information
Node Name : localhost:50000 Linux tishii-CFSV7-1
Host Name : localhost
Delegate IP : Not_Set
Pgpool port : 50000
Watchdog port : 50002
Node priority : 3
Status : 4
Status Name : MASTER
Node Name : localhost:50004 Linux tishii-CFSV7-1
Host Name : localhost
Delegate IP : Not_Set
Pgpool port : 50004
Watchdog port : 50006
Node priority : 2
Status : 7
Status Name : STANDBY
Node Name : localhost:50008 Linux tishii-CFSV7-1
Host Name : localhost
Delegate IP : Not_Set
Pgpool port : 50008
Watchdog port : 50010
Node priority : 1
Status : 7
Status Name : STANDBY
$ psql -p 50000 -c "show pool_nodes" test
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+----------+-------+--------+-----------+--------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0 | /tmp | 51000 | up | 0.500000 | master | 0 | true | 0 | | | 2020-02-13 07:58:54
1 | /tmp | 51001 | up | 0.500000 | slave | 0 | false | 0 | | | 2020-02-13 07:58:54
(2 rows)
# create an artificial failure of PostgreSQL node 1 as seen from pgpool0
echo "1 down" > pgpool0/log/backend_down_request
# make sure that node 1 goes into quarantine state on pgpool0
$ psql -p 50000 -c "show pool_nodes" test
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+----------+-------+------------+-----------+--------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0 | /tmp | 51000 | up | 0.500000 | master | 0 | true | 0 | | | 2020-02-13 08:01:37
1 | /tmp | 51001 | quarantine | 0.500000 | slave | 0 | false | 0 | | | 2020-02-13 08:01:40
(2 rows)
# modify the database via pgpool0
$ psql -p 50000 test
psql (12.0)
Type "help" for help.
test=# create table t1(i int);
CREATE TABLE
test=# insert into t1 values(1);
INSERT 0 1
test=# \q
# check database consistency by connecting to each PostgreSQL node
# directly (node 0 first, then the quarantined node 1)
$ psql -p 51000 test
psql (12.0)
Type "help" for help.
test=# select * from t1;
i
---
1
(1 row)
test=# \q
$ psql -p 51001 test
psql (12.0)
Type "help" for help.
test=# select * from t1;
ERROR: relation "t1" does not exist
LINE 1: select * from t1;
^
Now node 0 and node 1 are in an inconsistent state: writes issued while
node 1 was quarantined were applied only to node 0, and node 1 is never
recovered automatically because no failover was performed.
Probably we should not allow failover_require_consensus to be set to on
in native replication mode, or at least add a strong warning in the
documentation not to do so. What do you think?
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp