[pgpool-hackers: 4486] Pgpool-4.2 doesn't see primary node of Postgresql15 'stream' cluster

Igor Yurchenko <harry.urcen@gmail.com>
Sun Jul 14 21:51:18 JST 2024


Hi guys,
I've given up, and it looks like I need your help.

The setup:
Two hosts run a PostgreSQL 15 streaming-replication cluster (IPs
10.65.188.55 and 10.65.188.56 respectively), with two Pgpool-II 4.2.15
instances installed on the same hosts.
The pgpool instances manage/delegate a VIP, 10.65.188.59.

Replication between the PG15 instances works well, and the recovery status
of each instance is detected reliably:

[root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.55 -p 5432 template1 -c 'select pg_is_in_recovery()'
 pg_is_in_recovery
-------------------
 f
(1 row)

[root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.56 -p 5432 template1 -c 'select pg_is_in_recovery()'
 pg_is_in_recovery
-------------------
 t
(1 row)

[root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.55 -p 5432 template1 -c 'select * from pg_stat_replication'
  pid  | usesysid | usename | application_name | client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
-------+----------+---------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 13875 |    16384 | fabrix  | walreceiver      | 10.65.188.56 |                 |       47636 | 2024-07-14 14:30:54.344682+03 |          743 | streaming | 0/B0001C0 | 0/B0001C0 | 0/B0001C0 | 0/B0001C0  |           |           |            |             0 | async      | 2024-07-14 14:48:41.550424+03
(1 row)

[root@pg-mgrdb1 pgpool-II]#

The databases of the PG15 cluster are reachable/available via the VIP:

[root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.59 -p 9999 manager -c '\dt'
                 List of relations
 Schema |         Name          | Type  |  Owner
--------+-----------------------+-------+----------
 public | upgrade_version_table | table | postgres
(1 row)

[root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.59 -p 9999 manager -c 'select * from upgrade_version_table'
    app_name    | cur_version | upgrade_to
----------------+-------------+------------
 brODtv Manager | 10.2.0.300  |
(1 row)

[root@pg-mgrdb1 pgpool-II]#

But pgpool cannot detect the cluster's primary node, and it reports
incorrect pool_nodes info:

[root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.59 -p 9999 manager -c 'show pool_nodes'
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 10.65.188.55 | 5432 | down   | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2024-07-14 15:04:01
 1       | 10.65.188.56 | 5432 | up     | 0.500000  | standby | 3          | true              | 0                 |                   |                        | 2024-07-14 15:04:01
(2 rows)

[root@pg-mgrdb1 pgpool-II]#

Right now I'm testing a "single pgpool up" configuration (the pgpool on the
second host is down). The pgpool log shows that pgpool doesn't even try to
touch the first (currently primary) backend at 10.65.188.55:5432. It
successfully connects to the second, 'standby' node and detects its status,
and that second node handles all SELECT queries successfully thanks to the
load balancing configured. But what's wrong with the first, primary node?
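For my own understanding, here is roughly how I read find_primary_node's job (this is a sketch of my reading, not pgpool's actual code): poll each attached backend with pg_is_in_recovery(), and the one that answers false is the primary. If node 0 is never polled because it is considered detached, no primary can ever be found:

```python
# Sketch of my understanding of pgpool's primary detection (not pgpool's
# actual code): among the backends pgpool considers attached ("up"), the
# one that is NOT in recovery is the primary.

def choose_primary(node_status):
    """node_status: {node_id: {"attached": bool, "in_recovery": bool}}.
    Returns the primary's node_id, or None if no attached node is primary."""
    for node_id, st in sorted(node_status.items()):
        if st["attached"] and not st["in_recovery"]:
            return node_id
    return None

# My situation: node 0 (the real primary) is detached, node 1 is a standby,
# so the search loops forever without finding a primary.
print(choose_primary({0: {"attached": False, "in_recovery": False},
                      1: {"attached": True, "in_recovery": True}}))   # None
print(choose_primary({0: {"attached": True, "in_recovery": False},
                      1: {"attached": True, "in_recovery": True}}))   # 0
```

That would match the endless "find_primary_node: standby node is 1" lines below, since node 1 is the only node being examined.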

In the pgpool log I don't see any attempt to connect to the first node. The
piece of the log where "find_primary_node" is first mentioned looks like this:

2024-07-14 15:23:38: pid 18723: LOG:  2 watchdog nodes are configured for lifecheck
2024-07-14 15:23:38: pid 18723: LOCATION:  wd_lifecheck.c:495
2024-07-14 15:23:38: pid 18723: LOG:  watchdog nodes ID:0 Name:"10.65.188.55:9999 Linux pg-mgrdb1"
2024-07-14 15:23:38: pid 18723: DETAIL:  Host:"10.65.188.55" WD Port:9000 pgpool-II port:9999
2024-07-14 15:23:38: pid 18723: LOCATION:  wd_lifecheck.c:503
2024-07-14 15:23:38: pid 18723: LOG:  watchdog nodes ID:1 Name:"Not_Set"
2024-07-14 15:23:38: pid 18723: DETAIL:  Host:"10.65.188.56" WD Port:9000 pgpool-II port:9999
2024-07-14 15:23:38: pid 18723: LOCATION:  wd_lifecheck.c:503
2024-07-14 15:23:38: pid 18706: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
2024-07-14 15:23:38: pid 18706: LOCATION:  pgpool_main.c:3404
2024-07-14 15:23:38: pid 18706: LOG:  find_primary_node: standby node is 1
2024-07-14 15:23:38: pid 18706: LOCATION:  pgpool_main.c:3329
2024-07-14 15:23:39: pid 18729: LOG:  set SO_REUSEPORT option to the socket
2024-07-14 15:23:39: pid 18729: LOCATION:  wd_heartbeat.c:691
2024-07-14 15:23:39: pid 18729: LOG:  creating watchdog heartbeat receive socket.
2024-07-14 15:23:39: pid 18729: DETAIL:  set SO_REUSEPORT
2024-07-14 15:23:39: pid 18729: LOCATION:  wd_heartbeat.c:231
2024-07-14 15:23:39: pid 18731: LOG:  set SO_REUSEPORT option to the socket
2024-07-14 15:23:39: pid 18731: LOCATION:  wd_heartbeat.c:691
2024-07-14 15:23:39: pid 18731: LOG:  creating socket for sending heartbeat
2024-07-14 15:23:39: pid 18731: DETAIL:  set SO_REUSEPORT
2024-07-14 15:23:39: pid 18731: LOCATION:  wd_heartbeat.c:148
2024-07-14 15:23:39: pid 18706: LOG:  find_primary_node: standby node is 1
2024-07-14 15:23:39: pid 18706: LOCATION:  pgpool_main.c:3329
2024-07-14 15:23:40: pid 18706: LOG:  find_primary_node: standby node is 1
2024-07-14 15:23:40: pid 18706: LOCATION:  pgpool_main.c:3329
2024-07-14 15:23:41: pid 18706: LOG:  find_primary_node: standby node is 1
2024-07-14 15:23:41: pid 18706: LOCATION:  pgpool_main.c:3329
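If I understand correctly, pgpool also persists backend status across restarts in a pgpool_status file under logdir (so /var/log/pgpool_status in my case), and a cached "down" for node 0 there would explain why it is never probed. A toy sketch of how I read that file (the plain-text one-word-per-backend format is my assumption):

```python
# Sketch: read a pgpool_status-style file -- I assume one word per backend
# line ("up", "down", or "unused") -- and report which backends pgpool
# would consider detached on startup.

def detached_backends(lines):
    statuses = [ln.strip().lower() for ln in lines if ln.strip()]
    return [i for i, st in enumerate(statuses) if st == "down"]

# What I suspect my /var/log/pgpool_status holds: node 0 cached as down.
sample = ["down\n", "up\n"]
print(detached_backends(sample))     # [0] -> node 0 stays detached
```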

So I'm out of ideas at this point.

The pgpool.conf, with comments stripped and keys sorted, looks like this:
*********************************************************************************************
allow_clear_text_frontend_auth=off
allow_multiple_failover_requests_from_node=off
allow_sql_comments=off
app_name_redirect_preference_list=''
arping_cmd='/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I eth0'
arping_path='/usr/sbin'
authentication_timeout=1min
auto_failback_interval=1min
auto_failback=off
backend_application_name0=mgrdb1
backend_application_name1=mgrdb2
backend_clustering_mode='streaming_replication'
backend_data_directory0='/home/postgres/databases/fabrix2'
backend_data_directory1='/home/postgres/databases/fabrix2'
backend_flag0='ALLOW_TO_FAILOVER'
backend_flag1='ALLOW_TO_FAILOVER'
backend_hostname0='10.65.188.55'
backend_hostname1='10.65.188.56'
backend_port0=5432
backend_port1=5432
backend_weight0=1
backend_weight1=1
black_function_list='currval,lastval,nextval,setval'
cache_safe_memqcache_table_list=''
cache_unsafe_memqcache_table_list=''
check_temp_table=catalog
check_unlogged_table=on
child_life_time=1
child_max_connections=0
clear_memqcache_on_escalation=on
client_idle_limit=0
client_idle_limit_in_recovery=0
client_min_messages = debug1
connection_cache=on
connection_life_time=0
connect_timeout=10000
database_redirect_preference_list=''
delay_threshold=10000000
delegate_IP='10.65.188.59'
detach_false_primary=off
disable_load_balance_on_write='transaction'
dml_adaptive_object_relationship_list=''
enable_consensus_with_half_votes=on
enable_pool_hba=on
enable_shared_relcache=on
failback_command=''
failover_command='/etc/pgpool-II/recovery/failover.sh %d %P %H %R'
failover_if_affected_tuples_mismatch=off
failover_on_backend_error=on
failover_require_consensus=on
failover_when_quorum_exists=off
follow_master_command=''
follow_primary_command=''
health_check_database=''
health_check_max_retries=0
health_check_password='***************'
health_check_period=1
health_check_retry_delay=1
health_check_timeout=0
health_check_user='postgres'
heartbeat_device0='eth0'
heartbeat_device1='eth0'
heartbeat_hostname0='10.65.188.55'
heartbeat_hostname1='10.65.188.56'
heartbeat_port0=9694
heartbeat_port1=9694
hostname0='10.65.188.55'
hostname1='10.65.188.56'
if_cmd_path='/sbin'
if_down_cmd='/usr/bin/sudo /sbin/ip addr del $_IP_$/18 dev eth0'
if_up_cmd='/usr/bin/sudo /sbin/ip addr add $_IP_$/18 dev eth0 label eth0:pg'
ignore_leading_white_space=on
insert_lock=off
listen_addresses='*'
listen_backlog_multiplier=1
load_balance_mode=off
lobj_lock_table=''
log_client_messages=off
log_connections=off
log_destination='stderr'
log_directory='/var/log/pgpool_log'
logdir='/var/log'
log_disconnections=off
log_error_verbosity = verbose
log_filename='pgpool-%a.log'
logging_collector=on
log_hostname=off
log_line_prefix='%t: pid %p: '   # printf-style string to output at beginning of each log line
log_min_messages = debug1
log_per_node_statement=off
log_rotation_age=1d
log_rotation_size=0
log_standby_delay='if_over_threshold'
log_statement=off
log_truncate_on_rotation=on
master_slave_mode=on
master_slave_sub_mode='stream'
max_pool=1
memory_cache_enabled=off
memqcache_auto_cache_invalidation=on
memqcache_cache_block_size=1MB
memqcache_expire=0
memqcache_maxcache=400kB
memqcache_max_num_cache=1000000
memqcache_memcached_host='localhost'
memqcache_memcached_port=11211
memqcache_method='shmem'
memqcache_oiddir='/var/log/pgpool/oiddir'
memqcache_total_size=64MB
num_init_children=160
pcp_listen_addresses='*'
pcp_port=9898
pcp_socket_dir='/var/run/postgresql'
pgpool_port0=9999
pgpool_port1=9999
pid_file_name='/var/run/pgpool.pid'
ping_path='/bin'
pool_passwd='pool_passwd'
port=9999
primary_routing_query_pattern_list=''
read_only_function_list=''
recovery_1st_stage_command='recovery_1st_stage'
recovery_2nd_stage_command=''
recovery_password='**************'
recovery_timeout=90
recovery_user='postgres'
relcache_expire=0
relcache_query_target=primary
relcache_size=256
replicate_select=off
replication_mode=off
replication_stop_on_mismatch=off
reserved_connections=0
reset_query_list='ABORT; DISCARD ALL'
search_primary_node_timeout=0
serialize_accept=off
socket_dir='/var/run/postgresql'
sr_check_database='postgres'
sr_check_password='fabrix'
sr_check_period=1
sr_check_user='****************'
ssl_ciphers='HIGH:MEDIUM:+3DES:!aNULL'
ssl_dh_params_file=''
ssl_ecdh_curve='prime256v1'
ssl=off
ssl_prefer_server_ciphers=off
statement_level_load_balance=off
syslog_facility='LOCAL0'
syslog_ident='pgpool'
trusted_servers=''
use_watchdog=on
wd_authkey=''
wd_de_escalation_command=''
wd_escalation_command=''
wd_heartbeat_deadtime=30
wd_heartbeat_keepalive=2
wd_interval=10
wd_ipc_socket_dir='/var/run/postgresql'
wd_lifecheck_dbname='template1'
wd_lifecheck_method='heartbeat'
wd_lifecheck_password=''
wd_lifecheck_query='SELECT 1'
wd_lifecheck_user='nobody'
wd_life_point=3
wd_monitoring_interfaces_list=''
wd_port0=9000
wd_port1=9000
wd_priority=1
write_function_list=''
*********************************************************************************************
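Since the file is long, here is the little helper I used to pull out just the keys relevant to backend/primary detection (plain key=value parsing of the dump above; quoting rules deliberately simplified):

```python
# Extract the settings relevant to backend/primary detection from a sorted
# key=value dump like the one above. Simplified parsing: one assignment per
# line, no inline comments, single quotes stripped.
RELEVANT = ("backend_", "sr_check_", "health_check_", "search_primary",
            "failover_", "detach_false_primary")

def relevant_settings(text):
    out = {}
    for ln in text.splitlines():
        key, sep, val = ln.partition("=")
        key = key.strip()
        if sep and key.startswith(RELEVANT):
            out[key] = val.strip().strip("'")
    return out

conf = "backend_hostname0='10.65.188.55'\nsr_check_period=1\nport=9999"
print(relevant_settings(conf))
# {'backend_hostname0': '10.65.188.55', 'sr_check_period': '1'}
```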

pool_hba.conf:
*********************************************************************************************
[root@pg-mgrdb1 pgpool_log]# cat /etc/pgpool-II/pool_hba.conf
# TYPE  DATABASE        USER    CIDR-ADDRESS    METHOD
local   all     all     trust
## IPv4 local connections
host    all     all     127.0.0.1/32    md5
host    all     postgres        10.65.188.55/32 md5
host    all     postgres        10.65.188.56/32 md5
host    all     postgres        10.65.188.59/32 md5
host    all     fabrix  10.65.188.55/32 md5
host    all     fabrix  10.65.188.56/32 md5
host    all     fabrix  10.65.188.59/32 md5
## Replication Hosts
host    manager fabrix  10.65.188.55/32 md5
host    manager fabrix  10.65.188.56/32 md5
host    manager fabrix  10.65.188.59/32 md5
[root@pg-mgrdb1 pgpool_log]#
*********************************************************************************************

pool_passwd:
*********************************************************************************************
[root@pg-mgrdb1 pgpool_log]# cat /etc/pgpool-II/pool_passwd
postgres:AESdLGx72XCAdQUybSno/SFuw==
fabrix:AES9Zd8L9XYGiZ5RMycfSFBFw==
[root@pg-mgrdb1 pgpool_log]#
*********************************************************************************************
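For completeness, here is the quick sanity check I run on those pool_passwd entries (just the user:AES&lt;base64&gt; shape, which is my reading of my own file, not a spec; as far as I know, AES-encrypted entries additionally require the .pgpoolkey decryption key to be readable by pgpool):

```python
import base64

# Sanity-check pool_passwd lines: each should look like "user:AES<base64>".
# (That shape is my reading of my own file, not an official spec.)
def check_pool_passwd(lines):
    bad = []
    for ln in lines:
        ln = ln.strip()
        if not ln:
            continue
        user, sep, enc = ln.partition(":")
        if not sep or not enc.startswith("AES"):
            bad.append(ln)
            continue
        try:
            base64.b64decode(enc[3:], validate=True)  # payload must be base64
        except Exception:
            bad.append(ln)
    return bad

sample = ["postgres:AESdLGx72XCAdQUybSno/SFuw==\n",
          "fabrix:AES9Zd8L9XYGiZ5RMycfSFBFw==\n"]
print(check_pool_passwd(sample))   # [] -> both entries look well-formed
```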

I'd highly appreciate any ideas or hints for further investigation.

BR
Igor Yurchenko