[pgpool-hackers: 4487] Re: Pgpool-4.2 doesn't see primary node of Postgresql15 'stream' cluster

Tatsuo Ishii ishii at postgresql.org
Mon Jul 15 12:50:55 JST 2024


Hi,

My guess is that the pgpool_status file (which should be placed under
logdir) is set to:

down
up

i.e. pgpool thinks that the PostgreSQL primary node is detached from
pgpool. If you are sure that the PostgreSQL primary is healthy, you
can attach the primary node by using pcp_attach_node. An alternative
method is to stop pgpool, remove the pgpool_status file, and restart
pgpool. pgpool will automatically create the pgpool_status file and
mark the primary up if it is up and running.
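
A rough sketch of both approaches (assuming the PCP port 9898 and the
detached primary being node 0 as in your posted configuration, a PCP
user registered in pcp.conf, and pgpool_status under logdir='/var/log'
as in your pgpool.conf):

    # re-attach node 0 (the primary) via PCP
    # (replace 'postgres' with a user registered in pcp.conf)
    pcp_attach_node -h localhost -p 9898 -U postgres -n 0

or:

    # stop pgpool, remove the status file, then start pgpool again
    pgpool -m fast stop
    rm /var/log/pgpool_status
    pgpool

Adjust the PCP user and the way you stop/start pgpool (e.g. a systemd
unit) to match your installation.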

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp


> Hi guys
> I have given up and it looks like I need your help.
> 
> The setup:
> I have a PostgreSQL 15 streaming-replication cluster on 2 hosts (IPs
> 10.65.188.55 and 10.65.188.56 respectively), with two Pgpool 4.2.15
> instances installed on the same hosts.
> The pgpool instances manage/delegate the VIP 10.65.188.59.
> 
> Replication between the pg15 instances works well, and the recovery
> status of the instances is detected reliably:
> 
> [root at pg-mgrdb1 pgpool-II]#  PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U
> fabrix -h 10.65.188.55 -p 5432 template1 -c 'select pg_is_in_recovery()'
>  pg_is_in_recovery
> -------------------
>  f
> (1 row)
> 
> [root at pg-mgrdb1 pgpool-II]#  PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U
> fabrix -h 10.65.188.56 -p 5432 template1 -c 'select pg_is_in_recovery()'
>  pg_is_in_recovery
> -------------------
>  t
> (1 row)
> 
> [root at pg-mgrdb1 pgpool-II]#  PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U
> fabrix -h 10.65.188.55 -p 5432 template1 -c 'select * from
> pg_stat_replication'
> -[ RECORD 1 ]----+------------------------------
> pid              | 13875
> usesysid         | 16384
> usename          | fabrix
> application_name | walreceiver
> client_addr      | 10.65.188.56
> client_hostname  |
> client_port      | 47636
> backend_start    | 2024-07-14 14:30:54.344682+03
> backend_xmin     | 743
> state            | streaming
> sent_lsn         | 0/B0001C0
> write_lsn        | 0/B0001C0
> flush_lsn        | 0/B0001C0
> replay_lsn       | 0/B0001C0
> write_lag        |
> flush_lag        |
> replay_lag       |
> sync_priority    | 0
> sync_state       | async
> reply_time       | 2024-07-14 14:48:41.550424+03
> (1 row)
> 
> [root at pg-mgrdb1 pgpool-II]#
> 
> The databases of the pg15 cluster are reachable/available via VIP:
> 
> [root at pg-mgrdb1 pgpool-II]#
> [root at pg-mgrdb1 pgpool-II]#  PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U
> fabrix -h 10.65.188.59 -p 9999 manager -c '\dt'
>                  List of relations
>  Schema |         Name          | Type  |  Owner
> --------+-----------------------+-------+----------
>  public | upgrade_version_table | table | postgres
> (1 row)
> 
> [root at pg-mgrdb1 pgpool-II]#  PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U
> fabrix -h 10.65.188.59 -p 9999 manager -c 'select * from
> upgrade_version_table'
>     app_name    | cur_version | upgrade_to
> ----------------+-------------+------------
>  brODtv Manager | 10.2.0.300  |
> (1 row)
> 
> [root at pg-mgrdb1 pgpool-II]#
> 
> But pgpool cannot detect the cluster's primary node and reports
> incorrect pool_nodes info:
> 
> [root at pg-mgrdb1 pgpool-II]#  PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U
> fabrix -h 10.65.188.59 -p 9999 manager -c 'show pool_nodes'
> -[ RECORD 1 ]----------+---------------------
> node_id                | 0
> hostname               | 10.65.188.55
> port                   | 5432
> status                 | down
> lb_weight              | 0.500000
> role                   | standby
> select_cnt             | 0
> load_balance_node      | false
> replication_delay      | 0
> replication_state      |
> replication_sync_state |
> last_status_change     | 2024-07-14 15:04:01
> -[ RECORD 2 ]----------+---------------------
> node_id                | 1
> hostname               | 10.65.188.56
> port                   | 5432
> status                 | up
> lb_weight              | 0.500000
> role                   | standby
> select_cnt             | 3
> load_balance_node      | true
> replication_delay      | 0
> replication_state      |
> replication_sync_state |
> last_status_change     | 2024-07-14 15:04:01
> (2 rows)
> 
> [root at pg-mgrdb1 pgpool-II]#
> 
> Right now, I'm testing with a "single pgpool up" configuration (the
> pgpool on the second host is down). The pgpool log shows that pgpool
> doesn't even try to touch the first (current primary) backend at
> 10.65.188.55:5432. It successfully connects to the second 'standby'
> node and detects its status, and that second node successfully handles
> all SELECT queries thanks to the configured load balancing. But what's
> wrong with the first (primary) node?
> 
> In the pgpool log I don't see any attempts to connect to the first
> node. The part of the log where "find_primary_node" is first mentioned
> looks like this:
> 
> 2024-07-14 15:23:38: pid 18723: LOG:  2 watchdog nodes are configured for
> lifecheck
> 2024-07-14 15:23:38: pid 18723: LOCATION:  wd_lifecheck.c:495
> 2024-07-14 15:23:38: pid 18723: LOG:  watchdog nodes ID:0 Name:"
> 10.65.188.55:9999 Linux pg-mgrdb1"
> 2024-07-14 15:23:38: pid 18723: DETAIL:  Host:"10.65.188.55" WD Port:9000
> pgpool-II port:9999
> 2024-07-14 15:23:38: pid 18723: LOCATION:  wd_lifecheck.c:503
> 2024-07-14 15:23:38: pid 18723: LOG:  watchdog nodes ID:1 Name:"Not_Set"
> 2024-07-14 15:23:38: pid 18723: DETAIL:  Host:"10.65.188.56" WD Port:9000
> pgpool-II port:9999
> 2024-07-14 15:23:38: pid 18723: LOCATION:  wd_lifecheck.c:503
> 2024-07-14 15:23:38: pid 18706: LOG:  find_primary_node_repeatedly: waiting
> for finding a primary node
> 2024-07-14 15:23:38: pid 18706: LOCATION:  pgpool_main.c:3404
> 2024-07-14 15:23:38: pid 18706: LOG:  find_primary_node: standby node is 1
> 2024-07-14 15:23:38: pid 18706: LOCATION:  pgpool_main.c:3329
> 2024-07-14 15:23:39: pid 18729: LOG:  set SO_REUSEPORT option to the socket
> 2024-07-14 15:23:39: pid 18729: LOCATION:  wd_heartbeat.c:691
> 2024-07-14 15:23:39: pid 18729: LOG:  creating watchdog heartbeat receive
> socket.
> 2024-07-14 15:23:39: pid 18729: DETAIL:  set SO_REUSEPORT
> 2024-07-14 15:23:39: pid 18729: LOCATION:  wd_heartbeat.c:231
> 2024-07-14 15:23:39: pid 18731: LOG:  set SO_REUSEPORT option to the socket
> 2024-07-14 15:23:39: pid 18731: LOCATION:  wd_heartbeat.c:691
> 2024-07-14 15:23:39: pid 18731: LOG:  creating socket for sending heartbeat
> 2024-07-14 15:23:39: pid 18731: DETAIL:  set SO_REUSEPORT
> 2024-07-14 15:23:39: pid 18731: LOCATION:  wd_heartbeat.c:148
> 2024-07-14 15:23:39: pid 18706: LOG:  find_primary_node: standby node is 1
> 2024-07-14 15:23:39: pid 18706: LOCATION:  pgpool_main.c:3329
> 2024-07-14 15:23:40: pid 18706: LOG:  find_primary_node: standby node is 1
> 2024-07-14 15:23:40: pid 18706: LOCATION:  pgpool_main.c:3329
> 2024-07-14 15:23:41: pid 18706: LOG:  find_primary_node: standby node is 1
> 2024-07-14 15:23:41: pid 18706: LOCATION:  pgpool_main.c:3329
> 
> So I am out of ideas.
> 
> The pgpool.conf, stripped of comments and sorted, looks like this:
> *********************************************************************************************
> allow_clear_text_frontend_auth=off
> allow_multiple_failover_requests_from_node=off
> allow_sql_comments=off
> app_name_redirect_preference_list=''
> arping_cmd='/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I eth0'
> arping_path='/usr/sbin'
> authentication_timeout=1min
> auto_failback_interval=1min
> auto_failback=off
> backend_application_name0=mgrdb1
> backend_application_name1=mgrdb2
> backend_clustering_mode='streaming_replication'
> backend_data_directory0='/home/postgres/databases/fabrix2'
> backend_data_directory1='/home/postgres/databases/fabrix2'
> backend_flag0='ALLOW_TO_FAILOVER'
> backend_flag1='ALLOW_TO_FAILOVER'
> backend_hostname0='10.65.188.55'
> backend_hostname1='10.65.188.56'
> backend_port0=5432
> backend_port1=5432
> backend_weight0=1
> backend_weight1=1
> black_function_list='currval,lastval,nextval,setval'
> cache_safe_memqcache_table_list=''
> cache_unsafe_memqcache_table_list=''
> check_temp_table=catalog
> check_unlogged_table=on
> child_life_time=1
> child_max_connections=0
> clear_memqcache_on_escalation=on
> client_idle_limit=0
> client_idle_limit_in_recovery=0
> client_min_messages = debug1
> connection_cache=on
> connection_life_time=0
> connect_timeout=10000
> database_redirect_preference_list=''
> delay_threshold=10000000
> delegate_IP='10.65.188.59'
> detach_false_primary=off
> disable_load_balance_on_write='transaction'
> dml_adaptive_object_relationship_list=''
> enable_consensus_with_half_votes=on
> enable_pool_hba=on
> enable_shared_relcache=on
> failback_command=''
> failover_command='/etc/pgpool-II/recovery/failover.sh %d %P %H %R'
> failover_if_affected_tuples_mismatch=off
> failover_on_backend_error=on
> failover_require_consensus=on
> failover_when_quorum_exists=off
> follow_master_command=''
> follow_primary_command=''
> health_check_database=''
> health_check_max_retries=0
> health_check_password='***************'
> health_check_period=1
> health_check_retry_delay=1
> health_check_timeout=0
> health_check_user='postgres'
> heartbeat_device0='eth0'
> heartbeat_device1='eth0'
> heartbeat_hostname0='10.65.188.55'
> heartbeat_hostname1='10.65.188.56'
> heartbeat_port0=9694
> heartbeat_port1=9694
> hostname0='10.65.188.55'
> hostname1='10.65.188.56'
> if_cmd_path='/sbin'
> if_down_cmd='/usr/bin/sudo /sbin/ip addr del $_IP_$/18 dev eth0'
> if_up_cmd='/usr/bin/sudo /sbin/ip addr add $_IP_$/18 dev eth0 label eth0:pg'
> ignore_leading_white_space=on
> insert_lock=off
> listen_addresses='*'
> listen_backlog_multiplier=1
> load_balance_mode=off
> lobj_lock_table=''
> log_client_messages=off
> log_connections=off
> log_destination='stderr'
> log_directory='/var/log/pgpool_log'
> logdir='/var/log'
> log_disconnections=off
> log_error_verbosity = verbose
> log_filename='pgpool-%a.log'
> logging_collector=on
> log_hostname=off
> log_line_prefix='%t: pid %p: '# printf-style string to output at beginning
> of each log line.
> log_min_messages = debug1
> log_per_node_statement=off
> log_rotation_age=1d
> log_rotation_size=0
> log_standby_delay='if_over_threshold'
> log_statement=off
> log_truncate_on_rotation=on
> master_slave_mode=on
> master_slave_sub_mode='stream'
> max_pool=1
> memory_cache_enabled=off
> memqcache_auto_cache_invalidation=on
> memqcache_cache_block_size=1MB
> memqcache_expire=0
> memqcache_maxcache=400kB
> memqcache_max_num_cache=1000000
> memqcache_memcached_host='localhost'
> memqcache_memcached_port=11211
> memqcache_method='shmem'
> memqcache_oiddir='/var/log/pgpool/oiddir'
> memqcache_total_size=64MB
> num_init_children=160
> pcp_listen_addresses='*'
> pcp_port=9898
> pcp_socket_dir='/var/run/postgresql'
> pgpool_port0=9999
> pgpool_port1=9999
> pid_file_name='/var/run/pgpool.pid'
> ping_path='/bin'
> pool_passwd='pool_passwd'
> port=9999
> primary_routing_query_pattern_list=''
> read_only_function_list=''
> recovery_1st_stage_command='recovery_1st_stage'
> recovery_2nd_stage_command=''
> recovery_password='**************'
> recovery_timeout=90
> recovery_user='postgres'
> relcache_expire=0
> relcache_query_target=primary
> relcache_size=256
> replicate_select=off
> replication_mode=off
> replication_stop_on_mismatch=off
> reserved_connections=0
> reset_query_list='ABORT; DISCARD ALL'
> search_primary_node_timeout=0
> serialize_accept=off
> socket_dir='/var/run/postgresql'
> sr_check_database='postgres'
> sr_check_password='fabrix'
> sr_check_period=1
> sr_check_user='****************'
> ssl_ciphers='HIGH:MEDIUM:+3DES:!aNULL'
> ssl_dh_params_file=''
> ssl_ecdh_curve='prime256v1'
> ssl=off
> ssl_prefer_server_ciphers=off
> statement_level_load_balance=off
> syslog_facility='LOCAL0'
> syslog_ident='pgpool'
> trusted_servers=''
> use_watchdog=on
> wd_authkey=''
> wd_de_escalation_command=''
> wd_escalation_command=''
> wd_heartbeat_deadtime=30
> wd_heartbeat_keepalive=2
> wd_interval=10
> wd_ipc_socket_dir='/var/run/postgresql'
> wd_lifecheck_dbname='template1'
> wd_lifecheck_method='heartbeat'
> wd_lifecheck_password=''
> wd_lifecheck_query='SELECT 1'
> wd_lifecheck_user='nobody'
> wd_life_point=3
> wd_monitoring_interfaces_list=''
> wd_port0=9000
> wd_port1=9000
> wd_priority=1
> write_function_list=''
> *********************************************************************************************
> 
> pool_hba.conf:
> *********************************************************************************************
> [root at pg-mgrdb1 pgpool_log]# cat /etc/pgpool-II/pool_hba.conf
> # TYPE  DATABASE        USER    CIDR-ADDRESS    METHOD
> local   all     all     trust
> ## IPv4 local connections
> host    all     all     127.0.0.1/32    md5
> host    all     postgres        10.65.188.55/32 md5
> host    all     postgres        10.65.188.56/32 md5
> host    all     postgres        10.65.188.59/32 md5
> host    all     fabrix  10.65.188.55/32 md5
> host    all     fabrix  10.65.188.56/32 md5
> host    all     fabrix  10.65.188.59/32 md5
> ## Replication Hosts
> host    manager fabrix  10.65.188.55/32 md5
> host    manager fabrix  10.65.188.56/32 md5
> host    manager fabrix  10.65.188.59/32 md5
> [root at pg-mgrdb1 pgpool_log]#
> *********************************************************************************************
> 
> pool_passwd:
> *********************************************************************************************
> [root at pg-mgrdb1 pgpool_log]# cat /etc/pgpool-II/pool_passwd
> postgres:AESdLGx72XCAdQUybSno/SFuw==
> fabrix:AES9Zd8L9XYGiZ5RMycfSFBFw==
> [root at pg-mgrdb1 pgpool_log]#
> *********************************************************************************************
> 
> I'd highly appreciate any ideas or hints for further investigation.
> 
> BR
> Igor Yurchenko

