<div dir="ltr"><div>Hi guys</div><div>I have given up and it looks like I need your help. <br><br></div><div>The setup:<br></div><div>I have a two-host PostgreSQL 15 streaming-replication cluster (IPs: 10.65.188.55 and 10.65.188.56), with two Pgpool-II 4.2.15 instances installed on the same hosts. <br>The pgpool instances manage the VIP 10.65.188.59. <br></div><div><br></div><div>Replication between the pg15 instances works well, and each instance's recovery status is detected reliably: <br><br><span style="font-family:monospace">[root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.55 -p 5432 template1 -c 'select pg_is_in_recovery()'<br> pg_is_in_recovery<br>-------------------<br> f<br>(1 row)<br><br>[root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.56 -p 5432 template1 -c 'select pg_is_in_recovery()'<br> pg_is_in_recovery<br>-------------------<br> t<br>(1 row)<br><br>[root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.55 -p 5432 template1 -c 'select * from pg_stat_replication'<br> pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | f<br>lush_lag | replay_lag | sync_priority | sync_state | reply_time<br>-------+----------+---------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+-----------+--<br>---------+------------+---------------+------------+-------------------------------<br> 13875 | 16384 | fabrix | walreceiver | 10.65.188.56 | | 47636 | 2024-07-14 14:30:54.344682+03 | 743 | streaming | 0/B0001C0 | 0/B0001C0 | 0/B0001C0 | 0/B0001C0 | |<br> | | 0 | async | 2024-07-14 14:48:41.550424+03<br>(1 row)<br><br>[root@pg-mgrdb1 pgpool-II]#</span><br></div><div><br></div><div><div><div 
dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>The databases of the pg15 cluster are reachable via the VIP:<br></div><div><br></div><div><span style="font-family:monospace">[root@pg-mgrdb1 pgpool-II]#<br>[root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.59 -p 9999 manager -c '\dt'<br> List of relations<br> Schema | Name | Type | Owner<br>--------+-----------------------+-------+----------<br> public | upgrade_version_table | table | postgres<br>(1 row)<br><br>[root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.59 -p 9999 manager -c 'select * from upgrade_version_table'<br> app_name | cur_version | upgrade_to<br>----------------+-------------+------------<br> brODtv Manager | 10.2.0.300 |<br>(1 row)<br><br>[root@pg-mgrdb1 pgpool-II]#</span><br></div><div><br></div><div>But pgpool cannot detect the cluster's primary node, and reports incorrect pool_nodes info:<br></div><div><br></div><div><span style="font-family:monospace">[root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.59 -p 9999 manager -c 'show pool_nodes'<br> node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change<br>---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------<br> 0 | 10.65.188.55 | 5432 | down | 0.500000 | standby | 0 | false | 0 | | | 2024-07-14 15:04:01<br> 1 | 10.65.188.56 | 5432 | up | 0.500000 | standby | 3 | true | 0 | | | 2024-07-14 15:04:01<br>(2 rows)<br><br>[root@pg-mgrdb1 pgpool-II]#</span><br></div><div><br></div><div>Right now, I'm testing it with a "single pgpool up" configuration (pgpool on the second host is down). 
The pgpool log shows that pgpool doesn't even try to touch the first (current primary) backend at 10.65.188.55:5432. It successfully connects to the second, 'standby' node and detects its status, and this second node successfully handles all SELECT queries thanks to the configured load balancing. But what's wrong with the first (primary) node? <br><br></div><div>In the pgpool log I don't see any attempts to connect to the first node. The part of the log where "find_primary_node" is first mentioned looks like this: <br></div><div><br></div><div><span style="font-family:monospace">2024-07-14 15:23:38: pid 18723: LOG: 2 watchdog nodes are configured for lifecheck<br>2024-07-14 15:23:38: pid 18723: LOCATION: wd_lifecheck.c:495<br>2024-07-14 15:23:38: pid 18723: LOG: watchdog nodes ID:0 Name:"10.65.188.55:9999 Linux pg-mgrdb1"<br>2024-07-14 15:23:38: pid 18723: DETAIL: Host:"10.65.188.55" WD Port:9000 pgpool-II port:9999<br>2024-07-14 15:23:38: pid 18723: LOCATION: wd_lifecheck.c:503<br>2024-07-14 15:23:38: pid 18723: LOG: watchdog nodes ID:1 Name:"Not_Set"<br>2024-07-14 15:23:38: pid 18723: DETAIL: Host:"10.65.188.56" WD Port:9000 pgpool-II port:9999<br>2024-07-14 15:23:38: pid 18723: LOCATION: wd_lifecheck.c:503<br>2024-07-14 15:23:38: pid 18706: LOG: find_primary_node_repeatedly: waiting for finding a primary node<br>2024-07-14 15:23:38: pid 18706: LOCATION: pgpool_main.c:3404<br>2024-07-14 15:23:38: pid 18706: LOG: find_primary_node: standby node is 1<br>2024-07-14 15:23:38: pid 18706: LOCATION: pgpool_main.c:3329<br>2024-07-14 15:23:39: pid 18729: LOG: set SO_REUSEPORT option to the socket<br>2024-07-14 15:23:39: pid 18729: LOCATION: wd_heartbeat.c:691<br>2024-07-14 15:23:39: pid 18729: LOG: creating watchdog heartbeat receive socket.<br>2024-07-14 15:23:39: pid 18729: DETAIL: set SO_REUSEPORT<br>2024-07-14 15:23:39: pid 18729: LOCATION: wd_heartbeat.c:231<br>2024-07-14 15:23:39: pid 18731: 
LOG: set SO_REUSEPORT option to the socket<br>2024-07-14 15:23:39: pid 18731: LOCATION: wd_heartbeat.c:691<br>2024-07-14 15:23:39: pid 18731: LOG: creating socket for sending heartbeat<br>2024-07-14 15:23:39: pid 18731: DETAIL: set SO_REUSEPORT<br>2024-07-14 15:23:39: pid 18731: LOCATION: wd_heartbeat.c:148<br>2024-07-14 15:23:39: pid 18706: LOG: find_primary_node: standby node is 1<br>2024-07-14 15:23:39: pid 18706: LOCATION: pgpool_main.c:3329<br>2024-07-14 15:23:40: pid 18706: LOG: find_primary_node: standby node is 1<br>2024-07-14 15:23:40: pid 18706: LOCATION: pgpool_main.c:3329<br>2024-07-14 15:23:41: pid 18706: LOG: find_primary_node: standby node is 1<br>2024-07-14 15:23:41: pid 18706: LOCATION: pgpool_main.c:3329</span><br><br></div><div>So I'm out of ideas. <br><br>The pgpool.conf, with comments stripped and keys sorted, looks like this: <br>*********************************************************************************************<br></div><div>
<div><span style="font-family:monospace">allow_clear_text_frontend_auth=off<br>allow_multiple_failover_requests_from_node=off<br>allow_sql_comments=off<br>app_name_redirect_preference_list=''<br>arping_cmd='/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I eth0'<br>arping_path='/usr/sbin'<br>authentication_timeout=1min<br>auto_failback_interval=1min<br>auto_failback=off<br>backend_application_name0=mgrdb1<br>backend_application_name1=mgrdb2<br>backend_clustering_mode='streaming_replication'<br>backend_data_directory0='/home/postgres/databases/fabrix2'<br>backend_data_directory1='/home/postgres/databases/fabrix2'<br>backend_flag0='ALLOW_TO_FAILOVER'<br>backend_flag1='ALLOW_TO_FAILOVER'<br>backend_hostname0='10.65.188.55'<br>backend_hostname1='10.65.188.56'<br>backend_port0=5432<br>backend_port1=5432<br>backend_weight0=1<br>backend_weight1=1<br>black_function_list='currval,lastval,nextval,setval'<br>cache_safe_memqcache_table_list=''<br>cache_unsafe_memqcache_table_list=''<br>check_temp_table=catalog<br>check_unlogged_table=on<br>child_life_time=1<br>child_max_connections=0<br>clear_memqcache_on_escalation=on<br>client_idle_limit=0<br>client_idle_limit_in_recovery=0<br>client_min_messages = debug1</span></div><div><span style="font-family:monospace">connection_cache=on<br>connection_life_time=0<br>connect_timeout=10000<br>database_redirect_preference_list=''<br>delay_threshold=10000000<br>delegate_IP='10.65.188.59'<br>detach_false_primary=off<br>disable_load_balance_on_write='transaction'<br>dml_adaptive_object_relationship_list=''<br>enable_consensus_with_half_votes=on<br>enable_pool_hba=on<br>enable_shared_relcache=on<br>failback_command=''<br>failover_command='/etc/pgpool-II/recovery/failover.sh %d %P %H 
%R'<br>failover_if_affected_tuples_mismatch=off<br>failover_on_backend_error=on<br>failover_require_consensus=on<br>failover_when_quorum_exists=off<br>follow_master_command=''<br>follow_primary_command=''<br>health_check_database=''<br>health_check_max_retries=0<br>health_check_password='***************'<br>health_check_period=1<br>health_check_retry_delay=1<br>health_check_timeout=0<br>health_check_user='postgres'<br>heartbeat_device0='eth0'<br>heartbeat_device1='eth0'<br>heartbeat_hostname0='10.65.188.55'<br>heartbeat_hostname1='10.65.188.56'<br>heartbeat_port0=9694<br>heartbeat_port1=9694<br>hostname0='10.65.188.55'<br>hostname1='10.65.188.56'<br>if_cmd_path='/sbin'<br>if_down_cmd='/usr/bin/sudo /sbin/ip addr del $_IP_$/18 dev eth0'<br>if_up_cmd='/usr/bin/sudo /sbin/ip addr add $_IP_$/18 dev eth0 label eth0:pg'<br>ignore_leading_white_space=on<br>insert_lock=off<br>listen_addresses='*'<br>listen_backlog_multiplier=1<br>load_balance_mode=off<br>lobj_lock_table=''<br></span></div><div><span style="font-family:monospace">log_client_messages=off<br>log_connections=off<br>log_destination='stderr'<br>log_directory='/var/log/pgpool_log'<br>logdir='/var/log'<br>log_disconnections=off<br>log_error_verbosity = verbose<br>log_filename='pgpool-%a.log'<br>logging_collector=on<br>log_hostname=off<br>log_line_prefix='%t: pid %p: '# printf-style string to output at beginning of each log line.<br>log_min_messages = 
debug1<br>log_per_node_statement=off<br>log_rotation_age=1d<br>log_rotation_size=0<br>log_standby_delay='if_over_threshold'<br>log_statement=off<br>log_truncate_on_rotation=on<br>master_slave_mode=on<br>master_slave_sub_mode='stream'<br>max_pool=1<br>memory_cache_enabled=off<br>memqcache_auto_cache_invalidation=on<br>memqcache_cache_block_size=1MB<br>memqcache_expire=0<br>memqcache_maxcache=400kB<br>memqcache_max_num_cache=1000000<br>memqcache_memcached_host='localhost'<br>memqcache_memcached_port=11211<br>memqcache_method='shmem'<br>memqcache_oiddir='/var/log/pgpool/oiddir'<br>memqcache_total_size=64MB<br>num_init_children=160<br>pcp_listen_addresses='*'<br>pcp_port=9898<br>pcp_socket_dir='/var/run/postgresql'<br>pgpool_port0=9999<br>pgpool_port1=9999<br>pid_file_name='/var/run/pgpool.pid'<br>ping_path='/bin'<br>pool_passwd='pool_passwd'<br>port=9999<br>primary_routing_query_pattern_list=''<br>read_only_function_list=''<br>recovery_1st_stage_command='recovery_1st_stage'<br>recovery_2nd_stage_command=''<br>recovery_password='**************'<br>recovery_timeout=90<br>recovery_user='postgres'</span></div></div><div><span style="font-family:monospace">relcache_expire=0<br>relcache_query_target=primary<br>relcache_size=256<br>replicate_select=off<br>replication_mode=off<br>replication_stop_on_mismatch=off<br>reserved_connections=0<br>reset_query_list='ABORT; DISCARD 
ALL'<br>search_primary_node_timeout=0<br>serialize_accept=off<br>socket_dir='/var/run/postgresql'<br>sr_check_database='postgres'<br>sr_check_password='fabrix'<br>sr_check_period=1<br>sr_check_user='****************'<br>ssl_ciphers='HIGH:MEDIUM:+3DES:!aNULL'<br>ssl_dh_params_file=''<br>ssl_ecdh_curve='prime256v1'<br>ssl=off<br>ssl_prefer_server_ciphers=off<br>statement_level_load_balance=off<br>syslog_facility='LOCAL0'<br>syslog_ident='pgpool'<br>trusted_servers=''<br>use_watchdog=on<br>wd_authkey=''<br>wd_de_escalation_command=''<br>wd_escalation_command=''<br>wd_heartbeat_deadtime=30<br>wd_heartbeat_keepalive=2<br>wd_interval=10<br>wd_ipc_socket_dir='/var/run/postgresql'<br>wd_lifecheck_dbname='template1'<br>wd_lifecheck_method='heartbeat'<br>wd_lifecheck_password=''<br>wd_lifecheck_query='SELECT 1'<br>wd_lifecheck_user='nobody'<br>wd_life_point=3<br>wd_monitoring_interfaces_list=''<br>wd_port0=9000<br>wd_port1=9000<br>wd_priority=1<br>write_function_list=''</span><br><div>*********************************************************************************************<br></div><div><br></div><div>pool_hba.conf:<br>
<div>*********************************************************************************************<br></div><div><span style="font-family:monospace">[root@pg-mgrdb1 pgpool_log]# cat /etc/pgpool-II/pool_hba.conf<br># TYPE DATABASE USER CIDR-ADDRESS METHOD<br>local all all trust<br>## IPv4 local connections<br>host all all 127.0.0.1/32 md5<br>host all postgres 10.65.188.55/32 md5<br>host all postgres 10.65.188.56/32 md5<br>host all postgres 10.65.188.59/32 md5<br>host all fabrix 10.65.188.55/32 md5<br>host all fabrix 10.65.188.56/32 md5<br>host all fabrix 10.65.188.59/32 md5<br>## Replication Hosts<br>host manager fabrix 10.65.188.55/32 md5<br>host manager fabrix 10.65.188.56/32 md5<br>host manager fabrix 10.65.188.59/32 md5<br>[root@pg-mgrdb1 pgpool_log]#</span><br></div>
</div><div>
<div>*********************************************************************************************<br></div><div></div>
</div><div><br>pool_passwd:<br>
<div>
<div>*********************************************************************************************<br></div><div></div>
</div><div></div>
<span style="font-family:monospace">[root@pg-mgrdb1 pgpool_log]# cat /etc/pgpool-II/pool_passwd<br>postgres:AESdLGx72XCAdQUybSno/SFuw==<br>fabrix:AES9Zd8L9XYGiZ5RMycfSFBFw==<br>[root@pg-mgrdb1 pgpool_log]#</span><br></div>
</div><div>
<div>
<div>*********************************************************************************************<br></div><div></div>
</div><div></div>
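<div>For reference, the per-backend check from the transcripts above can be wrapped into a small script (just a sketch; the <span style="font-family:monospace">check_backends</span> helper and the <span style="font-family:monospace">PSQL</span> override are my own naming, the credentials are the same ones used throughout this mail):</div>

```shell
# Sketch of the per-backend recovery-status check shown in the transcripts
# above -- roughly the question pgpool's sr_check asks each node.
# PSQL can be overridden (e.g. for a dry run); defaults to the pg15 binary.
PSQL="${PSQL:-/usr/pgsql-15/bin/psql}"

check_backends() {
    for host in "$@"; do
        # -A -t => unaligned, tuples only: prints just 't' or 'f'
        status=$(PGPASSWORD=fabrix "$PSQL" -U fabrix -h "$host" -p 5432 \
                 -A -t -c 'select pg_is_in_recovery()' template1 2>/dev/null) \
            || status="unreachable"
        echo "$host: $status"
    done
}

# check_backends 10.65.188.55 10.65.188.56
```

<div>With both backends healthy I'd expect 10.65.188.55 to report <span style="font-family:monospace">f</span> and 10.65.188.56 to report <span style="font-family:monospace">t</span>, matching the transcripts at the top, which is exactly what I see when I run the psql commands by hand.</div>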
<br></div><div>I'd highly appreciate any ideas or hints for further investigation. <br></div><div><br>BR</div><div>Igor Yurchenko<br></div></div></div></div></div></div>