[pgpool-hackers: 4489] Re: Pgpool-4.2 doesn't see primary node of Postgresql15 'stream' cluster

Tatsuo Ishii ishii at postgresql.org
Tue Jul 16 09:27:09 JST 2024


> Hi Tatsuo
> 
> Yeah, it was the reason. Great thanks!!!
> Removing the /var/log/pgpool_status solves the problem.

Glad to hear that!

> The only remark is that, as I understand it (at least on pgpool-4.2), a pcp
> port/socket becomes available only after successful backend
> initialization. So, in my case, without an initialized primary backend
> no pcp commands worked.

More precisely, pcp commands become available after the search for the
primary node is done.  In your case, pcp commands were not available
until that search finished or search_primary_node_timeout passed,
because there was no primary node at that moment.  Since the default
value of search_primary_node_timeout is 5 minutes, you needed to wait
5 minutes before pcp commands became available.  You can make
search_primary_node_timeout shorter if you like.
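
For example (just a sketch based on the settings you posted; the pcp
port and the node id come from your config and your "show pool_nodes"
output, while the pcp user "postgres" is only an assumption), you
could shorten the timeout in pgpool.conf and reload or restart pgpool:

  # give up searching for a primary node after 30 seconds
  search_primary_node_timeout = 30s

Or, if you are sure the primary is healthy, re-attach it once pcp
becomes available instead of waiting for the timeout:

  # node 0 is the node shown as "down" in your show pool_nodes output
  pcp_attach_node -h localhost -p 9898 -U postgres -n 0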

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> BR
> Igor Yurchenko
> 
> 
> On Mon, 15 Jul 2024 at 06:51, Tatsuo Ishii <ishii at postgresql.org> wrote:
> 
>> Hi,
>>
>> My guess is that the pgpool_status file (placed under logdir) is set
>> to:
>>
>> down
>> up
>>
>> i.e. pgpool thinks that the PostgreSQL primary node is detached from
>> pgpool. If you are sure that the PostgreSQL primary is healthy, you
>> can attach the primary node by using pcp_attach_node. An alternative
>> method is to stop pgpool, remove the pgpool_status file, and restart
>> pgpool. pgpool will automatically recreate the pgpool_status file
>> and mark the primary up if it is up and running.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS LLC
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>>
>> > Hi guys
>> > I have given up and it looks like I need your help.
>> >
>> > The setup:
>> > I have a two-host PostgreSQL 15 streaming-replication cluster (IPs:
>> > 10.65.188.55 and 10.65.188.56 respectively) with two Pgpool 4.2.15 instances
>> > installed on the same hosts.
>> > The pgpool instances manage/delegate the VIP 10.65.188.59.
>> >
>> > Replication between the pg15 instances works well, and the instances'
>> > recovery status is detected reliably:
>> >
>> > [root at pg-mgrdb1 pgpool-II]#  PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U
>> > fabrix -h 10.65.188.55 -p 5432 template1 -c 'select pg_is_in_recovery()'
>> >  pg_is_in_recovery
>> > -------------------
>> >  f
>> > (1 row)
>> >
>> > [root at pg-mgrdb1 pgpool-II]#  PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U
>> > fabrix -h 10.65.188.56 -p 5432 template1 -c 'select pg_is_in_recovery()'
>> >  pg_is_in_recovery
>> > -------------------
>> >  t
>> > (1 row)
>> >
>> > [root at pg-mgrdb1 pgpool-II]#  PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U
>> > fabrix -h 10.65.188.55 -p 5432 template1 -c 'select * from
>> > pg_stat_replication'
>> > -[ RECORD 1 ]----+------------------------------
>> > pid              | 13875
>> > usesysid         | 16384
>> > usename          | fabrix
>> > application_name | walreceiver
>> > client_addr      | 10.65.188.56
>> > client_hostname  |
>> > client_port      | 47636
>> > backend_start    | 2024-07-14 14:30:54.344682+03
>> > backend_xmin     | 743
>> > state            | streaming
>> > sent_lsn         | 0/B0001C0
>> > write_lsn        | 0/B0001C0
>> > flush_lsn        | 0/B0001C0
>> > replay_lsn       | 0/B0001C0
>> > write_lag        |
>> > flush_lag        |
>> > replay_lag       |
>> > sync_priority    | 0
>> > sync_state       | async
>> > reply_time       | 2024-07-14 14:48:41.550424+03
>> > (1 row)
>> >
>> > [root at pg-mgrdb1 pgpool-II]#
>> >
>> > The databases of the pg15 cluster are reachable via the VIP:
>> >
>> > [root at pg-mgrdb1 pgpool-II]#
>> > [root at pg-mgrdb1 pgpool-II]#  PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U
>> > fabrix -h 10.65.188.59 -p 9999 manager -c '\dt'
>> >                  List of relations
>> >  Schema |         Name          | Type  |  Owner
>> > --------+-----------------------+-------+----------
>> >  public | upgrade_version_table | table | postgres
>> > (1 row)
>> >
>> > [root at pg-mgrdb1 pgpool-II]#  PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U
>> > fabrix -h 10.65.188.59 -p 9999 manager -c 'select * from
>> > upgrade_version_table'
>> >     app_name    | cur_version | upgrade_to
>> > ----------------+-------------+------------
>> >  brODtv Manager | 10.2.0.300  |
>> > (1 row)
>> >
>> > [root at pg-mgrdb1 pgpool-II]#
>> >
>> > But pgpool cannot detect the cluster's primary node and shows
>> > incorrect pool_nodes info:
>> >
>> > [root at pg-mgrdb1 pgpool-II]#  PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U
>> > fabrix -h 10.65.188.59 -p 9999 manager -c 'show pool_nodes'
>> > -[ RECORD 1 ]----------+---------------------
>> > node_id                | 0
>> > hostname               | 10.65.188.55
>> > port                   | 5432
>> > status                 | down
>> > lb_weight              | 0.500000
>> > role                   | standby
>> > select_cnt             | 0
>> > load_balance_node      | false
>> > replication_delay      | 0
>> > replication_state      |
>> > replication_sync_state |
>> > last_status_change     | 2024-07-14 15:04:01
>> > -[ RECORD 2 ]----------+---------------------
>> > node_id                | 1
>> > hostname               | 10.65.188.56
>> > port                   | 5432
>> > status                 | up
>> > lb_weight              | 0.500000
>> > role                   | standby
>> > select_cnt             | 3
>> > load_balance_node      | true
>> > replication_delay      | 0
>> > replication_state      |
>> > replication_sync_state |
>> > last_status_change     | 2024-07-14 15:04:01
>> > (2 rows)
>> >
>> > [root at pg-mgrdb1 pgpool-II]#
>> >
>> > Right now, I'm testing with a "single pgpool up" configuration (pgpool
>> > on the second host is down). The pgpool log shows that pgpool doesn't
>> > even try to touch the first (current primary) backend at 10.65.188.55:5432.
>> > It successfully connects to the second 'standby' node and detects its
>> > status, and this second node successfully handles all SELECT queries
>> > thanks to the configured load balancing. But what's wrong with the
>> > first (primary) node?
>> >
>> > In the pgpool log I don't see any attempts to connect to the first node.
>> > The part of the log where "find_primary_node" is first mentioned looks
>> > like this:
>> >
>> > 2024-07-14 15:23:38: pid 18723: LOG:  2 watchdog nodes are configured for lifecheck
>> > 2024-07-14 15:23:38: pid 18723: LOCATION:  wd_lifecheck.c:495
>> > 2024-07-14 15:23:38: pid 18723: LOG:  watchdog nodes ID:0 Name:"10.65.188.55:9999 Linux pg-mgrdb1"
>> > 2024-07-14 15:23:38: pid 18723: DETAIL:  Host:"10.65.188.55" WD Port:9000 pgpool-II port:9999
>> > 2024-07-14 15:23:38: pid 18723: LOCATION:  wd_lifecheck.c:503
>> > 2024-07-14 15:23:38: pid 18723: LOG:  watchdog nodes ID:1 Name:"Not_Set"
>> > 2024-07-14 15:23:38: pid 18723: DETAIL:  Host:"10.65.188.56" WD Port:9000 pgpool-II port:9999
>> > 2024-07-14 15:23:38: pid 18723: LOCATION:  wd_lifecheck.c:503
>> > 2024-07-14 15:23:38: pid 18706: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
>> > 2024-07-14 15:23:38: pid 18706: LOCATION:  pgpool_main.c:3404
>> > 2024-07-14 15:23:38: pid 18706: LOG:  find_primary_node: standby node is 1
>> > 2024-07-14 15:23:38: pid 18706: LOCATION:  pgpool_main.c:3329
>> > 2024-07-14 15:23:39: pid 18729: LOG:  set SO_REUSEPORT option to the socket
>> > 2024-07-14 15:23:39: pid 18729: LOCATION:  wd_heartbeat.c:691
>> > 2024-07-14 15:23:39: pid 18729: LOG:  creating watchdog heartbeat receive socket.
>> > 2024-07-14 15:23:39: pid 18729: DETAIL:  set SO_REUSEPORT
>> > 2024-07-14 15:23:39: pid 18729: LOCATION:  wd_heartbeat.c:231
>> > 2024-07-14 15:23:39: pid 18731: LOG:  set SO_REUSEPORT option to the socket
>> > 2024-07-14 15:23:39: pid 18731: LOCATION:  wd_heartbeat.c:691
>> > 2024-07-14 15:23:39: pid 18731: LOG:  creating socket for sending heartbeat
>> > 2024-07-14 15:23:39: pid 18731: DETAIL:  set SO_REUSEPORT
>> > 2024-07-14 15:23:39: pid 18731: LOCATION:  wd_heartbeat.c:148
>> > 2024-07-14 15:23:39: pid 18706: LOG:  find_primary_node: standby node is 1
>> > 2024-07-14 15:23:39: pid 18706: LOCATION:  pgpool_main.c:3329
>> > 2024-07-14 15:23:40: pid 18706: LOG:  find_primary_node: standby node is 1
>> > 2024-07-14 15:23:40: pid 18706: LOCATION:  pgpool_main.c:3329
>> > 2024-07-14 15:23:41: pid 18706: LOG:  find_primary_node: standby node is 1
>> > 2024-07-14 15:23:41: pid 18706: LOCATION:  pgpool_main.c:3329
>> >
>> > So I'm out of ideas at this point.
>> >
>> > The pgpool.conf, stripped of comments and sorted, looks like this:
>> >
>> *********************************************************************************************
>> > allow_clear_text_frontend_auth=off
>> > allow_multiple_failover_requests_from_node=off
>> > allow_sql_comments=off
>> > app_name_redirect_preference_list=''
>> > arping_cmd='/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I eth0'
>> > arping_path='/usr/sbin'
>> > authentication_timeout=1min
>> > auto_failback_interval=1min
>> > auto_failback=off
>> > backend_application_name0=mgrdb1
>> > backend_application_name1=mgrdb2
>> > backend_clustering_mode='streaming_replication'
>> > backend_data_directory0='/home/postgres/databases/fabrix2'
>> > backend_data_directory1='/home/postgres/databases/fabrix2'
>> > backend_flag0='ALLOW_TO_FAILOVER'
>> > backend_flag1='ALLOW_TO_FAILOVER'
>> > backend_hostname0='10.65.188.55'
>> > backend_hostname1='10.65.188.56'
>> > backend_port0=5432
>> > backend_port1=5432
>> > backend_weight0=1
>> > backend_weight1=1
>> > black_function_list='currval,lastval,nextval,setval'
>> > cache_safe_memqcache_table_list=''
>> > cache_unsafe_memqcache_table_list=''
>> > check_temp_table=catalog
>> > check_unlogged_table=on
>> > child_life_time=1
>> > child_max_connections=0
>> > clear_memqcache_on_escalation=on
>> > client_idle_limit=0
>> > client_idle_limit_in_recovery=0
>> > client_min_messages = debug1
>> > connection_cache=on
>> > connection_life_time=0
>> > connect_timeout=10000
>> > database_redirect_preference_list=''
>> > delay_threshold=10000000
>> > delegate_IP='10.65.188.59'
>> > detach_false_primary=off
>> > disable_load_balance_on_write='transaction'
>> > dml_adaptive_object_relationship_list=''
>> > enable_consensus_with_half_votes=on
>> > enable_pool_hba=on
>> > enable_shared_relcache=on
>> > failback_command=''
>> > failover_command='/etc/pgpool-II/recovery/failover.sh %d %P %H %R'
>> > failover_if_affected_tuples_mismatch=off
>> > failover_on_backend_error=on
>> > failover_require_consensus=on
>> > failover_when_quorum_exists=off
>> > follow_master_command=''
>> > follow_primary_command=''
>> > health_check_database=''
>> > health_check_max_retries=0
>> > health_check_password='***************'
>> > health_check_period=1
>> > health_check_retry_delay=1
>> > health_check_timeout=0
>> > health_check_user='postgres'
>> > heartbeat_device0='eth0'
>> > heartbeat_device1='eth0'
>> > heartbeat_hostname0='10.65.188.55'
>> > heartbeat_hostname1='10.65.188.56'
>> > heartbeat_port0=9694
>> > heartbeat_port1=9694
>> > hostname0='10.65.188.55'
>> > hostname1='10.65.188.56'
>> > if_cmd_path='/sbin'
>> > if_down_cmd='/usr/bin/sudo /sbin/ip addr del $_IP_$/18 dev eth0'
>> > if_up_cmd='/usr/bin/sudo /sbin/ip addr add $_IP_$/18 dev eth0 label eth0:pg'
>> > ignore_leading_white_space=on
>> > insert_lock=off
>> > listen_addresses='*'
>> > listen_backlog_multiplier=1
>> > load_balance_mode=off
>> > lobj_lock_table=''
>> > log_client_messages=off
>> > log_connections=off
>> > log_destination='stderr'
>> > log_directory='/var/log/pgpool_log'
>> > logdir='/var/log'
>> > log_disconnections=off
>> > log_error_verbosity = verbose
>> > log_filename='pgpool-%a.log'
>> > logging_collector=on
>> > log_hostname=off
>> > log_line_prefix='%t: pid %p: '# printf-style string to output at beginning of each log line.
>> > log_min_messages = debug1
>> > log_per_node_statement=off
>> > log_rotation_age=1d
>> > log_rotation_size=0
>> > log_standby_delay='if_over_threshold'
>> > log_statement=off
>> > log_truncate_on_rotation=on
>> > master_slave_mode=on
>> > master_slave_sub_mode='stream'
>> > max_pool=1
>> > memory_cache_enabled=off
>> > memqcache_auto_cache_invalidation=on
>> > memqcache_cache_block_size=1MB
>> > memqcache_expire=0
>> > memqcache_maxcache=400kB
>> > memqcache_max_num_cache=1000000
>> > memqcache_memcached_host='localhost'
>> > memqcache_memcached_port=11211
>> > memqcache_method='shmem'
>> > memqcache_oiddir='/var/log/pgpool/oiddir'
>> > memqcache_total_size=64MB
>> > num_init_children=160
>> > pcp_listen_addresses='*'
>> > pcp_port=9898
>> > pcp_socket_dir='/var/run/postgresql'
>> > pgpool_port0=9999
>> > pgpool_port1=9999
>> > pid_file_name='/var/run/pgpool.pid'
>> > ping_path='/bin'
>> > pool_passwd='pool_passwd'
>> > port=9999
>> > primary_routing_query_pattern_list=''
>> > read_only_function_list=''
>> > recovery_1st_stage_command='recovery_1st_stage'
>> > recovery_2nd_stage_command=''
>> > recovery_password='**************'
>> > recovery_timeout=90
>> > recovery_user='postgres'
>> > relcache_expire=0
>> > relcache_query_target=primary
>> > relcache_size=256
>> > replicate_select=off
>> > replication_mode=off
>> > replication_stop_on_mismatch=off
>> > reserved_connections=0
>> > reset_query_list='ABORT; DISCARD ALL'
>> > search_primary_node_timeout=0
>> > serialize_accept=off
>> > socket_dir='/var/run/postgresql'
>> > sr_check_database='postgres'
>> > sr_check_password='fabrix'
>> > sr_check_period=1
>> > sr_check_user='****************'
>> > ssl_ciphers='HIGH:MEDIUM:+3DES:!aNULL'
>> > ssl_dh_params_file=''
>> > ssl_ecdh_curve='prime256v1'
>> > ssl=off
>> > ssl_prefer_server_ciphers=off
>> > statement_level_load_balance=off
>> > syslog_facility='LOCAL0'
>> > syslog_ident='pgpool'
>> > trusted_servers=''
>> > use_watchdog=on
>> > wd_authkey=''
>> > wd_de_escalation_command=''
>> > wd_escalation_command=''
>> > wd_heartbeat_deadtime=30
>> > wd_heartbeat_keepalive=2
>> > wd_interval=10
>> > wd_ipc_socket_dir='/var/run/postgresql'
>> > wd_lifecheck_dbname='template1'
>> > wd_lifecheck_method='heartbeat'
>> > wd_lifecheck_password=''
>> > wd_lifecheck_query='SELECT 1'
>> > wd_lifecheck_user='nobody'
>> > wd_life_point=3
>> > wd_monitoring_interfaces_list=''
>> > wd_port0=9000
>> > wd_port1=9000
>> > wd_priority=1
>> > write_function_list=''
>> >
>> *********************************************************************************************
>> >
>> > pool_hba.conf:
>> >
>> *********************************************************************************************
>> > [root at pg-mgrdb1 pgpool_log]# cat /etc/pgpool-II/pool_hba.conf
>> > # TYPE  DATABASE        USER    CIDR-ADDRESS    METHOD
>> > local   all     all     trust
>> > ## IPv4 local connections
>> > host    all     all     127.0.0.1/32    md5
>> > host    all     postgres        10.65.188.55/32 md5
>> > host    all     postgres        10.65.188.56/32 md5
>> > host    all     postgres        10.65.188.59/32 md5
>> > host    all     fabrix  10.65.188.55/32 md5
>> > host    all     fabrix  10.65.188.56/32 md5
>> > host    all     fabrix  10.65.188.59/32 md5
>> > ## Replication Hosts
>> > host    manager fabrix  10.65.188.55/32 md5
>> > host    manager fabrix  10.65.188.56/32 md5
>> > host    manager fabrix  10.65.188.59/32 md5
>> > [root at pg-mgrdb1 pgpool_log]#
>> >
>> *********************************************************************************************
>> >
>> > pool_passwd:
>> >
>> *********************************************************************************************
>> > [root at pg-mgrdb1 pgpool_log]# cat /etc/pgpool-II/pool_passwd
>> > postgres:AESdLGx72XCAdQUybSno/SFuw==
>> > fabrix:AES9Zd8L9XYGiZ5RMycfSFBFw==
>> > [root at pg-mgrdb1 pgpool_log]#
>> >
>> *********************************************************************************************
>> >
>> > I'd highly appreciate any ideas or hints for further investigation.
>> >
>> > BR
>> > Igor Yurchenko
>>

