[pgpool-hackers: 4488] Re: Pgpool-4.2 doesn't see primary node of Postgresql15 'stream' cluster

Igor Yurchenko harry.urcen at gmail.com
Mon Jul 15 17:56:14 JST 2024


Hi Tatsuo

Yeah, that was the reason. Many thanks!!!
Removing /var/log/pgpool_status solved the problem.
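
For the record, the full sequence was roughly this (a sketch; logdir='/var/log'
in my config, and pgpool is started directly rather than via a service manager):

pgpool -m fast stop          # stop pgpool
rm /var/log/pgpool_status    # drop the stale status file
pgpool                       # restart; pgpool recreates pgpool_status from live backend state

(If I'm not mistaken, starting with 'pgpool -D' / --discard-status would skip the manual rm.)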

The only remark is that, as I understand it (at least on pgpool-4.2), the
pcp port/socket only becomes available after successful backend
initialization. So in my case, without an initialized primary backend,
none of the pcp commands worked.
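
By the way, a simple probe to tell whether pcp is responsive yet (assuming
pcp_port=9898 as in my config, a pcp user named postgres defined in
pcp.conf, and a matching ~/.pcppass entry for -w):

pcp_node_count -h localhost -p 9898 -U postgres -w   # prints the number of configured backends once pcp is up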

BR
Igor Yurchenko


On Mon, 15 Jul 2024 at 06:51, Tatsuo Ishii <ishii at postgresql.org> wrote:

> Hi,
>
> My guess is that the pgpool_status file (which should be placed under
> logdir) is set to:
>
> down
> up
>
> i.e. pgpool thinks that the PostgreSQL primary node is detached from
> pgpool. If you are sure that the PostgreSQL primary is healthy, you
> can attach the primary node by using pcp_attach_node. An alternative
> method is to stop pgpool, remove the pgpool_status file, and restart
> pgpool. pgpool will then automatically recreate the pgpool_status file
> and mark the primary as up if it is up and running.
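>
> For example, something like this (a sketch; substitute your pcp user,
> and use the node id of the primary, 0 in your case):
>
> pcp_attach_node -h localhost -p 9898 -U <pcp_user> -n 0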
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese: http://www.sraoss.co.jp
>
>
> > Hi guys
> > I have given up and it looks like I need your help.
> >
> > A setup:
> > I have a 2-host PostgreSQL 15 streaming-replication cluster (IPs:
> > 10.65.188.55 and 10.65.188.56, respectively), with two Pgpool 4.2.15
> > instances installed on the same hosts.
> > The pgpool instances manage/delegate the VIP 10.65.188.59.
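> >
> > Which host currently holds the VIP can be checked with, for example
> > (per the if_up_cmd / eth0:pg labeling in the config below):
> >
> > ip -4 addr show dev eth0 | grep 10.65.188.59   # matches only on the node holding the VIP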
> >
> > Replication between the pg15 instances works well, and each instance's
> > recovery status is detected reliably:
> >
> > [root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.55 -p 5432 template1 -c 'select pg_is_in_recovery()'
> >  pg_is_in_recovery
> > -------------------
> >  f
> > (1 row)
> >
> > [root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.56 -p 5432 template1 -c 'select pg_is_in_recovery()'
> >  pg_is_in_recovery
> > -------------------
> >  t
> > (1 row)
> >
> > [root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.55 -p 5432 template1 -c 'select * from pg_stat_replication'
> >   pid  | usesysid | usename | application_name | client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
> > -------+----------+---------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
> >  13875 |    16384 | fabrix  | walreceiver      | 10.65.188.56 |                 |       47636 | 2024-07-14 14:30:54.344682+03 |          743 | streaming | 0/B0001C0 | 0/B0001C0 | 0/B0001C0 | 0/B0001C0  |           |           |            |             0 | async      | 2024-07-14 14:48:41.550424+03
> > (1 row)
> >
> > [root@pg-mgrdb1 pgpool-II]#
> >
> > The databases of the pg15 cluster are reachable/available via the VIP:
> >
> > [root@pg-mgrdb1 pgpool-II]#
> > [root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.59 -p 9999 manager -c '\dt'
> >                  List of relations
> >  Schema |         Name          | Type  |  Owner
> > --------+-----------------------+-------+----------
> >  public | upgrade_version_table | table | postgres
> > (1 row)
> >
> > [root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.59 -p 9999 manager -c 'select * from upgrade_version_table'
> >     app_name    | cur_version | upgrade_to
> > ----------------+-------------+------------
> >  brODtv Manager | 10.2.0.300  |
> > (1 row)
> >
> > [root@pg-mgrdb1 pgpool-II]#
> >
> > But pgpool cannot detect the cluster's primary node and reports
> > incorrect pool_nodes info:
> >
> > [root@pg-mgrdb1 pgpool-II]# PGPASSWORD=fabrix /usr/pgsql-15/bin/psql -U fabrix -h 10.65.188.59 -p 9999 manager -c 'show pool_nodes'
> >  node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
> > ---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
> >  0       | 10.65.188.55 | 5432 | down   | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2024-07-14 15:04:01
> >  1       | 10.65.188.56 | 5432 | up     | 0.500000  | standby | 3          | true              | 0                 |                   |                        | 2024-07-14 15:04:01
> > (2 rows)
> >
> > [root@pg-mgrdb1 pgpool-II]#
> >
> > Right now I'm testing with a "single pgpool up" configuration (the
> > pgpool on the second host is down). The pgpool log shows that pgpool
> > doesn't even try to touch the first (currently primary) backend at
> > 10.65.188.55:5432. It successfully connects to the second, 'standby'
> > node and detects its status, and that second node successfully handles
> > all SELECT queries thanks to the configured load balancing. But what's
> > wrong with the first, primary node?
> >
> > In the pgpool log I don't see any attempt to connect to the first
> > node. The part of the log where "find_primary_node" is first mentioned
> > looks like this:
> >
> > 2024-07-14 15:23:38: pid 18723: LOG:  2 watchdog nodes are configured for lifecheck
> > 2024-07-14 15:23:38: pid 18723: LOCATION:  wd_lifecheck.c:495
> > 2024-07-14 15:23:38: pid 18723: LOG:  watchdog nodes ID:0 Name:"10.65.188.55:9999 Linux pg-mgrdb1"
> > 2024-07-14 15:23:38: pid 18723: DETAIL:  Host:"10.65.188.55" WD Port:9000 pgpool-II port:9999
> > 2024-07-14 15:23:38: pid 18723: LOCATION:  wd_lifecheck.c:503
> > 2024-07-14 15:23:38: pid 18723: LOG:  watchdog nodes ID:1 Name:"Not_Set"
> > 2024-07-14 15:23:38: pid 18723: DETAIL:  Host:"10.65.188.56" WD Port:9000 pgpool-II port:9999
> > 2024-07-14 15:23:38: pid 18723: LOCATION:  wd_lifecheck.c:503
> > 2024-07-14 15:23:38: pid 18706: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
> > 2024-07-14 15:23:38: pid 18706: LOCATION:  pgpool_main.c:3404
> > 2024-07-14 15:23:38: pid 18706: LOG:  find_primary_node: standby node is 1
> > 2024-07-14 15:23:38: pid 18706: LOCATION:  pgpool_main.c:3329
> > 2024-07-14 15:23:39: pid 18729: LOG:  set SO_REUSEPORT option to the socket
> > 2024-07-14 15:23:39: pid 18729: LOCATION:  wd_heartbeat.c:691
> > 2024-07-14 15:23:39: pid 18729: LOG:  creating watchdog heartbeat receive socket.
> > 2024-07-14 15:23:39: pid 18729: DETAIL:  set SO_REUSEPORT
> > 2024-07-14 15:23:39: pid 18729: LOCATION:  wd_heartbeat.c:231
> > 2024-07-14 15:23:39: pid 18731: LOG:  set SO_REUSEPORT option to the socket
> > 2024-07-14 15:23:39: pid 18731: LOCATION:  wd_heartbeat.c:691
> > 2024-07-14 15:23:39: pid 18731: LOG:  creating socket for sending heartbeat
> > 2024-07-14 15:23:39: pid 18731: DETAIL:  set SO_REUSEPORT
> > 2024-07-14 15:23:39: pid 18731: LOCATION:  wd_heartbeat.c:148
> > 2024-07-14 15:23:39: pid 18706: LOG:  find_primary_node: standby node is 1
> > 2024-07-14 15:23:39: pid 18706: LOCATION:  pgpool_main.c:3329
> > 2024-07-14 15:23:40: pid 18706: LOG:  find_primary_node: standby node is 1
> > 2024-07-14 15:23:40: pid 18706: LOCATION:  pgpool_main.c:3329
> > 2024-07-14 15:23:41: pid 18706: LOG:  find_primary_node: standby node is 1
> > 2024-07-14 15:23:41: pid 18706: LOCATION:  pgpool_main.c:3329
> >
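> > For completeness, pcp shows the same (wrong) picture as show pool_nodes
> > (a sketch, assuming pcp_port=9898 from the config below and a pcp user
> > named postgres):
> >
> > pcp_node_info -h localhost -p 9898 -U postgres -w -n 0   # reports 10.65.188.55 as down, matching show pool_nodes
> > pcp_node_info -h localhost -p 9898 -U postgres -w -n 1   # reports 10.65.188.56 as up
> >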
> > So I'm out of ideas.
> >
> > The pgpool.conf, stripped of comments and sorted, looks like this:
> >
> *********************************************************************************************
> > allow_clear_text_frontend_auth=off
> > allow_multiple_failover_requests_from_node=off
> > allow_sql_comments=off
> > app_name_redirect_preference_list=''
> > arping_cmd='/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I eth0'
> > arping_path='/usr/sbin'
> > authentication_timeout=1min
> > auto_failback_interval=1min
> > auto_failback=off
> > backend_application_name0=mgrdb1
> > backend_application_name1=mgrdb2
> > backend_clustering_mode='streaming_replication'
> > backend_data_directory0='/home/postgres/databases/fabrix2'
> > backend_data_directory1='/home/postgres/databases/fabrix2'
> > backend_flag0='ALLOW_TO_FAILOVER'
> > backend_flag1='ALLOW_TO_FAILOVER'
> > backend_hostname0='10.65.188.55'
> > backend_hostname1='10.65.188.56'
> > backend_port0=5432
> > backend_port1=5432
> > backend_weight0=1
> > backend_weight1=1
> > black_function_list='currval,lastval,nextval,setval'
> > cache_safe_memqcache_table_list=''
> > cache_unsafe_memqcache_table_list=''
> > check_temp_table=catalog
> > check_unlogged_table=on
> > child_life_time=1
> > child_max_connections=0
> > clear_memqcache_on_escalation=on
> > client_idle_limit=0
> > client_idle_limit_in_recovery=0
> > client_min_messages = debug1
> > connection_cache=on
> > connection_life_time=0
> > connect_timeout=10000
> > database_redirect_preference_list=''
> > delay_threshold=10000000
> > delegate_IP='10.65.188.59'
> > detach_false_primary=off
> > disable_load_balance_on_write='transaction'
> > dml_adaptive_object_relationship_list=''
> > enable_consensus_with_half_votes=on
> > enable_pool_hba=on
> > enable_shared_relcache=on
> > failback_command=''
> > failover_command='/etc/pgpool-II/recovery/failover.sh %d %P %H %R'
> > failover_if_affected_tuples_mismatch=off
> > failover_on_backend_error=on
> > failover_require_consensus=on
> > failover_when_quorum_exists=off
> > follow_master_command=''
> > follow_primary_command=''
> > health_check_database=''
> > health_check_max_retries=0
> > health_check_password='***************'
> > health_check_period=1
> > health_check_retry_delay=1
> > health_check_timeout=0
> > health_check_user='postgres'
> > heartbeat_device0='eth0'
> > heartbeat_device1='eth0'
> > heartbeat_hostname0='10.65.188.55'
> > heartbeat_hostname1='10.65.188.56'
> > heartbeat_port0=9694
> > heartbeat_port1=9694
> > hostname0='10.65.188.55'
> > hostname1='10.65.188.56'
> > if_cmd_path='/sbin'
> > if_down_cmd='/usr/bin/sudo /sbin/ip addr del $_IP_$/18 dev eth0'
> > if_up_cmd='/usr/bin/sudo /sbin/ip addr add $_IP_$/18 dev eth0 label eth0:pg'
> > ignore_leading_white_space=on
> > insert_lock=off
> > listen_addresses='*'
> > listen_backlog_multiplier=1
> > load_balance_mode=off
> > lobj_lock_table=''
> > log_client_messages=off
> > log_connections=off
> > log_destination='stderr'
> > log_directory='/var/log/pgpool_log'
> > logdir='/var/log'
> > log_disconnections=off
> > log_error_verbosity = verbose
> > log_filename='pgpool-%a.log'
> > logging_collector=on
> > log_hostname=off
> > log_line_prefix='%t: pid %p: '   # printf-style string to output at beginning of each log line
> > log_min_messages = debug1
> > log_per_node_statement=off
> > log_rotation_age=1d
> > log_rotation_size=0
> > log_standby_delay='if_over_threshold'
> > log_statement=off
> > log_truncate_on_rotation=on
> > master_slave_mode=on
> > master_slave_sub_mode='stream'
> > max_pool=1
> > memory_cache_enabled=off
> > memqcache_auto_cache_invalidation=on
> > memqcache_cache_block_size=1MB
> > memqcache_expire=0
> > memqcache_maxcache=400kB
> > memqcache_max_num_cache=1000000
> > memqcache_memcached_host='localhost'
> > memqcache_memcached_port=11211
> > memqcache_method='shmem'
> > memqcache_oiddir='/var/log/pgpool/oiddir'
> > memqcache_total_size=64MB
> > num_init_children=160
> > pcp_listen_addresses='*'
> > pcp_port=9898
> > pcp_socket_dir='/var/run/postgresql'
> > pgpool_port0=9999
> > pgpool_port1=9999
> > pid_file_name='/var/run/pgpool.pid'
> > ping_path='/bin'
> > pool_passwd='pool_passwd'
> > port=9999
> > primary_routing_query_pattern_list=''
> > read_only_function_list=''
> > recovery_1st_stage_command='recovery_1st_stage'
> > recovery_2nd_stage_command=''
> > recovery_password='**************'
> > recovery_timeout=90
> > recovery_user='postgres'
> > relcache_expire=0
> > relcache_query_target=primary
> > relcache_size=256
> > replicate_select=off
> > replication_mode=off
> > replication_stop_on_mismatch=off
> > reserved_connections=0
> > reset_query_list='ABORT; DISCARD ALL'
> > search_primary_node_timeout=0
> > serialize_accept=off
> > socket_dir='/var/run/postgresql'
> > sr_check_database='postgres'
> > sr_check_password='fabrix'
> > sr_check_period=1
> > sr_check_user='****************'
> > ssl_ciphers='HIGH:MEDIUM:+3DES:!aNULL'
> > ssl_dh_params_file=''
> > ssl_ecdh_curve='prime256v1'
> > ssl=off
> > ssl_prefer_server_ciphers=off
> > statement_level_load_balance=off
> > syslog_facility='LOCAL0'
> > syslog_ident='pgpool'
> > trusted_servers=''
> > use_watchdog=on
> > wd_authkey=''
> > wd_de_escalation_command=''
> > wd_escalation_command=''
> > wd_heartbeat_deadtime=30
> > wd_heartbeat_keepalive=2
> > wd_interval=10
> > wd_ipc_socket_dir='/var/run/postgresql'
> > wd_lifecheck_dbname='template1'
> > wd_lifecheck_method='heartbeat'
> > wd_lifecheck_password=''
> > wd_lifecheck_query='SELECT 1'
> > wd_lifecheck_user='nobody'
> > wd_life_point=3
> > wd_monitoring_interfaces_list=''
> > wd_port0=9000
> > wd_port1=9000
> > wd_priority=1
> > write_function_list=''
> >
> *********************************************************************************************
> >
> > pool_hba.conf:
> >
> *********************************************************************************************
> > [root@pg-mgrdb1 pgpool_log]# cat /etc/pgpool-II/pool_hba.conf
> > # TYPE  DATABASE        USER    CIDR-ADDRESS    METHOD
> > local   all     all     trust
> > ## IPv4 local connections
> > host    all     all     127.0.0.1/32    md5
> > host    all     postgres        10.65.188.55/32 md5
> > host    all     postgres        10.65.188.56/32 md5
> > host    all     postgres        10.65.188.59/32 md5
> > host    all     fabrix  10.65.188.55/32 md5
> > host    all     fabrix  10.65.188.56/32 md5
> > host    all     fabrix  10.65.188.59/32 md5
> > ## Replication Hosts
> > host    manager fabrix  10.65.188.55/32 md5
> > host    manager fabrix  10.65.188.56/32 md5
> > host    manager fabrix  10.65.188.59/32 md5
> > [root@pg-mgrdb1 pgpool_log]#
> >
> *********************************************************************************************
> >
> > pool_passwd:
> >
> *********************************************************************************************
> > [root@pg-mgrdb1 pgpool_log]# cat /etc/pgpool-II/pool_passwd
> > postgres:AESdLGx72XCAdQUybSno/SFuw==
> > fabrix:AES9Zd8L9XYGiZ5RMycfSFBFw==
> > [root@pg-mgrdb1 pgpool_log]#
> >
> *********************************************************************************************
> >
> > I'd highly appreciate any ideas or hints for further investigation.
> >
> > BR
> > Igor Yurchenko
>