<div dir="ltr">Thank you for your response.<div><br></div><div>Answers to your questions:</div><div>1. I am using pgpool <span style="background-color:rgb(204,204,204)">4.0.4</span></div><div><span style="background-color:rgb(255,255,255)">2. DEBUG logging was enabled specifically to investigate this issue (the frozen PCP commands).</span></div><div><span style="background-color:rgb(255,255,255)">3. Yes, all the parameters you mentioned are set. (The full pgpool configuration is included below for reference.)</span></div><div><span style="background-color:rgb(255,255,255)"><br></span></div><div>>If the network goes down, watchdog will detect the network failure and shut itself down.<br>>To avoid such problems, it is recommended to shut down pgpool before restarting the network.<br></div><div><br></div><div>1. The network was not down; it was restarted, i.e. it came back up within seconds.</div><div>As I understand it, the watchdog settings <span style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:12px;font-weight:700">wd_interval </span><span style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:12px">and</span> <span style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:12px;font-weight:700">wd_life_point </span><span style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:12px">should have tolerated such a brief network interruption. Is that not the case?</span></div><div><span style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:12px">2. In my production environment, a network restart or glitch is usually outside the application's control, so I cannot stop pgpool pre-emptively.</span></div><div><span style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:12px"><br></span></div><div>My follow-up questions:</div><div>-------</div><div><span style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:12px">1. 
Am I hitting a bug in Pgpool-II?</span></div><div><span style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:12px">2. Is this scenario (a brief network glitch) handled better in a newer Pgpool-II version? If so, I could upgrade, ideally with minimal changes to the configuration.</span></div><div><span style="background-color:rgb(255,255,255)">-------------------------</span></div><div>allow_clear_text_frontend_auth = off<br>allow_multiple_failover_requests_from_node = off<br>allow_sql_comments = off<br>app_name_redirect_preference_list = ''<br>arping_cmd = 'arping -U $_IP_$ -w 1'<br>arping_path = '/sbin'<br>authentication_timeout = 60<br>backend_data_directory0 = '/db/data'<br>backend_data_directory1 = '/db/data'<br>backend_data_directory2 = '/db/data'<br>backend_flag0 = 'ALLOW_TO_FAILOVER'<br>backend_flag1 = 'ALLOW_TO_FAILOVER'<br>backend_flag2 = 'ALLOW_TO_FAILOVER'<br>backend_hostname0 = '10.108.104.31'<br>backend_hostname1 = '10.108.104.32'<br>backend_hostname2 = '10.108.104.33'<br>backend_port0 = 5432<br>backend_port1 = 5432<br>backend_port2 = 5432<br>backend_weight0 = 1<br>backend_weight1 = 1<br>backend_weight2 = 1<br>black_function_list = 'currval,lastval,nextval,setval'<br>black_memqcache_table_list = ''<br>black_query_pattern_list = ''<br>check_temp_table = on<br>check_unlogged_table = on<br>child_life_time = 300<br>child_max_connections = 0<br>clear_memqcache_on_escalation = on<br>client_idle_limit = 0<br>client_idle_limit_in_recovery = 0<br>connect_timeout = 10000<br>connection_cache = on<br>connection_life_time = 0<br>database_redirect_preference_list = ''<br>delay_threshold = 10000000<br>delegate_IP = ''<br>detach_false_primary = off<br>disable_load_balance_on_write = 'transaction'<br>enable_pool_hba = off<br>failback_command = ''<br>failover_command = '/usr/local/etc/failover.sh %d %h %p %D %m %H %M %P %r %R'<br>failover_if_affected_tuples_mismatch = off<br>failover_on_backend_error = on<br>failover_require_consensus = 
on<br>failover_when_quorum_exists = on<br>follow_master_command = '/usr/local/etc/follow_master.sh %d %h %p %D %m %M %H %P %r %R'<br>health_check_database = ''<br>health_check_max_retries = 3<br>health_check_password = 'e2f2da4a027a41bf8517406dd9ca970e'<br>health_check_period = 5<br>health_check_retry_delay = 1<br>health_check_timeout = 30<br>health_check_user = 'pgpool'<br>heartbeat_destination0 = '10.108.104.32'<br>heartbeat_destination1 = '10.108.104.33'<br>heartbeat_destination_port0 = 9694 <br>heartbeat_destination_port1 = 9694<br>heartbeat_device0 = ''<br>heartbeat_device1 = ''<br>if_down_cmd = ''<br>if_up_cmd = ''<br>ifconfig_path = '/sbin'<br>ignore_leading_white_space = on<br>insert_lock = off<br>listen_addresses = '*'<br>listen_backlog_multiplier = 2<br>load_balance_mode = on<br>lobj_lock_table = ''<br>log_client_messages = off<br>log_connections = off<br>log_destination = 'syslog'<br>log_hostname = off<br>log_line_prefix = '%t: pid %p: ' <br>log_per_node_statement = off<br>log_standby_delay = 'if_over_threshold'<br>log_statement = off<br>logdir = '/tmp'<br>master_slave_mode = on<br>master_slave_sub_mode = 'stream'<br>max_pool = 4<br>memory_cache_enabled = off<br>memqcache_auto_cache_invalidation = on<br>memqcache_cache_block_size = 1048576<br>memqcache_expire = 0<br>memqcache_max_num_cache = 1000000<br>memqcache_maxcache = 409600<br>memqcache_memcached_host = 'localhost'<br>memqcache_memcached_port = 11211<br>memqcache_method = 'shmem'<br>memqcache_oiddir = '/var/log/pgpool/oiddir'<br>memqcache_total_size = 67108864<br>num_init_children = 32<br>other_pgpool_hostname0 = '10.108.104.32'<br>other_pgpool_hostname1 = '10.108.104.33'<br>other_pgpool_port0 = 9999<br>other_pgpool_port1 = 9999<br>other_wd_port0 = 9000<br>other_wd_port1 = 9000<br>pcp_listen_addresses = '*'<br>pcp_port = 9898<br>pcp_socket_dir = '/tmp'<br>pid_file_name = '/var/run/pgpool/pgpool.pid'<br>ping_path = '/bin'<br>pool_passwd = 'pool_passwd'<br>port = 9999<br>recovery_1st_stage_command = 
'recovery_1st_stage'<br>recovery_2nd_stage_command = ''<br>recovery_password = 'ZPH3Xnuh8ISKMZjSqLvIBQe_WTOzXbPF'<br>recovery_timeout = 90<br>recovery_user = 'postgres'<br>relcache_expire = 0<br>relcache_size = 256<br>replicate_select = off<br>replication_mode = off<br>replication_stop_on_mismatch = off<br>reset_query_list = 'ABORT; DISCARD ALL'<br>search_primary_node_timeout = 300<br>serialize_accept = off<br>socket_dir = '/var/run/pgpool/socket'<br>sr_check_database = 'postgres'<br>sr_check_password = 'e2f2da4a027a41bf8517406dd9ca970e'<br>sr_check_period = 10<br>sr_check_user = 'pgpool'<br>ssl = off<br>ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'<br>ssl_prefer_server_ciphers = off<br>syslog_facility = 'LOCAL1'<br>syslog_ident = 'pgpool'<br>trusted_servers = ''<br>use_watchdog = on<br>wd_authkey = ''<br>wd_de_escalation_command = '/usr/local/etc/desc.sh'<br>wd_escalation_command = '/usr/local/etc/esc.sh'<br>wd_heartbeat_deadtime = 30<br>wd_heartbeat_keepalive = 2<br>wd_heartbeat_port = 9694<br>wd_hostname = '10.108.104.31'<br>wd_interval = 10<br>wd_ipc_socket_dir = '/tmp'<br>wd_life_point = 3<br>wd_lifecheck_dbname = 'template1'<br>wd_lifecheck_method = 'heartbeat'<br>wd_lifecheck_password = ''<br>wd_lifecheck_query = 'SELECT 1'<br>wd_lifecheck_user = 'nobody'<br>wd_monitoring_interfaces_list = 'any' <br>wd_port = 9000<br>wd_priority = 1<br>white_function_list = ''<br>white_memqcache_table_list = ''<br></div><div><span style="background-color:rgb(255,255,255)"><br></span></div><div><span style="background-color:rgb(255,255,255)"><br></span></div><div><span style="background-color:rgb(204,204,204)"><br></span></div><div><br></div><div><br clear="all"><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><b>Thanks</b><br></div><i>Gopi</i><br></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Apr 19, 2023 at 1:13 PM Bo Peng <<a 
href="mailto:pengbo@sraoss.co.jp">pengbo@sraoss.co.jp</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello,<br>
<br>
> When I do:<br>
> systemctl restart systemd-networkd<br>
> <br>
> After that, I am not able to execute any PCP commands like:<br>
> pcp_watchdog_info<br>
> It is frozen.<br>
<br>
If the network goes down, watchdog will detect the network failure and shut itself down.<br>
To avoid such problems, it is recommended to shut down pgpool before restarting the network.<br>
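<br>
As a rough sketch of that recommendation (not an official procedure), the order of operations could be scripted as below. The systemd unit name "pgpool" is an assumption and may differ on your installation; the network unit is the one named in this thread. The helper defaults to a dry run that only returns the planned commands.<br>

```python
import subprocess

# Assumption: pgpool runs as a systemd unit named "pgpool"; adjust as needed.
PGPOOL_UNIT = "pgpool"
# The network service restarted in this thread.
NETWORK_UNIT = "systemd-networkd"


def planned_commands(pgpool_unit=PGPOOL_UNIT, network_unit=NETWORK_UNIT):
    """Return the commands in the order they should run."""
    return [
        ["systemctl", "stop", pgpool_unit],
        ["systemctl", "restart", network_unit],
        ["systemctl", "start", pgpool_unit],
    ]


def restart_network_safely(dry_run=True):
    """Run the stop/restart/start sequence, or only return it when dry_run is set."""
    commands = planned_commands()
    if not dry_run:
        for cmd in commands:
            # check=True aborts the sequence if any step fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    for cmd in restart_network_safely(dry_run=True):
        print(" ".join(cmd))
```

Calling restart_network_safely(dry_run=False) would execute the three steps for real and stop at the first failure.<br>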
<br>
BTW, which version of Pgpool-II are you using?<br>
<br>
> I tried restarting pgpool and postgres to no avail.<br>
> However, rebooting the system gets it back to a workable state. (PCP<br>
> commands are running again and I can attach the nodes back to the pool)<br>
> <br>
> The Pgpool-II logs show that pgpool was shut down due to the network<br>
> event:<br>
> -------------------------------<br>
> <br>
> 2023-04-17T15:27:25.190949+00:00 vmvrlcm-104-32 g[3042]: [268-1] 2023-04-17<br>
> 15:27:25: pid 3042: DEBUG: network event received<br>
> 2023-04-17T15:27:25.191041+00:00 vmvrlcm-104-32 g[3042]: [268-2] 2023-04-17<br>
> 15:27:25: pid 3042: DETAIL: deleted = YES Link change event = NO<br>
> 2023-04-17T15:27:25.191186+00:00 vmvrlcm-104-32 g[3042]: [269-1] 2023-04-17<br>
> 15:27:25: pid 3042: DEBUG: STATE MACHINE INVOKED WITH EVENT = NETWORK IP<br>
> IS REMOVED Current State = STANDBY<br>
> 2023-04-17T15:27:25.191243+00:00 vmvrlcm-104-32 g[3042]: [270-1] 2023-04-17<br>
> 15:27:25: pid 3042: DEBUG: network interface lo having flags 65609<br>
> 2023-04-17T15:27:25.191296+00:00 vmvrlcm-104-32 g[3042]: [271-1] 2023-04-17<br>
> 15:27:25: pid 3042: DEBUG: network interface eth0 having flags 69699<br>
> 2023-04-17T15:27:25.191352+00:00 vmvrlcm-104-32 g[3042]: [272-1] 2023-04-17<br>
> 15:27:25: pid 3042: DEBUG: network interface "eth0" link is active<br>
> 2023-04-17T15:27:25.191401+00:00 vmvrlcm-104-32 g[3042]: [273-1] 2023-04-17<br>
> 15:27:25: pid 3042: DEBUG: network interface "eth0" link is up<br>
> 2023-04-17T15:27:25.191449+00:00 vmvrlcm-104-32 g[3042]: [274-1] 2023-04-17<br>
> 15:27:25: pid 3042: DEBUG: network interface lo having flags 65609<br>
> 2023-04-17T15:27:25.191497+00:00 vmvrlcm-104-32 g[3042]: [275-1] 2023-04-17<br>
> 15:27:25: pid 3042: DEBUG: network interface "eth0" is up and we can<br>
> continue<br>
> 2023-04-17T15:27:25.191551+00:00 vmvrlcm-104-32 g[3042]: [276-1] 2023-04-17<br>
> 15:27:25: pid 3042: WARNING: network IP is removed and system has no IP is<br>
> assigned<br>
> 2023-04-17T15:27:25.191614+00:00 vmvrlcm-104-32 g[3042]: [276-2] 2023-04-17<br>
> 15:27:25: pid 3042: DETAIL: changing the state to in network trouble<br>
> 2023-04-17T15:27:25.191667+00:00 vmvrlcm-104-32 g[3042]: [277-1] 2023-04-17<br>
> 15:27:25: pid 3042: LOG: watchdog node state changed from [STANDBY] to [IN<br>
> NETWORK TROUBLE]<br>
> 2023-04-17T15:27:25.191713+00:00 vmvrlcm-104-32 g[3042]: [278-1] 2023-04-17<br>
> 15:27:25: pid 3042: DEBUG: STATE MACHINE INVOKED WITH EVENT = STATE<br>
> CHANGED Current State = IN NETWORK TROUBLE<br>
> 2023-04-17T15:27:25.191759+00:00 vmvrlcm-104-32 g[3042]: [279-1] 2023-04-17<br>
> 15:27:25: pid 3042: FATAL: system has lost the network<br>
> 2023-04-17T15:27:25.191807+00:00 vmvrlcm-104-32 g[3042]: [280-1] 2023-04-17<br>
> 15:27:25: pid 3042: LOG: Watchdog is shutting down<br>
> 2023-04-17T15:27:25.191849+00:00 vmvrlcm-104-32 g[3042]: [281-1] 2023-04-17<br>
> 15:27:25: pid 3042: DEBUG: sending packet, watchdog node:[<br>
> <a href="http://vmvrlcm-104-31.eng.vmware.com:9999" rel="noreferrer" target="_blank">vmvrlcm-104-31.eng.vmware.com:9999</a> Linux <a href="http://vmvrlcm-104-31.eng.vmware.com" rel="noreferrer" target="_blank">vmvrlcm-104-31.eng.vmware.com</a>]<br>
> command id:[10] type:[INFORM I AM GOING DOWN] state:[IN NETWORK TROUBLE]<br>
> 2023-04-17T15:27:25.191894+00:00 vmvrlcm-104-32 g[3042]: [282-1] 2023-04-17<br>
> 15:27:25: pid 3042: DEBUG: sending watchdog packet to socket:8, type:[X],<br>
> command ID:10, data Length:0<br>
> 2023-04-17T15:27:25.191952+00:00 vmvrlcm-104-32 g[3042]: [283-1] 2023-04-17<br>
> 15:27:25: pid 3042: DEBUG: sending packet, watchdog node:[<br>
> <a href="http://vmvrlcm-104-33.eng.vmware.com:9999" rel="noreferrer" target="_blank">vmvrlcm-104-33.eng.vmware.com:9999</a> Linux <a href="http://vmvrlcm-104-33.eng.vmware.com" rel="noreferrer" target="_blank">vmvrlcm-104-33.eng.vmware.com</a>]<br>
> command id:[10] type:[INFORM I AM GOING DOWN] state:[IN NETWORK TROUBLE]<br>
> 2023-04-17T15:27:25.192001+00:00 vmvrlcm-104-32 g[3042]: [284-1] 2023-04-17<br>
> 15:27:25: pid 3042: DEBUG: sending watchdog packet to socket:9, type:[X],<br>
> command ID:10, data Length:0<br>
> 2023-04-17T15:27:25.192671+00:00 vmvrlcm-104-32 pgpool[3040]: [24-1]<br>
> 2023-04-17 15:27:25: pid 3040: DEBUG: reaper handler<br>
> 2023-04-17T15:27:25.192753+00:00 vmvrlcm-104-32 pgpool[3040]: [25-1]<br>
> 2023-04-17 15:27:25: pid 3040: DEBUG: watchdog child process with pid:<br>
> 3042 exit with FATAL ERROR. pgpool-II will be shutdown<br>
> 2023-04-17T15:27:25.192803+00:00 vmvrlcm-104-32 pgpool[3040]: [26-1]<br>
> 2023-04-17 15:27:25: pid 3040: LOG: watchdog child process with pid: 3042<br>
> exits with status 768<br>
> 2023-04-17T15:27:25.192864+00:00 vmvrlcm-104-32 pgpool[3040]: [27-1]<br>
> 2023-04-17 15:27:25: pid 3040: FATAL: watchdog child process exit with<br>
> fatal error. exiting pgpool-II<br>
> 2023-04-17T15:27:25.197530+00:00 vmvrlcm-104-32 ck[3157]: [23-1] 2023-04-17<br>
> 15:27:25: pid 3157: DEBUG: lifecheck child receives shutdown request<br>
> signal 2, forwarding to all children<br>
> 2023-04-17T15:27:25.197611+00:00 vmvrlcm-104-32 ck[3157]: [24-1] 2023-04-17<br>
> 15:27:25: pid 3157: DEBUG: lifecheck child receives fast shutdown request<br>
> 2023-04-17T15:27:25.197658+00:00 vmvrlcm-104-32 at sender[3159]: [148-1]<br>
> 2023-04-17 15:27:25: pid 3159: DEBUG: watchdog heartbeat sender child<br>
> receives shutdown request signal 2<br>
> 2023-04-17T15:27:25.197994+00:00 vmvrlcm-104-32 at sender[3163]: [148-1]<br>
> 2023-04-17 15:27:25: pid 3163: DEBUG: watchdog heartbeat sender child<br>
> receives shutdown request signal 2<br>
> 2023-04-17T15:27:25.199168+00:00 vmvrlcm-104-32 at receiver[3161]: [18-1]<br>
> 2023-04-17 15:27:25: pid 3161: DEBUG: watchdog heartbeat receiver child<br>
> receives shutdown request signal 2<br>
> 2023-04-17T15:27:25.199567+00:00 vmvrlcm-104-32 at receiver[3158]: [18-1]<br>
> 2023-04-17 15:27:25: pid 3158: DEBUG: watchdog heartbeat receiver child<br>
> receives shutdown request signal 2<br>
> 2023-04-17T15:27:25.448554+00:00 vmvrlcm-104-32 check process(2)[3197]:<br>
> [386-1] 2023-04-17 15:27:25: pid 3197: DEBUG: health check: clearing alarm<br>
> 2023-04-17T15:27:25.448689+00:00 vmvrlcm-104-32 check process(2)[3197]:<br>
> [387-1] 2023-04-17 15:27:25: pid 3197: DEBUG: SSL is requested but SSL<br>
> support is not available<br>
> 2023-04-17T15:27:25.450621+00:00 vmvrlcm-104-32 check process(2)[3197]:<br>
> [388-1] 2023-04-17 15:27:25: pid 3197: DEBUG: authenticate kind = 5<br>
> 2023-04-17T15:27:25.451892+00:00 vmvrlcm-104-32 check process(2)[3197]:<br>
> [389-1] 2023-04-17 15:27:25: pid 3197: DEBUG: authenticate backend: key<br>
> data received<br>
> 2023-04-17T15:27:25.451987+00:00 vmvrlcm-104-32 check process(2)[3197]:<br>
> [390-1] 2023-04-17 15:27:25: pid 3197: DEBUG: authenticate backend:<br>
> transaction state: I<br>
> 2023-04-17T15:27:25.452043+00:00 vmvrlcm-104-32 check process(2)[3197]:<br>
> [391-1] 2023-04-17 15:27:25: pid 3197: DEBUG: health check: clearing alarm<br>
> 2023-04-17T15:27:25.452096+00:00 vmvrlcm-104-32 check process(2)[3197]:<br>
> [392-1] 2023-04-17 15:27:25: pid 3197: DEBUG: health check: clearing alarm<br>
> 2023-04-17T15:27:25.455020+00:00 vmvrlcm-104-32 check process(0)[3196]:<br>
> [386-1] 2023-04-17 15:27:25: pid 3196: DEBUG: health check: clearing alarm<br>
> 2023-04-17T15:27:25.455096+00:00 vmvrlcm-104-32 check process(0)[3196]:<br>
> [387-1] 2023-04-17 15:27:25: pid 3196: DEBUG: SSL is requested but SSL<br>
> support is not available<br>
> 2023-04-17T15:27:25.457196+00:00 vmvrlcm-104-32 check process(0)[3196]:<br>
> [388-1] 2023-04-17 15:27:25: pid 3196: DEBUG: authenticate kind = 5<br>
> 2023-04-17T15:27:25.458437+00:00 vmvrlcm-104-32 check process(0)[3196]:<br>
> [389-1] 2023-04-17 15:27:25: pid 3196: DEBUG: authenticate backend: key<br>
> data received<br>
> 2023-04-17T15:27:25.458556+00:00 vmvrlcm-104-32 check process(0)[3196]:<br>
> [390-1] 2023-04-17 15:27:25: pid 3196: DEBUG: authenticate backend:<br>
> transaction state: I<br>
> 2023-04-17T15:27:25.458674+00:00 vmvrlcm-104-32 check process(0)[3196]:<br>
> [391-1] 2023-04-17 15:27:25: pid 3196: DEBUG: health check: clearing alarm<br>
> 2023-04-17T15:27:25.458742+00:00 vmvrlcm-104-32 check process(0)[3196]:<br>
> [392-1] 2023-04-17 15:27:25: pid 3196: DEBUG: health check: clearing alarm<br>
> 2023-04-17T15:27:30.452427+00:00 vmvrlcm-104-32 check process(2)[3197]:<br>
> [393-1] 2023-04-17 15:27:30: pid 3197: DEBUG: health check: clearing alarm<br>
> 2023-04-17T15:27:30.454041+00:00 vmvrlcm-104-32 check process(2)[3197]:<br>
> [394-1] 2023-04-17 15:27:30: pid 3197: DEBUG: SSL is requested but SSL<br>
> support is not available<br>
> <br>
> ------------------<br>
> <br>
> After this, the log keeps looping over the 'clearing alarm' and 'SSL support<br>
> is not available' messages.<br>
> <br>
> The watchdog settings that I believe are relevant are:<br>
> ----------------------------------------------------------------------------------------<br>
> wd_hostname = '10.108.104.31'<br>
> wd_lifecheck_method = 'heartbeat'<br>
> wd_interval = 10<br>
> wd_heartbeat_keepalive = 2<br>
> wd_heartbeat_deadtime = 30<br>
> heartbeat_destination0 = '10.108.104.32'<br>
> heartbeat_device0 = ''<br>
> heartbeat_destination1 = '10.108.104.33'<br>
> heartbeat_device1 = ''<br>
> wd_monitoring_interfaces_list = 'any'<br>
<br>
The logs above are DEBUG messages, and I don't think they caused this issue.<br>
Do these DEBUG messages appear only when you restart the network?<br>
<br>
If you are using watchdog, you also need to configure the following parameters:<br>
<br>
heartbeat_destination_port0<br>
heartbeat_destination_port1<br>
other_pgpool_hostname0<br>
other_pgpool_port0<br>
other_pgpool_hostname1<br>
other_pgpool_port1<br>
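<br>
One way to confirm these are set is a small check over the contents of pgpool.conf. The following is only a sketch: the parser is a simplification (it treats everything after '#' as a comment), and the helper names are hypothetical rather than part of Pgpool-II.<br>

```python
# Parameters listed above as required when watchdog is in use.
REQUIRED_WATCHDOG_PARAMS = [
    "heartbeat_destination_port0",
    "heartbeat_destination_port1",
    "other_pgpool_hostname0",
    "other_pgpool_port0",
    "other_pgpool_hostname1",
    "other_pgpool_port1",
]


def parse_conf(text):
    """Parse 'name = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        # Simplification: anything after '#' is treated as a comment.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().strip("'")
    return settings


def missing_params(text, required=REQUIRED_WATCHDOG_PARAMS):
    """Return required parameters that are absent or set to an empty value."""
    settings = parse_conf(text)
    return [name for name in required if not settings.get(name)]


if __name__ == "__main__":
    sample = "\n".join([
        "heartbeat_destination_port0 = 9694",
        "heartbeat_destination_port1 = 9694",
        "other_pgpool_hostname0 = '10.108.104.32'",
        "other_pgpool_port0 = 9999",
        "other_pgpool_hostname1 = ''",
    ])
    print(missing_params(sample))
```

An empty list means every parameter in the list above has a non-empty value.<br>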
<br>
Regards,<br>
<br>
-- <br>
Bo Peng <<a href="mailto:pengbo@sraoss.co.jp" target="_blank">pengbo@sraoss.co.jp</a>><br>
SRA OSS LLC<br>
<a href="https://www.sraoss.co.jp/" rel="noreferrer" target="_blank">https://www.sraoss.co.jp/</a><br>
</blockquote></div>