[pgpool-hackers: 3519] Re: Proposal: health check statistics
Tatsuo Ishii
ishii at sraoss.co.jp
Tue Feb 25 17:28:13 JST 2020
I have added a dedicated PCP command and a pgpool_adm extension function.
See the manuals for more details:
http://tatsuo-ishii.github.io/pgpool-II/current/pcp-health-check-stats.html
http://tatsuo-ishii.github.io/pgpool-II/current/pgpool-adm-pcp-health-check-stats.html
> Pushed with bug fixes and enhancements along with an SGML document for
> new "show pool_health_check_stats".
>
>> Ok, here is the first cut of this work (see attached patches). I
>> implemented a "show pool_health_check_stats;" command to show the health
>> check statistics stored in shared memory. Below is a sample session.
>> There are two backend nodes, 0 and 1. Node 1 was shut down and a failover
>> happened. Then it automatically failed back because auto_failback =
>> on. Facts you can see from it include:
>>
>> - node 0's last_failed_health_check is empty because no health check
>> has failed on node 0.
>>
>> - node 0's last_skip_health_check is also empty, because health checks
>> are skipped only for a node that is down, and node 0 never went
>> down.
>>
>> - on node 1 fail_count = 1 as failover happened once.
>>
>> - on node 1 skip_count = 7, which means health checks were skipped
>> 7 times until node 1 failed back.
>>
>> - on node 1 retry_count is 4, which means the health check was retried
>> 4 times before node 1 was judged to have failed.
>>
>> - the duration of each health check is recorded for nodes 0 and 1. The
>> max duration on node 0 is 657 ms while the min duration is 557 ms.
>>
>> - on node 1, the health check was last performed at 20:38:22, but it was
>> actually skipped because last_skip_health_check records the same
>> time. The health check triggered the failover at 20:37:12. Indeed,
>> the log contains the following line with the same timestamp:
>>
>> 2020-01-25 20:37:12: pid 29238: LOG: health check failed on node 1 (timeout:0)
>>
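The correlation between the log line and last_failed_health_check can be checked mechanically. Below is a minimal sketch in Python, assuming the log layout is exactly as in the line quoted above (the real layout depends on the logging configuration). The helper name parse_failed_health_check and the regular expression are illustrative and are not part of Pgpool-II.

```python
import re
from datetime import datetime

# Log line quoted in the message above.
LOG_LINE = ("2020-01-25 20:37:12: pid 29238: LOG: "
            "health check failed on node 1 (timeout:0)")

# Assumed layout (taken from the sample line only):
# "<timestamp>: pid <pid>: LOG: health check failed on node <id> (timeout:<n>)"
FAILED_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): pid (?P<pid>\d+): "
    r"LOG: +health check failed on node (?P<node>\d+) "
    r"\(timeout:(?P<timeout>\d+)\)"
)

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_failed_health_check(line):
    """Return the fields of a 'health check failed' log line, or None."""
    m = FAILED_RE.match(line)
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("ts"), TS_FORMAT),
        "pid": int(m.group("pid")),
        "node": int(m.group("node")),
        "timeout": int(m.group("timeout")),
    }


event = parse_failed_health_check(LOG_LINE)
# last_failed_health_check reported for node 1 in the sample session.
last_failed = datetime.strptime("2020-01-25 20:37:12", TS_FORMAT)
print(event["node"], event["time"] == last_failed)
```

If the timestamps match, the failed health check recorded in the statistics is the same event that was written to the log.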
>> test=# show pool_health_check_stats;
>> -[ RECORD 1 ]----------------+--------------------
>> node_id                      | 0
>> hostname                     | /tmp
>> port                         | 11002
>> status                       | up
>> last_status_change           | 2020-01-25 20:36:03
>> total_count                  | 15
>> success_count                | 15
>> fail_count                   | 0
>> skip_count                   | 0
>> retry_count                  | 0
>> average_retry_count          | 0.000000
>> max_retry_count              | 0
>> max_duration                 | 657
>> min_duration                 | 557
>> average_duration             | 606.800000
>> last_health_check            | 2020-01-25 20:38:19
>> last_successful_health_check | 2020-01-25 20:38:19
>> last_skip_health_check       |
>> last_failed_health_check     |
>> -[ RECORD 2 ]----------------+--------------------
>> node_id                      | 1
>> hostname                     | /tmp
>> port                         | 11003
>> status                       | waiting
>> last_status_change           | 2020-01-25 20:38:22
>> total_count                  | 15
>> success_count                | 7
>> fail_count                   | 1
>> skip_count                   | 7
>> retry_count                  | 4
>> average_retry_count          | 0.266667
>> max_retry_count              | 4
>> max_duration                 | 623
>> min_duration                 | 557
>> average_duration             | 593.000000
>> last_health_check            | 2020-01-25 20:38:22
>> last_successful_health_check | 2020-01-25 20:36:59
>> last_skip_health_check       | 2020-01-25 20:38:22
>> last_failed_health_check     | 2020-01-25 20:37:12
>>
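The counters in the sample session appear to be related in a simple way: in both records total_count equals success_count + fail_count + skip_count, and average_retry_count equals retry_count / total_count. These relationships are inferred from the sample output only, not from the source code, so treat them as assumptions. A minimal Python sketch that parses the expanded-display output and checks them (parse_expanded and is_consistent are hypothetical helper names):

```python
import re

# Subset of the sample session output shown above.
SAMPLE = """\
-[ RECORD 1 ]----------------+--------------------
node_id | 0
total_count | 15
success_count | 15
fail_count | 0
skip_count | 0
retry_count | 0
average_retry_count | 0.000000
-[ RECORD 2 ]----------------+--------------------
node_id | 1
total_count | 15
success_count | 7
fail_count | 1
skip_count | 7
retry_count | 4
average_retry_count | 0.266667
"""


def parse_expanded(text):
    """Parse expanded-display output into a list of dicts, one per record."""
    records = []
    for line in text.splitlines():
        if re.match(r"^-\[ RECORD \d+ \]", line):
            records.append({})  # a record header starts a new record
        elif "|" in line and records:
            key, _, value = line.partition("|")
            records[-1][key.strip()] = value.strip()
    return records


def is_consistent(rec, tolerance=5e-6):
    """Check the assumed relationships between the counters of one record."""
    total = int(rec["total_count"])
    parts = (int(rec["success_count"]) + int(rec["fail_count"])
             + int(rec["skip_count"]))
    # Assumption: the average is retries per health check, over all checks.
    avg_retry = int(rec["retry_count"]) / total if total else 0.0
    reported = float(rec["average_retry_count"])
    return parts == total and abs(avg_retry - reported) < tolerance


records = parse_expanded(SAMPLE)
for rec in records:
    print(rec["node_id"], is_consistent(rec))
```

For node 1 this amounts to 7 + 1 + 7 = 15 and 4 / 15 = 0.266667 after rounding to six decimals.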
>> BTW, I am not sure whether the following should be included in this work:
>>
>>> - cause of the status change (failover, failback etc.)
>>> - last 10 status change timestamps and the status at each time ("10" should be configurable)
>>
>> Because they are best handled by the failover process. I would like to
>> focus on health check statistics.
>>
>> Next work will be:
>>
>> - More tests.
>> - Implement PCP command.
>> - Implement pgpool_adm function.
>> - Write documents.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp