[pgpool-hackers: 4245] Re: Issue with failover_require_consensus

Sat Dec 17 11:11:35 JST 2022

>> On Tue, Nov 29, 2022 at 3:27 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> 
>>> >> Hi Ishii-San
>>> >>
>>> >> Sorry for the delayed response.
>>> >
>>> > No problem.
>>> >
>>> >> With the attached fix I guess the failover objects will linger on
>>> >> forever in case of a false alarm by a health check or small glitch.
>>> >
>>> > That's not good.
>>> >
>>> >> One way to get around the issue could be to compute
>>> >> FAILOVER_COMMAND_FINISH_TIMEOUT based on the maximum value
>>> >> of health_check_period across the cluster.
>>> >> Something like: failover_command_finish_timeout =
>>> >> max(health_check_period) * 2 = 60
>>>
>>> After thinking more, I think we need to take
>>> health_check_max_retries and health_check_retry_delay into account as
>>> well, i.e. instead of max(health_check_period), something like:
>>> max(health_check_period + (health_check_retry_delay *
>>> health_check_max_retries)).
>>>
>>> What do you think?
>>>
>> 
>> Thanks for the valuable suggestions.
>> Can you try out the attached patch to see if it solves the issue?
> 
> Unfortunately the patch did not pass my test case.
> 
> - 3 watchdog nodes and 2 PostgreSQL servers, streaming replication
>   cluster (created by watchdog_setup). pgpool0 is the watchdog leader.
> 
> - health_check_period = 300, health_check_max_retries = 0
> 
> - pgpool1 starts 120 seconds after pgpool0 starts
> 
> - pgpool2 does not start
> 
> - after watchdog cluster becomes ready, shutdown PostgreSQL node 1 (standby).
> 
> - wait for 600 seconds to expect a failover.
> 
> Unfortunately failover did not happen.
> 
> Attached is the test script and pgpool0 log.
> 
> To run the test:
> 
> - unpack test.tar.gz
> 
> - run prepare.sh
>   $ sh prepare.sh
>   This should create a "testdir" directory with a 3-node watchdog + 2-node PostgreSQL cluster.
> 
> - cd testdir and run the test
>   $ sh ../start.sg -o 120
>   This will start the test; "-o" specifies how long to wait before starting pgpool1.

After the test failure, I examined the pgpool log on the pgpool leader
node (node 0). It seems the timeout was not updated as expected.

2022-12-17 08:07:11.419: watchdog pid 707483: LOG:  failover request from 1 nodes with ID:42 is expired
2022-12-17 08:07:11.419: watchdog pid 707483: DETAIL:  marking the failover object for removal. timeout: 15

After looking into the code, I found that update_failover_timeout() only
examines "health_check_period".  I think you need to examine the
per-node parameters ("health_check_period0" etc.) as well and use the
largest one for the timeout calculation.
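
To make the intended calculation concrete, here is a rough sketch in
Python (not the actual C code in the patch). The function name, the
dict layout and the "floor" argument are made up for illustration; the
floor of 15 is only an assumption taken from the "timeout: 15" in the
log above. It assumes a per-node parameter that is not set falls back
to the global one, and it leaves out any extra multiplier such as the
"* 2" in the earlier proposal.

```python
# Illustrative sketch only: these names are hypothetical, not pgpool-II internals.

KEYS = (
    "health_check_period",
    "health_check_max_retries",
    "health_check_retry_delay",
)


def failover_timeout(global_cfg, per_node_cfg, floor=15):
    """Return the largest worst-case health check cycle across all backends.

    global_cfg:   dict holding the un-suffixed parameters.
    per_node_cfg: one dict per backend holding that node's overrides
                  (the values of health_check_period0, health_check_period1,
                  ... with the numeric suffix stripped). A missing key is
                  assumed to fall back to the global value.
    floor:        assumed minimum timeout in seconds.
    """
    worst = 0
    for override in per_node_cfg:
        period, retries, delay = (override.get(k, global_cfg[k]) for k in KEYS)
        # period + (retry_delay * max_retries), as proposed earlier in the thread
        worst = max(worst, period + delay * retries)
    return max(worst, floor)


if __name__ == "__main__":
    global_cfg = {
        "health_check_period": 30,
        "health_check_max_retries": 0,
        "health_check_retry_delay": 1,
    }
    # Backend 0 uses the global values; backend 1 overrides the period.
    print(failover_timeout(global_cfg, [{}, {"health_check_period": 300}]))
```

With only the global value examined, the result above would be 30; once
the per-node override for backend 1 is taken into account it is 300.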

By the way,

> failover_command_timout
> g_cluster.failover_command_timout

I think "timout" should be "timeout".

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp