[pgpool-hackers: 4245] Re: Issue with failover_require_consensus
Tatsuo Ishii
ishii at sraoss.co.jp
Sat Dec 17 11:11:35 JST 2022
>> On Tue, Nov 29, 2022 at 3:27 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>>
>>> >> Hi Ishii-San
>>> >>
>>> >> Sorry for the delayed response.
>>> >
>>> > No problem.
>>> >
>>> >> With the attached fix I guess the failover objects will linger on forever
>>> >> in case of a false alarm by a health check or small glitch.
>>> >
>>> > That's not good.
>>> >
>>> >> One way to get around the issue could be to compute
>>> >> FAILOVER_COMMAND_FINISH_TIMEOUT based on the maximum value
>>> >> of health_check_period across the cluster.
>>> >> Something like: failover_command_finish_timeout = max(health_check_period) * 2 = 60
>>>
>>> After thinking more, I think we need to take
>>> health_check_max_retries and health_check_retry_delay into account as
>>> well, i.e. instead of max(health_check_period), something like:
>>> max(health_check_period + (health_check_retry_delay *
>>> health_check_max_retries)).
>>>
>>> What do you think?
>>>
>>
>> Thanks for the valuable suggestions.
>> Can you try out the attached patch to see if it solves the issue?
>
> Unfortunately the patch did not pass my test case.
>
> - 3 watchdog nodes and 2 PostgreSQL servers, streaming replication
> cluster (created by watchdog_setup). pgpool0 is the watchdog leader.
>
> - health_check_period = 300, health_check_max_retries = 0
>
> - pgpool1 starts 120 seconds after pgpool0 starts
>
> - pgpool2 does not start
>
> - after the watchdog cluster becomes ready, shut down PostgreSQL node 1 (standby).
>
> - wait 600 seconds for a failover to happen.
>
> Unfortunately failover did not happen.
>
> Attached is the test script and pgpool0 log.
>
> To run the test:
>
> - unpack test.tar.gz
>
> - run prepare.sh
> $ sh prepare.sh
> This should create a "testdir" directory with a 3-node watchdog + 2-node PostgreSQL cluster.
>
> - cd testdir and run the test
> $ sh ../start.sg -o 120
> This will start the test; "-o" specifies how long to wait before starting pgpool1.
After the test failure, I examined the pgpool log on the pgpool leader
node (node 0). It seems the timeout was not updated as expected:
2022-12-17 08:07:11.419: watchdog pid 707483: LOG: failover request from 1 nodes with ID:42 is expired
2022-12-17 08:07:11.419: watchdog pid 707483: DETAIL: marking the failover object for removal. timeout: 15
After looking into the code, I found update_failover_timeout() only
examines "health_check_period". I think you need to examine
"health_check_period0" etc. as well and find the largest one for the
timeout calculation.
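
To illustrate what I mean, here is a rough sketch in Python (not the
actual C code in pgpool). The function name, the dict layout and the
floor of 15 seconds are only assumptions for illustration; the 15 is
simply the value shown in the log above. It applies the formula
discussed earlier in this thread to each backend node, letting a
per-node setting such as "health_check_period0" override the global
one, and keeps the largest result.

```python
# Hypothetical sketch, not pgpool-II source code.
# All names here are illustrative assumptions.

KEYS = (
    "health_check_period",
    "health_check_retry_delay",
    "health_check_max_retries",
)


def effective_failover_timeout(global_cfg, per_node_cfgs, floor=15):
    """Return the largest worst-case health check duration over all nodes.

    global_cfg    : dict of global health check settings
    per_node_cfgs : list of dicts, one per backend node, holding only the
                    settings overridden for that node (may be empty)
    floor         : assumed minimum timeout in seconds
    """
    worst = floor
    for node_cfg in per_node_cfgs:
        # A per-node value (e.g. health_check_period0) wins over the global one.
        merged = {k: node_cfg.get(k, global_cfg.get(k, 0)) for k in KEYS}
        duration = (
            merged["health_check_period"]
            + merged["health_check_retry_delay"]
            * merged["health_check_max_retries"]
        )
        worst = max(worst, duration)
    return worst


# Settings from the test case: only global values, two backend nodes.
test_case = effective_failover_timeout(
    {"health_check_period": 300, "health_check_max_retries": 0},
    [{}, {}],
)
print(test_case)
```

With the test case settings this gives 300 rather than 15. Whether the
result should then be multiplied by 2, as in the first proposal, is a
separate question that the sketch leaves out.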
By the way,
> failover_command_timout
> g_cluster.failover_command_timout
I think "timout" should be "timeout".
Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp