[pgpool-hackers: 4246] Re: Issue with failover_require_consensus
Tatsuo Ishii
ishii at sraoss.co.jp
Sat Dec 17 21:01:19 JST 2022
>>> On Tue, Nov 29, 2022 at 3:27 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>>>
>>>> >> Hi Ishii-San
>>>> >>
>>>> >> Sorry for the delayed response.
>>>> >
>>>> > No problem.
>>>> >
>>>> >> With the attached fix I guess the failover objects will linger on
>>>> forever
>>>> >> in case of a false alarm by a health check or small glitch.
>>>> >
>>>> > That's not good.
>>>> >
>>>> >> One way to get around the issue could be to compute
>>>> >> FAILOVER_COMMAND_FINISH_TIMEOUT based on the maximum value
>>>> >> of health_check_period across the cluster,
>>>> >> something like: failover_command_finish_timeout =
>>>> >> max(health_check_period) * 2 = 60
>>>>
>>>> After thinking more, I think we need to take into account
>>>> health_check_max_retries and health_check_retry_delay as
>>>> well, i.e. instead of max(health_check_period), something like:
>>>> max(health_check_period + (health_check_retry_delay *
>>>> health_check_max_retries)).
>>>>
>>>> What do you think?
>>>>
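(Just to make the proposed calculation concrete with illustrative numbers: if the slowest node had, say, health_check_period = 300, health_check_retry_delay = 10 and health_check_max_retries = 3, the timeout would become 300 + 10 * 3 = 330 seconds. These values are only an example.)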
>>>
>>> Thanks for the valuable suggestions.
>>> Can you try out the attached patch to see if it solves the issue?
>>
>> Unfortunately the patch did not pass my test case.
>>
>> - 3 watchdog nodes and 2 PostgreSQL servers, streaming replication
>> cluster (created by watchdog_setup). pgpool0 is the watchdog leader.
>>
>> - health_check_period = 300, health_check_max_retries = 0
>>
>> - pgpool1 starts 120 seconds after pgpool0 starts
>>
>> - pgpool2 does not start
>>
>> - after the watchdog cluster becomes ready, shut down PostgreSQL node 1 (standby).
>>
>> - wait 600 seconds, expecting a failover to happen.
>>
>> Unfortunately failover did not happen.
>>
>> Attached is the test script and pgpool0 log.
>>
>> To run the test:
>>
>> - unpack test.tar.gz
>>
>> - run prepare.sh
>> $ sh prepare.sh
>> This should create a "testdir" directory containing a 3 watchdog node + 2 PostgreSQL node cluster.
>>
>> - cd testdir and run the test
>> $ sh ../start.sh -o 120
>> This will start the test; "-o" specifies how long to wait before starting pgpool1.
>
> After the test failure, I examined the pgpool log on the pgpool leader
> node (node 0). It seems the timeout was not updated as expected.
>
> 2022-12-17 08:07:11.419: watchdog pid 707483: LOG: failover request from 1 nodes with ID:42 is expired
> 2022-12-17 08:07:11.419: watchdog pid 707483: DETAIL: marking the failover object for removal. timeout: 15
>
> After looking into the code, I found update_failover_timeout() only
> examines "health_check_period". I think you need to examine
> "health_check_period0" etc. as well and find the larget one for the
> timeout caliculation.
>
> By the way,
>
>> failover_command_timout
>> g_cluster.failover_command_timout
>
> I think "timout" should be "timeout".
I was trying to create a proof of concept patch for this:
> After looking into the code, I found update_failover_timeout() only
> examines "health_check_period". I think you need to examine
> "health_check_period0" etc. as well and find the larget one for the
> timeout caliculation.
and noticed that update_failover_timeout() is called by
standard_packet_processor() when a WD_POOL_CONFIG_DATA packet is
received, like this: update_failover_timeout(wdNode, standby_config);
However, standby_config is created by get_pool_config_from_json(),
which does not seem to create health_check_params in the pool_config
data. Also, I wonder why update_failover_timeout() needs to be called
by standard_packet_processor() when WD_POOL_CONFIG_DATA is received.
Can't we simply omit the call to update_failover_timeout() in this
case?
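
For what it is worth, here is a rough, standalone sketch of the
per-node calculation I have in mind. The struct and field names are
only placeholders for illustration and do not necessarily match the
real pool_config / health_check_params definitions in the source tree:

	/*
	 * Rough sketch only: placeholder types, not the real pgpool-II
	 * definitions.  The idea is to take, over all backend nodes, the
	 * largest value of
	 *   health_check_period + health_check_retry_delay * health_check_max_retries
	 * and use that as the failover command finish timeout.
	 */
	#include <stdio.h>

	typedef struct
	{
		int		health_check_period;
		int		health_check_retry_delay;
		int		health_check_max_retries;
	} HealthCheckParamsSketch;

	static int
	calc_failover_finish_timeout(const HealthCheckParamsSketch *params, int num_backends)
	{
		int		max_timeout = 0;
		int		i;

		for (i = 0; i < num_backends; i++)
		{
			/* per-node contribution: period + retry_delay * max_retries */
			int		t = params[i].health_check_period +
				params[i].health_check_retry_delay * params[i].health_check_max_retries;

			if (t > max_timeout)
				max_timeout = t;
		}
		return max_timeout;
	}

	int
	main(void)
	{
		/* illustrative per-node settings (health_check_period0 etc.) */
		HealthCheckParamsSketch params[] = {
			{300, 0, 0},		/* node 0 */
			{30, 10, 3},		/* node 1 */
		};

		printf("failover finish timeout = %d\n",
			   calc_failover_finish_timeout(params, 2));
		return 0;
	}

With the settings used in the test above (health_check_period = 300,
health_check_max_retries = 0 on both nodes) this would give 300
seconds, well above the 15 second timeout reported in the pgpool0 log.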
Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp