[pgpool-hackers: 4249] Re: Issue with failover_require_consensus

Muhammad Usama muhammad.usama at percona.com
Mon Dec 19 16:40:38 JST 2022


Hi
Sorry for the inconvenience. I guess I attached the wrong patch file
previously.
Please find the latest one that also changes the DEBUG to LOG, as suggested.

Thanks
best Regards
Muhammad Usama


On Mon, Dec 19, 2022 at 9:07 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> > Hi Ishii San
> >
> > Thanks for the feedback. Please find the updated patch that takes care of
> > transferring the per node health check params through WD_POOL_CONFIG_DATA
> > and use those in calculations.
> > Could you verify if the attached fixes the issue?
>
> Sure.
>
> The patch adds trailing spaces.
>
> /home/t-ishii/fix_failover_command_timout_v2.diff:68: trailing whitespace.
>                         int pn_failover_command_timeout =
> pool_config->health_check_params[i].health_check_period +
> /home/t-ishii/fix_failover_command_timout_v2.diff:69: trailing whitespace.
>
> (pool_config->health_check_params[i].health_check_retry_delay *
> /home/t-ishii/fix_failover_command_timout_v2.diff:118: trailing whitespace.
>                         health_check_params_count =
> config->backend_desc->num_backends;
> warning: 3 lines add whitespace errors.
>
> There is a compile error:
> watchdog.c:8203:10: error: ‘DEBUG’ undeclared (first use in this
> function); did you mean ‘DEBUG1’?
>  8203 |  ereport(DEBUG,(errmsg("Setting failover command timeout to
> %d",failover_command_timeout)));
>       |          ^~~~~
>
> We'd better to use "LOG", instead DEBUG* here? Because:
> - the log message is not frequent
> - the timeout value is an important information
>
> What do you think?
>
> > As for standard_packet_processor() is concerned, it Is intended to handle
> > messages
> > that do not require to be handled differently for each watchdog state.
>
> Ok. Thank you for the explanation.
>
> Best reagards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> > Thanks
> > Best Regards
> > Muhammad Usama
> >
> >
> > On Sat, Dec 17, 2022 at 5:01 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> >
> >> >>> On Tue, Nov 29, 2022 at 3:27 AM Tatsuo Ishii <ishii at sraoss.co.jp>
> >> wrote:
> >> >>>
> >> >>>> >> Hi Ishii-San
> >> >>>> >>
> >> >>>> >> Sorry for the delayed response.
> >> >>>> >
> >> >>>> > No problem.
> >> >>>> >
> >> >>>> >> With the attached fix I guess the failover objects will linger
> on
> >> >>>> forever
> >> >>>> >> in case of a false alarm by a health check or small glitch.
> >> >>>> >
> >> >>>> > That's not good.
> >> >>>> >
> >> >>>> >> One way to get around the issue could be to compute
> >> >>>> >> FAILOVER_COMMAND_FINISH_TIMEOUT based on the maximum value
> >> >>>> >> of health_check_peroid across the cluster.
> >> >>>> >> something like: failover_command_finish_timouut =
> >> >>>> max(health_check_period)
> >> >>>> >> * 2 = 60
> >> >>>>
> >> >>>> After thinking more, I think we need to take account
> >> >>>> health_check_max_retries and health_check_retry_delay as
> >> >>>> well. i.e. instead of max(health_check_period), something like:
> >> >>>> max(health_check_period + (health_check_retry_delay *
> >> >>>> health_check_max_retries)).
> >> >>>>
> >> >>>> What do you think?
> >> >>>>
> >> >>>
> >> >>> Thanks for the valuable suggestions.
> >> >>> Can you try out the attached patch to see if it solves the issue?
> >> >>
> >> >> Unfortunately the patch did not pass my test case.
> >> >>
> >> >> - 3 watchdog nodes and 2 PostgreSQL servers, streaming replication
> >> >>   cluster (created by watchdog_setup). pgpool0 is the watchdog
> leader.
> >> >>
> >> >> - health_check_period = 300, health_check_max_retries = 0
> >> >>
> >> >> - pgpool1 starts 120 seconds after pgpool0 starts
> >> >>
> >> >> - pgpool2 does not start
> >> >>
> >> >> - after watchdog cluster becomes ready, shutdown PostgreSQL node 1
> >> (standby).
> >> >>
> >> >> - wait for 600 seconds to expect a failover.
> >> >>
> >> >> Unfortunately failover did not happen.
> >> >>
> >> >> Attached is the test script and pgpool0 log.
> >> >>
> >> >> To run the test:
> >> >>
> >> >> - unpack test.tar.gz
> >> >>
> >> >> - run prepare.sh
> >> >>   $ sh prepare.sh
> >> >>   This should create "testdir" directory with 3 watchdog node +
> >> PostgreSQL 2 node cluster.
> >> >>
> >> >> - cd testdir and run the test
> >> >>   $ sh ../start.sg -o 120
> >> >>   This will start the test, "-o" specifies how long wait before
> >> strating pgpool1.
> >> >
> >> > After the test failure, I examined the pgpool log on the pgpool leader
> >> > node (node 0). It seems timeout was not updated as expected.
> >> >
> >> > 2022-12-17 08:07:11.419: watchdog pid 707483: LOG:  failover request
> >> from 1 nodes with ID:42 is expired
> >> > 2022-12-17 08:07:11.419: watchdog pid 707483: DETAIL:  marking the
> >> failover object for removal. timeout: 15
> >> >
> >> > After looking into the code, I found update_failover_timeout() only
> >> > examines "health_check_period".  I think you need to examine
> >> > "health_check_period0" etc. as well and find the larget one for the
> >> > timeout caliculation.
> >> >
> >> > By the way,
> >> >
> >> >> failover_command_timout
> >> >> g_cluster.failover_command_timout
> >> >
> >> > I think "timout" should be "timeout".
> >>
> >> I was trying to create a proof of concept patch for this:
> >>
> >> > After looking into the code, I found update_failover_timeout() only
> >> > examines "health_check_period".  I think you need to examine
> >> > "health_check_period0" etc. as well and find the larget one for the
> >> > timeout caliculation.
> >>
> >> and noticed that update_failover_timeout() is called by
> >> standard_packet_processor() when WD_POOL_CONFIG_DATA packet is
> >> received like: update_failover_timeout(wdNode, standby_config); But
> >> standby_config was created by get_pool_config_from_json() which does
> >> not seem to create health_check_params in pool_config data. Also I
> >> wonder why standard_packet_processor() needs to be called when
> >> WD_POOL_CONFIG_DATA is recieved. Can't we simply omit the call to
> >> update_failover_timeout() in this case?
> >>
> >> Best reagards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS LLC
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese:http://www.sraoss.co.jp
> >>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.pgpool.net/pipermail/pgpool-hackers/attachments/20221219/24dac875/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix_failover_command_timout_v3.diff
Type: application/octet-stream
Size: 7404 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-hackers/attachments/20221219/24dac875/attachment-0001.obj>


More information about the pgpool-hackers mailing list