[pgpool-hackers: 196] Re: pgpool health check failsafe mechanism
Asif Naeem
anaeem.it at gmail.com
Thu Apr 11 07:12:29 JST 2013
Thank you Tatsuo. I would still say it "will go in a never ending loop" if any
slave stops responding (until it is alive again), as was observed earlier,
i.e.
pgpool.log
> ....
> 2013-04-04 12:34:41 DEBUG: pid 44263: retrying *10867* th health checking
> 2013-04-04 12:34:41 DEBUG: pid 44263: health_check: 0 th DB node status: 2
> 2013-04-04 12:34:41 DEBUG: pid 44263: pool_ssl: SSL requested but SSL support is not available
> 2013-04-04 12:34:41 DEBUG: pid 44263: s_do_auth: auth kind: 0
> 2013-04-04 12:34:41 DEBUG: pid 44263: s_do_auth: backend key data received
> 2013-04-04 12:34:41 DEBUG: pid 44263: s_do_auth: transaction state: I
> 2013-04-04 12:34:41 DEBUG: pid 44263: health_check: 1 th DB node status: 2
> 2013-04-04 12:34:41 ERROR: pid 44263: connect_inet_domain_socket: getsockopt() detected error: Connection refused
> 2013-04-04 12:34:41 ERROR: pid 44263: make_persistent_db_connection: connection to localhost(7445) failed
> 2013-04-04 12:34:41 ERROR: pid 44263: health check failed. 1 th host localhost at port 7445 is down
> 2013-04-04 12:34:41 LOG: pid 44263: health_check: 1 failover is canceld because failover is disallowed
> ....
> ....
As far as I understand from discussing it with you, this is a feature, not a
bug. In the presented scenario, if any slave goes down or becomes unreachable
(maybe because of a network issue), pgpool will not respond to any new
connection (with no warning or message) until that slave is available/up
again. Do you agree? Thanks.
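
To make the claim concrete, here is a minimal, purely illustrative C sketch of
the loop I am describing (this is not the actual pgpool source code;
node_is_alive() and disallow_failover are made-up names for the example):

    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Illustrative only: models a health check that keeps retrying a dead
     * node because failover is disallowed, so the loop never terminates. */
    static bool node_is_alive(int node_id)
    {
        return node_id != 1;    /* pretend node 1 (the slave) never comes back */
    }

    int main(void)
    {
        const bool disallow_failover = true;   /* DISALLOW_TO_FAILOVER */
        const int health_check_period = 5;     /* seconds, as in pgpool.conf */
        long attempt = 0;

        for (;;)                               /* runs forever by design */
        {
            attempt++;
            if (!node_is_alive(1))
            {
                if (disallow_failover)
                {
                    /* failover is canceled; nothing else ever happens */
                    printf("retrying %ld th health checking\n", attempt);
                }
                else
                {
                    printf("detaching node 1\n");
                    break;                     /* a failover would end the loop */
                }
            }
            sleep(health_check_period);
        }
        return 0;
    }
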
Best Regards,
Asif Naeem
On Tue, Apr 9, 2013 at 5:05 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> Well, "will go in never ending loop" is a little bit incorrect
> statement. What happens here is, pgpool tries to fail over every
> health_check_period and it is canceled because DISALLOW_TO_FAILOVER
> flag was set. This particular set up has at least two use cases:
>
> - PostgreSQL is protected by Heartbeat/Pacemaker or other HA (High
> Availability) software. When a PostgreSQL server fails, that software is
> responsible for taking over the node with the standby PostgreSQL. Once
> PostgreSQL comes back up, pgpool will start to accept connections
> from clients again.
>
> - The admin wants to upgrade PostgreSQL immediately because of security
> issues (as with recent PostgreSQL releases). He stops the PostgreSQL
> servers one by one and upgrades them. While a PostgreSQL server is
> stopped, pgpool refuses to accept connections from clients, and database
> consistency among the database nodes is safely kept. This minimizes the
> downtime.
>
> In summary, I see no point in changing the current behavior of pgpool.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
> > Hi Tatsuo Ishii,
> >
> > By looking at the source code, it seems that the health check mechanism
> > depends on the failover options (fail_over_on_backend_error + backend_flag)
> > in non-parallel mode, and it goes into a never ending loop if failover is
> > disabled (as I mentioned earlier in Issue #3 of my first email), i.e.
> >
> > pgpool2/main.c
> >
> >> /* do we need health checking for PostgreSQL? */
> >> if (pool_config->health_check_period > 0)
> >> {
> >>     ...
> >>     ...
> >>     if (POOL_DISALLOW_TO_FAILOVER(BACKEND_INFO(sts).flag))
> >>     {
> >>         pool_log("health_check: %d failover is canceld because failover is disallowed", sts);
> >>     }
> >>     else if (retrycnt <= pool_config->health_check_max_retries)
> >>     ...
> >>     ...
> >> }
> >
> >
> > It seems that failover depends not only on the fail_over_on_backend_error
> > configuration option but on backend_flag as well. If
> > fail_over_on_backend_error is "on" but backend_flag is
> > "DISALLOW_TO_FAILOVER", failover will not be triggered for the related slave
> > node. On the other hand, if a child process finds a connection error for any
> > related node, it aborts. As you suggested earlier, it seems the only
> > appropriate thing to do, when a connection error to any related node is
> > found, is to fail over and restart all child processes.
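> >
> > Roughly, the decision can be pictured as the following simplified sketch.
> > This is only my paraphrase of the two gates, not the actual pgpool-II code;
> > the struct and function names are made up for illustration:
> >
> >> /* Illustrative paraphrase of the two gates that control failover;
> >>  * not the pgpool-II source. */
> >> #include <stdbool.h>
> >>
> >> typedef struct
> >> {
> >>     bool fail_over_on_backend_error;   /* pgpool.conf parameter */
> >>     bool disallow_to_failover;         /* backend_flagN of the node */
> >> } failover_policy;
> >>
> >> static bool should_degenerate(const failover_policy *p, bool backend_error)
> >> {
> >>     /* Gate 1: a backend read/write error only requests failover when
> >>      * the global parameter allows it. */
> >>     if (backend_error && !p->fail_over_on_backend_error)
> >>         return false;
> >>
> >>     /* Gate 2: even an allowed request is canceled when the node is
> >>      * marked DISALLOW_TO_FAILOVER, so the node is never detached. */
> >>     if (p->disallow_to_failover)
> >>         return false;
> >>
> >>     return true;
> >> }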
> >
> > In the example I mentioned earlier (Issue #3 in my first email), there is a
> > dead end: pgpool goes into an endless loop and becomes unresponsive to new
> > connections if we use the following configuration settings, i.e.
> >
> > pgpool.conf
> >
> >> fail_over_on_backend_error = on
> >> backend_flag0 = 'DISALLOW_TO_FAILOVER'
> >> backend_flag1 = 'DISALLOW_TO_FAILOVER'
> >> health_check_period = 5
> >> health_check_timeout = 1
> >> health_check_retry_delay = 10
> >
> >
> > On each new connection,
> > new_connection() -> notice_backend_error() -> degenerate_backend_set()
> > gives the following warning, i.e.
> >
> >> if (POOL_DISALLOW_TO_FAILOVER(BACKEND_INFO(node_id_set[i]).flag))
> >> {
> >>     pool_log("degenerate_backend_set: %d failover request from pid %d is canceld because failover is disallowed", node_id_set[i], getpid());
> >>     continue;
> >> }
> >
> >
> > As mentioned in the fail_over_on_backend_error documentation, failover can
> > happen even when fail_over_on_backend_error = off, if pgpool detects an
> > administrative shutdown of the postmaster, i.e.
> >
> > http://www.pgpool.net/docs/latest/pgpool-en.html
> >
> >> fail_over_on_backend_error V2.3 -
> >> If true, and an error occurs when reading/writing to the backend
> >> communication, pgpool-II will trigger the fail over procedure. If set to
> >> false, pgpool will report an error and disconnect the session. If you set
> >> this parameter to off, it is recommended that you turn on health checking.
> >> Please note that even if this parameter is set to off, however, pgpool will
> >> also do the fail over when pgpool detects the administrative shutdown of
> >> postmaster.
> >> You need to reload pgpool.conf if you change this value.
> >
> >
> > If failover/degeneration is the only option for handling the situation where
> > a slave node is unresponsive or has crashed, can the code be allowed to do a
> > failover on a connection error even when failover is disabled? Thanks.
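> >
> > Purely as an illustration of what I am asking for (this is not a patch, and
> > the enum/function names are made up), the idea is that a hard connection
> > failure could bypass DISALLOW_TO_FAILOVER while other errors still respect
> > it:
> >
> >> /* Hypothetical sketch only, not pgpool-II code. */
> >> #include <stdbool.h>
> >>
> >> typedef enum { ERROR_QUERY, ERROR_CONNECTION_REFUSED } backend_error_kind;
> >>
> >> static bool failover_allowed(bool disallow_flag, backend_error_kind kind)
> >> {
> >>     if (kind == ERROR_CONNECTION_REFUSED)
> >>         return true;           /* node is unreachable: detach it anyway */
> >>     return !disallow_flag;     /* otherwise honor DISALLOW_TO_FAILOVER */
> >> }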
> >
> > Best Regards,
> > Asif Naeem
> >
> > On Wed, Apr 3, 2013 at 11:43 AM, Asif Naeem <anaeem.it at gmail.com> wrote:
> >
> >> Hi,
> >>
> >> We are facing an issue with the pgpool health check failsafe mechanism in a
> >> production environment. I have previously posted this issue on
> >> http://www.pgpool.net/mantisbt/view.php?id=50. I have observed the following
> >> issues with pgpool-II version 3.2.3 (built from the latest source code), i.e.
> >>
> >> Versions used, i.e.
> >>
> >>> pgpool-II version 3.2.3
> >>> postgresql 9.2.3 (Master + Slave)
> >>
> >>
> >> 1. In a master/slave configuration, if health check and failover are
> >> enabled, i.e.
> >>
> >> pgpool.conf
> >>
> >>> backend_flag0 = 'ALLOW_TO_FAILOVER'
> >>> backend_flag1 = 'ALLOW_TO_FAILOVER'
> >>> health_check_period = 5
> >>> health_check_timeout = 1
> >>> health_check_max_retries = 2
> >>> health_check_retry_delay = 10
> >>> load_balance_mode = off
> >>
> >>
> >> On Linux64, the master server is running fine, load balancing is off, and
> >> suddenly a network interruption (or any other reason; I mimic the situation
> >> by forcefully shutting down the db server via immediate mode, etc.) leaves
> >> pgpool unable to connect to the slave server. After that, the first
> >> connection attempt to pgpool returns without an error/warning message, and
> >> pgpool does a failover and kills all child processes. Does it make sense
> >> that, when there is no load balancing and the master db server is serving
> >> queries well, a disconnection of the slave server triggers a failover?
> >>
> >> pgpool.log
> >>
> >>> ....
> >>> 2013-04-02 17:24:36 DEBUG: pid 65431: I am 65431 accept fd 6
> >>> 2013-04-02 17:24:36 DEBUG: pid 65431: read_startup_packet: application_name: psql
> >>> 2013-04-02 17:24:36 DEBUG: pid 65431: Protocol Major: 3 Minor: 0 database: postgres user: asif
> >>> 2013-04-02 17:24:36 DEBUG: pid 65431: new_connection: connecting 0 backend
> >>> 2013-04-02 17:24:36 DEBUG: pid 65431: new_connection: connecting 1 backend
> >>> 2013-04-02 17:24:36 ERROR: pid 65431: connect_inet_domain_socket: getsockopt() detected error: Connection refused
> >>> 2013-04-02 17:24:36 ERROR: pid 65431: connection to localhost(7445) failed
> >>> 2013-04-02 17:24:36 ERROR: pid 65431: new_connection: create_cp() failed
> >>> 2013-04-02 17:24:36 LOG: pid 65431: degenerate_backend_set: 1 fail over request from pid 65431
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler called
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: starting to select new master node
> >>> 2013-04-02 17:24:36 LOG: pid 65417: starting degeneration. shutdown host localhost(7445)
> >>> 2013-04-02 17:24:36 LOG: pid 65417: Restart all children
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65418
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65419
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65420
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65421
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65422
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65423
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65424
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65425
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65426
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65427
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65428
> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65429
> >>> ...
> >>> ...
> >>
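> >> The kill loop visible in the log above corresponds, very roughly, to a
> >> parent process killing and re-forking all of its workers, which is why
> >> every existing client session is dropped on failover. The following is
> >> only an illustrative sketch of that pattern (not the pgpool source; the
> >> helper name spawn_worker() is made up):
> >>
> >>> /* Illustrative sketch: on failover the parent kills every child
> >>>  * worker and forks replacements, dropping all client sessions. */
> >>> #include <signal.h>
> >>> #include <sys/types.h>
> >>> #include <unistd.h>
> >>>
> >>> #define NUM_CHILDREN 4
> >>>
> >>> static pid_t spawn_worker(void)
> >>> {
> >>>     pid_t pid = fork();
> >>>     if (pid == 0)
> >>>     {
> >>>         pause();              /* stand-in for the real child main loop */
> >>>         _exit(0);
> >>>     }
> >>>     return pid;
> >>> }
> >>>
> >>> int main(void)
> >>> {
> >>>     pid_t children[NUM_CHILDREN];
> >>>
> >>>     for (int i = 0; i < NUM_CHILDREN; i++)
> >>>         children[i] = spawn_worker();
> >>>
> >>>     /* ... failover detected: "Restart all children" ... */
> >>>     for (int i = 0; i < NUM_CHILDREN; i++)
> >>>     {
> >>>         kill(children[i], SIGQUIT);    /* "failover_handler: kill <pid>" */
> >>>         children[i] = spawn_worker();  /* fork a replacement worker */
> >>>     }
> >>>
> >>>     /* clean up the replacement workers before the sketch exits */
> >>>     for (int i = 0; i < NUM_CHILDREN; i++)
> >>>         kill(children[i], SIGQUIT);
> >>>     return 0;
> >>> }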
> >>
> >> 2. With the same configuration as before, if I disable failover, i.e.
> >>
> >> pgpool.conf
> >>
> >>> backend_flag0 = 'DISALLOW_TO_FAILOVER'
> >>> backend_flag1 = 'DISALLOW_TO_FAILOVER'
> >>> health_check_period = 5
> >>> health_check_timeout = 1
> >>> health_check_max_retries = 2
> >>> health_check_retry_delay = 10
> >>> load_balance_mode = off
> >>
> >>
> >> On Linux64, the master server is running fine, there is no load balancing
> >> and no failover, and suddenly the slave server appears to be disconnected
> >> because of a network interruption or any other reason (I mimic it by
> >> forcefully shutting down the db server via immediate mode, etc.). After
> >> that, no connection attempt to pgpool succeeds until the health check
> >> completes, and the master database server log shows the following
> >> messages, i.e.
> >>
> >> dbserver.log
> >> ...
> >> ...
> >> LOG: incomplete startup packet
> >> LOG: incomplete startup packet
> >> LOG: incomplete startup packet
> >> LOG: incomplete startup packet
> >> LOG: incomplete startup packet
> >> ...
> >>
> >> 3. While testing this scenario on my Mac OS X machine (gcc), it seems that
> >> the health check never completes and loops endlessly with the same pgpool
> >> configuration settings as in issue #2 above, and it completely prevents me
> >> from connecting to pgpool any more, i.e.
> >>
> >> pgpool.log
> >>
> >>> ...
> >>> ...
> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: retrying *679* th health checking
> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: health_check: 0 th DB node status: 2
> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: pool_ssl: SSL requested but SSL support is not available
> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: auth kind: 0
> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: backend key data received
> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: transaction state: I
> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: health_check: 1 th DB node status: 2
> >>> 2013-04-03 11:29:29 ERROR: pid 44263: connect_inet_domain_socket: getsockopt() detected error: Connection refused
> >>> 2013-04-03 11:29:29 ERROR: pid 44263: make_persistent_db_connection: connection to localhost(7445) failed
> >>> 2013-04-03 11:29:29 ERROR: pid 44263: health check failed. 1 th host localhost at port 7445 is down
> >>> 2013-04-03 11:29:29 LOG: pid 44263: health_check: 1 failover is canceld because failover is disallowed
> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: retrying *680* th health checking
> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: health_check: 0 th DB node status: 2
> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: pool_ssl: SSL requested but SSL support is not available
> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: auth kind: 0
> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: backend key data received
> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: transaction state: I
> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: health_check: 1 th DB node status: 2
> >>> 2013-04-03 11:29:34 ERROR: pid 44263: connect_inet_domain_socket: getsockopt() detected error: Connection refused
> >>> 2013-04-03 11:29:34 ERROR: pid 44263: make_persistent_db_connection: connection to localhost(7445) failed
> >>> 2013-04-03 11:29:34 ERROR: pid 44263: health check failed. 1 th host localhost at port 7445 is down
> >>> 2013-04-03 11:29:34 LOG: pid 44263: health_check: 1 failover is canceld because failover is disallowed
> >>> ...
> >>> ...
> >>
> >>
> >> I will try it on a Linux64 machine too. Thanks.
> >>
> >> Best Regards,
> >> Asif Naeem
> >>
> >>
>