[pgpool-committers: 3756] Re: pgpool: Fix usage of wait(2) in pgpool main process
Muhammad Usama
m.usama at gmail.com
Thu Jan 5 00:15:52 JST 2017
On Wed, Jan 4, 2017 at 11:39 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> > Hi Ishii-san,
> >
> > I am looking into the issue
> > http://www.pgpool.net/mantisbt/view.php?id=249, where
> > pgpool-II sometimes does not perform de-escalation while shutting down.
> > As per the bug report, the issue started to appear after this commit.
> >
> > Although I am not able to reproduce the exact reported issue, it seems
> > that the changes made by this commit can leave zombie processes.
> >
> > The commit replaces wait(NULL) with waitpid(..., WNOHANG):
> >
> > @@ -1365,8 +1367,10 @@ static RETSIGTYPE exit_handler(int sig)
> > POOL_SETMASK(&UnBlockSig);
> > do
> > {
> > - wpid = wait(NULL);
> > - }while (wpid > 0 || (wpid == -1 && errno == EINTR));
> > + int ret_pid;
> > + wpid = waitpid(-1, &ret_pid, WNOHANG);
> > + } while (wpid > 0 || (wpid == -1 && errno == EINTR));
> >
> > The problem with this logic is that, with waitpid(..., WNOHANG) in place
> > of wait(NULL), the main process can move on without waiting for all child
> > processes to finish, especially if a child process takes a little longer
> > to exit. With WNOHANG, waitpid() returns 0 when no child has exited yet,
> > even though child processes still exist, and a return value of 0 ends
> > the loop.
> > For example, at system shutdown the watchdog process sometimes takes a
> > few seconds to run the de-escalation process before exiting. Meanwhile,
> > in the main process, waitpid(..., WNOHANG) returns 0 and the pgpool-II
> > main process exits, leaving the watchdog process as a zombie.
>
> You are right. I should not have used WNOHANG here. The line should
> have been:
>
> wpid = waitpid(-1, &ret_pid, 0);
>
Thanks for the confirmation. I have committed this change.
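
To illustrate the difference between the two calls, here is a minimal sketch, not the pgpool-II code itself. The helper names (spawn_slow_child, reap_nonblocking, reap_blocking) are hypothetical, and the sleeping child only stands in for a watchdog process that needs a moment to finish de-escalation.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that needs `seconds` to finish,
 * standing in for a watchdog process running de-escalation.
 * Returns the child's pid in the parent, or -1 if fork() fails. */
static pid_t spawn_slow_child(unsigned int seconds)
{
    pid_t pid = fork();

    if (pid == 0)
    {
        sleep(seconds);
        _exit(0);
    }
    return pid;
}

/* Same loop shape as the WNOHANG version: waitpid() returns 0 as soon
 * as no child has exited yet, which ends the loop even though children
 * are still running. Returns the number of children reaped. */
static int reap_nonblocking(void)
{
    int   status;
    int   reaped = 0;
    pid_t wpid;

    do
    {
        wpid = waitpid(-1, &status, WNOHANG);
        if (wpid > 0)
            reaped++;
    } while (wpid > 0 || (wpid == -1 && errno == EINTR));
    return reaped;
}

/* Same loop with options set to 0: each call blocks until a child
 * exits, and the loop ends only when waitpid() fails with ECHILD
 * (no children left). Returns the number of children reaped. */
static int reap_blocking(void)
{
    int   status;
    int   reaped = 0;
    pid_t wpid;

    do
    {
        wpid = waitpid(-1, &status, 0);
        if (wpid > 0)
            reaped++;
    } while (wpid > 0 || (wpid == -1 && errno == EINTR));
    return reaped;
}
```

Calling spawn_slow_child(1) and then reap_nonblocking() right away should reap nothing, because the child is still sleeping. A following reap_blocking() waits for the child and reaps it.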
Regards
Muhammad Usama
> > Also, could you share the scenario in which you ran into the infinite
> > wait? There may be some other issue in the code, since per the wait()
> > system call documentation it returns -1 when there is no child process,
> > so in theory the wait() call should not wait forever.
>
> I do not remember clearly, but it may have been the case where a child
> receives a stop signal (SIGSTOP).
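
The SIGSTOP case is only a guess in this thread, but the mechanism behind it can be sketched. A stopped child still counts as an existing child, and wait() or waitpid() without WUNTRACED does not report it, so the call keeps blocking until the child is continued or killed. The helper names below are hypothetical and the code is an illustration, not taken from pgpool-II.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that stops itself with SIGSTOP
 * and exits with status 0 once it is continued. */
static pid_t spawn_self_stopping_child(void)
{
    pid_t pid = fork();

    if (pid == 0)
    {
        raise(SIGSTOP);
        _exit(0);
    }
    return pid;
}

/* Block until the child stops or terminates. WUNTRACED is what makes
 * waitpid() report a stopped child; without it the call would keep
 * blocking while the child sits in the stopped state.
 * Returns 1 if stopped, 0 if terminated, -1 on error. */
static int wait_until_stopped(pid_t pid)
{
    int   status;
    pid_t r;

    do
    {
        r = waitpid(pid, &status, WUNTRACED);
    } while (r == -1 && errno == EINTR);

    if (r != pid)
        return -1;
    return WIFSTOPPED(status) ? 1 : 0;
}

/* Continue the stopped child and reap it with a blocking waitpid().
 * Returns the child's exit status, or -1 on error. */
static int resume_and_reap(pid_t pid)
{
    int   status;
    pid_t r;

    if (kill(pid, SIGCONT) == -1)
        return -1;

    do
    {
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);

    if (r != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

If the main process never sends SIGCONT (or SIGKILL) to such a child, a plain wait(NULL) on it has nothing to return, which would match the infinite wait described above.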
>
> > On Thu, Jul 7, 2016 at 11:55 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> >
> >> Fix usage of wait(2) in pgpool main process
> >>
> >> Per [pgpool-hackers: 1444]. Here is the copy of the message:
> >>
> >> Hi Usama,
> >>
> >> I have noticed that the usage of wait(2) in pgpool main could cause
> >> infinite wait in the system call.
> >>
> >> /* wait for all children to exit */
> >> do
> >> {
> >> wpid = wait(NULL);
> >> }while (wpid > 0 || (wpid == -1 && errno == EINTR));
> >>
> >> When a child process dies, a SIGCHLD signal is raised and wait(2) learns
> >> of the event. However, multiple child deaths do not necessarily create
> >> exactly as many SIGCHLD signals as there are dead children, and
> >> wait(2) could wait for an event which never happens in this case. I
> >> actually encountered this situation while testing pgpool-II. The solution
> >> is to use waitpid(2) instead of wait(2).
> >>
> >> Branch
> >> ------
> >> master
> >>
> >> Details
> >> -------
> >> http://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=0d1cdf96feb77de6f1dfc2d46ecd7467325d1f79
> >>
> >> Modified Files
> >> --------------
> >> src/main/pgpool_main.c | 12 ++++++++----
> >> 1 file changed, 8 insertions(+), 4 deletions(-)
> >>
> >> _______________________________________________
> >> pgpool-committers mailing list
> >> pgpool-committers at pgpool.net
> >> http://www.pgpool.net/mailman/listinfo/pgpool-committers
> >>
>