[pgpool-committers: 3745] Re: pgpool: Fix usage of wait(2) in pgpool main process
Tatsuo Ishii
ishii at sraoss.co.jp
Wed Jan 4 15:39:29 JST 2017
> Hi ishii-San
>
> I am looking into the issue
> http://www.pgpool.net/mantisbt/view.php?id=249, where
> pgpool-II sometimes does not perform de-escalation while shutting down. According to
> the bug report, the issue started to appear after this commit.
>
> Although I am not able to replicate the exact reported issue, it seems
> that the changes made by this commit can leave zombie processes.
>
> Here we are replacing wait(NULL) with waitpid(-1, ..., WNOHANG):
>
> @@ -1365,8 +1367,10 @@ static RETSIGTYPE exit_handler(int sig)
> POOL_SETMASK(&UnBlockSig);
> do
> {
> - wpid = wait(NULL);
> - }while (wpid > 0 || (wpid == -1 && errno == EINTR));
> + int ret_pid;
> + wpid = waitpid(-1, &ret_pid, WNOHANG);
> + } while (wpid > 0 || (wpid == -1 && errno == EINTR));
>
> The problem with this logic is that after replacing wait(NULL) with
> waitpid(-1, ..., WNOHANG) we can move forward without waiting for all child
> processes to finish, especially if some child process takes a little longer
> to finish. With WNOHANG, waitpid() returns 0 to indicate that no child has
> exited at the moment, even though child processes still exist, and the loop
> ends.
> For example,
> at the time of system shutdown, the watchdog process sometimes takes a few
> seconds to execute the de-escalation process before exiting. Meanwhile, in
> the main process waitpid(WNOHANG) immediately returns 0, and the
> pgpool-II main process exits, leaving the watchdog process as a
> zombie.
You are right. I should not have used WNOHANG here. The line should
have been:
wpid = waitpid(-1, &ret_pid, 0);
> Also, could you share the scenario where you ran into the
> infinite wait situation? There may be some other issue in the code, since
> according to the wait() system call documentation it returns -1 when there
> are no child processes, so theoretically the wait() call should not cause an
> infinite wait.
I don't remember clearly, but it may be the case when a child receives a
stop signal (SIGSTOP).
> On Thu, Jul 7, 2016 at 11:55 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>
>> Fix usage of wait(2) in pgpool main process
>>
>> Per [pgpool-hackers: 1444]. Here is the copy of the message:
>>
>> Hi Usama,
>>
>> I have noticed that the usage of wait(2) in pgpool main could cause
>> an infinite wait in the system call.
>>
>> /* wait for all children to exit */
>> do
>> {
>> wpid = wait(NULL);
>> }while (wpid > 0 || (wpid == -1 && errno == EINTR));
>>
>> When a child process dies, a SIGCHLD signal is raised and wait(2) learns
>> of the event. However, multiple child deaths do not necessarily create
>> exactly the same number of SIGCHLD signals as the number of dead children,
>> and wait(2) could wait for an event which never happens in this case. I
>> actually encountered this situation while testing pgpool-II. The solution
>> is to use waitpid(2) instead of wait(2).
>>
>> Branch
>> ------
>> master
>>
>> Details
>> -------
>> http://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=0d1cdf96feb77de6f1dfc2d46ecd7467325d1f79
>>
>> Modified Files
>> --------------
>> src/main/pgpool_main.c | 12 ++++++++----
>> 1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> _______________________________________________
>> pgpool-committers mailing list
>> pgpool-committers at pgpool.net
>> http://www.pgpool.net/mailman/listinfo/pgpool-committers
>>