[pgpool-hackers: 3466] Re: health check timeout does work in certain case

Mon Oct 21 13:55:44 JST 2019

Fix committed in 3.7 and above.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

From: Tatsuo Ishii <ishii at sraoss.co.jp>
Subject: [pgpool-hackers: 3458] health check timeout does work in certain case
Date: Wed, 16 Oct 2019 15:09:10 +0900 (JST)
Message-ID: <20191016.150910.544077412377095882.t-ishii at sraoss.co.jp>

> I have been playing with health check and found that it does not work in certain cases.
> 
> I sent SIGSTOP to the postmaster process of one of the backend nodes
> to freeze it. I expected the health check process to detect this when
> the health check timer expired. However, the health check process waits
> forever here:
> 
> (gdb) bt
> #0  0x00007f094a7a234e in __libc_read (fd=6,
>     buf=buf@entry=0x564a3aa3a2c0 <readbuf>, nbytes=nbytes@entry=1024)
>     at ../sysdeps/unix/sysv/linux/read.c:27
> #1  0x0000564a3a68dd70 in read (__nbytes=1024, __buf=0x564a3aa3a2c0 <readbuf>,
>     __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
> #2  pool_read (cp=cp@entry=0x7f094adf2268, buf=buf@entry=0x7fff6c221786,
>     len=len@entry=1) at utils/pool_stream.c:194
> #3  0x0000564a3a68e101 in pool_read_with_error (cp=0x7f094adf2268,
>     buf=buf@entry=0x7fff6c221786, len=len@entry=1,
>     err_context=err_context@entry=0x564a3a700c90 "authentication message response type") at utils/pool_stream.c:141
> #4  0x0000564a3a649761 in connection_do_auth (cp=cp@entry=0x564a3c22a640,
>     password=password@entry=0x564a3c22a5f0 "md5a16f9d87e344969ec59de417447348b3") at auth/pool_auth.c:104
> #5  0x0000564a3a6565e8 in make_persistent_db_connection (
>     db_node_id=db_node_id@entry=1,
>     hostname=hostname@entry=0x7f094ae0b280 "/tmp", port=port@entry=11003,
>     dbname=dbname@entry=0x564a3c21a4a8 "postgres",
>     user=user@entry=0x564a3c21b7a8 "t-ishii",
>     password=password@entry=0x564a3c22a5f0 "md5a16f9d87e344969ec59de417447348b3", retry=0 '\000') at protocol/child.c:1440
> #6  0x0000564a3a65670d in make_persistent_db_connection_noerror (
>     db_node_id=db_node_id@entry=1,
> ---Type <return> to continue, or q <return> to quit
> 
> Frame #2 is at this line in pool_stream.c:
> 
> 			readlen = read(cp->fd, readbuf, READBUFSZ);
> 
> read(2) was in fact interrupted once by SIGALRM, as expected, but it
> was then called again and this time blocked for good, because of this
> code:
> 
> 			if (errno == EINTR || errno == EAGAIN)
> 			{
> 				ereport(DEBUG5,
> 						(errmsg("read on socket failed with error :\"%s\"", strerror(errno)),
> 						 errdetail("retrying...")));
> 				continue;
> 			}
> 
> As far as I remember, read(2) should be retried in all cases except
> the health check, so I would like to propose the attached patch to fix
> the issue. Comments are welcome.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp