Here are the log file and the strace output file. The configured timings are a 30 sec health check interval, a 5 sec timeout, and 2 retries with a 10 sec retry delay.<br><br>It takes a lot more than 5 sec to get from "starting health checking" to the 10 sec sleep before the first retry.<br>
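<br>To put a number on the delay, here is a small Python sketch that uses the two timestamps from the log excerpt quoted further down in this thread (not from the newly attached files). The variable names are mine, chosen only for illustration.<br>
<pre>
```python
from datetime import datetime

# Timestamps copied from the quoted pgpool log excerpt:
#   17:38:34  "starting health checking"
#   17:41:46  "health check retry sleep time: 10 second(s)"
FMT = "%Y-%m-%d %H:%M:%S"
started = datetime.strptime("2012-01-11 17:38:34", FMT)
retry_sleep = datetime.strptime("2012-01-11 17:41:46", FMT)

# Configured health check timeout from pgpool.conf, in seconds.
health_check_timeout_s = 5

elapsed_s = (retry_sleep - started).total_seconds()
ratio = elapsed_s / health_check_timeout_s
print(elapsed_s, ratio)
```
</pre>
In that excerpt the first attempt took 192 seconds, roughly 38 times the configured timeout.<br>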

<br>Looking at the code (main.c, health_check() function), each (retry) attempt contains an inner retry (it first tries the postgres database, then template1), and that part doesn't seem to be interrupted by the alarm.<br>
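<br>To illustrate the kind of bounded connect I have in mind, here is a minimal sketch in Python (not pgpool's C code, and not a patch). It assumes a non-blocking socket plus select() with a timeout, so the wait does not depend on the kernel TCP timeout or on an alarm signal. The function name connect_with_timeout is hypothetical.<br>
<pre>
```python
import errno
import select
import socket


def connect_with_timeout(host, port, timeout_s):
    """Return True if a TCP connection to host:port completes within timeout_s."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        # On a non-blocking socket connect_ex() returns at once,
        # normally with EINPROGRESS while the handshake is pending.
        err = sock.connect_ex((host, port))
        if err == 0:
            return True
        if err not in (errno.EINPROGRESS, errno.EWOULDBLOCK):
            return False  # immediate failure, e.g. connection refused
        # Wait until the socket becomes writable or the timeout expires.
        _, writable, _ = select.select([], [sock], [], timeout_s)
        if not writable:
            return False  # timed out: no answer from the host at all
        # Writable means the attempt finished; SO_ERROR says how.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0
    finally:
        sock.close()
```
</pre>
With something like this, a call such as connect_with_timeout("192.168.2.27", 5432, 5.0) would give up after about 5 seconds even when the host never answers, which is the behaviour I would expect from health_check_timeout.<br>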


<br>Regards,<br>Stevo.<br><br><div class="gmail_quote">2012/1/11 Tatsuo Ishii <span dir="ltr"><<a href="mailto:ishii@postgresql.org" target="_blank">ishii@postgresql.org</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Ok, I will do it. In the mean time you could use "strace -tt -p PID"<br>

to see which system call is blocked.<br>

<div><div>--<br>

Tatsuo Ishii<br>

SRA OSS, Inc. Japan<br>

English: <a href="http://www.sraoss.co.jp/index_en.php" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>

Japanese: <a href="http://www.sraoss.co.jp" target="_blank">http://www.sraoss.co.jp</a><br>

<br>

> OK, got the info - key point is that ip forwarding is disabled for security<br>

> reasons. Rules in iptables are not important, iptables can be stopped, or<br>

> previously added rules removed.<br>

><br>

> Here are the steps to reproduce (kudos to my colleague Nenad Bulatovic for<br>

> providing this):<br>

><br>

> 1.) make sure that ip forwarding is off:<br>

>     echo 0 > /proc/sys/net/ipv4/ip_forward<br>

> 2.) create IP alias on some interface (and have postgres listen on it):<br>

>     ip addr add x.x.x.x/yy dev ethz<br>

> 3.) set backend_hostname0 to aforementioned IP<br>

> 4.) start pgpool and monitor health checks<br>

> 5.) remove IP alias:<br>

>     ip addr del x.x.x.x/yy dev ethz<br>

><br>

><br>

> Here is the interesting part in pgpool log after this:<br>

> 2012-01-11 17:38:04 DEBUG: pid 24358: starting health checking<br>

> 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 0 th DB node status: 2<br>

> 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 1 th DB node status: 1<br>

> 2012-01-11 17:38:34 DEBUG: pid 24358: starting health checking<br>

> 2012-01-11 17:38:34 DEBUG: pid 24358: health_check: 0 th DB node status: 2<br>

> 2012-01-11 17:41:43 DEBUG: pid 24358: health_check: 0 th DB node status: 2<br>

> 2012-01-11 17:41:46 ERROR: pid 24358: health check failed. 0 th host<br>

> 192.168.2.27 at port 5432 is down<br>

> 2012-01-11 17:41:46 LOG:   pid 24358: health check retry sleep time: 10<br>

> second(s)<br>

><br>

> That pgpool was configured with health check interval of 30sec, 5sec<br>

> timeout, and 10sec retry delay with 2 max retries.<br>

><br>

> Making use of libpq instead for connecting to db in health checks IMO<br>

> should resolve it, but you'll best determine which call exactly gets<br>

> blocked waiting. Btw, psql with PGCONNECT_TIMEOUT env var configured<br>

> respects that env var timeout.<br>

><br>

> Regards,<br>

> Stevo.<br>

><br>

> On Wed, Jan 11, 2012 at 11:15 AM, Stevo Slavić <<a href="mailto:sslavic@gmail.com" target="_blank">sslavic@gmail.com</a>> wrote:<br>

><br>

>> Tatsuo,<br>

>><br>

>> Did you restart iptables after adding rule?<br>

>><br>

>> Regards,<br>

>> Stevo.<br>

>><br>

>><br>

>> On Wed, Jan 11, 2012 at 11:12 AM, Stevo Slavić <<a href="mailto:sslavic@gmail.com" target="_blank">sslavic@gmail.com</a>> wrote:<br>

>><br>

>>> Looking into this to verify if these are all necessary changes to have<br>

>>> port unreachable message silently rejected (suspecting some kernel<br>

>>> parameter tuning is needed).<br>

>>><br>

>>> Just to clarify it's not a problem that host is being detected by pgpool<br>

>>> to be down, but the timing when that happens. On environment where issue is<br>

>>> reproduced pgpool as part of health check attempt tries to connect to<br>

>>> backend and hangs for tcp timeout instead of being interrupted by timeout<br>

>>> alarm. Can you verify/confirm please the health check retry timings are not<br>

>>> delayed?<br>

>>><br>

>>> Regards,<br>

>>> Stevo.<br>

>>><br>

>>><br>

>>> On Wed, Jan 11, 2012 at 10:50 AM, Tatsuo Ishii <<a href="mailto:ishii@postgresql.org" target="_blank">ishii@postgresql.org</a>>wrote:<br>

>>><br>

>>>> Ok, I did:<br>

>>>><br>

>>>> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable<br>

>>>><br>

>>>> on the host where pgpool is running. And pull network cable from<br>

>>>> backend0 host network interface. Pgpool detected the host being down<br>

>>>> as expected...<br>

>>>> --<br>

>>>> Tatsuo Ishii<br>

>>>> SRA OSS, Inc. Japan<br>

>>>> English: <a href="http://www.sraoss.co.jp/index_en.php" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>

>>>> Japanese: <a href="http://www.sraoss.co.jp" target="_blank">http://www.sraoss.co.jp</a><br>

>>>><br>

>>>> > Backend is not destination of this message, pgpool host is, and we<br>

>>>> don't<br>

>>>> > want it to ever get it. With command I've sent you rule will be<br>

>>>> created for<br>

>>>> > any source and destination.<br>

>>>> ><br>

>>>> > Regards,<br>

>>>> > Stevo.<br>

>>>> ><br>

>>>> > On Wed, Jan 11, 2012 at 10:38 AM, Tatsuo Ishii <<a href="mailto:ishii@postgresql.org" target="_blank">ishii@postgresql.org</a>><br>

>>>> wrote:<br>

>>>> ><br>

>>>> >> I did following:<br>

>>>> >><br>

>>>> >> Do following on the host where pgpool is running on:<br>

>>>> >><br>

>>>> >> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable -d<br>

>>>> >> 133.137.177.124<br>

>>>> >> (133.137.177.124 is the host where backend is running on)<br>

>>>> >><br>

>>>> >> Pull network cable from backend0 host network interface. Pgpool<br>

>>>> >> detected the host being down as expected. Am I missing something?<br>

>>>> >> --<br>

>>>> >> Tatsuo Ishii<br>

>>>> >> SRA OSS, Inc. Japan<br>

>>>> >> English: <a href="http://www.sraoss.co.jp/index_en.php" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>

>>>> >> Japanese: <a href="http://www.sraoss.co.jp" target="_blank">http://www.sraoss.co.jp</a><br>

>>>> >><br>

>>>> >> > Hello Tatsuo,<br>

>>>> >> ><br>

>>>> >> > With backend0 on one host just configure following rule on other<br>

>>>> host<br>

>>>> >> where<br>

>>>> >> > pgpool is:<br>

>>>> >> ><br>

>>>> >> > iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable<br>

>>>> >> ><br>

>>>> >> > and then have pgpool startup with health checking and retrying<br>

>>>> >> configured,<br>

>>>> >> > and then pull network cable from backend0 host network interface.<br>

>>>> >> ><br>

>>>> >> > Regards,<br>

>>>> >> > Stevo.<br>

>>>> >> ><br>

>>>> >> > On Wed, Jan 11, 2012 at 6:27 AM, Tatsuo Ishii <<a href="mailto:ishii@postgresql.org" target="_blank">ishii@postgresql.org</a><br>

>>>> ><br>

>>>> >> wrote:<br>

>>>> >> ><br>

>>>> >> >> I want to try to test the situation you described:<br>

>>>> >> >><br>

>>>> >> >> >> > When system is configured for security reasons not to return<br>

>>>> >> >> destination<br>

>>>> >> >> >> > host unreachable messages, even though health_check_timeout is<br>

>>>> >> >><br>

>>>> >> >> But I don't know how to do it. I pulled out the network cable and<br>

>>>> >> >> pgpool detected it as expected. Also I configured the server which<br>

>>>> >> >> PostgreSQL is running on to disable the 5432 port. In this case<br>

>>>> >> >> connect(2) returned EHOSTUNREACH (No route to host) so pgpool<br>

>>>> detected<br>

>>>> >> >> the error as expected.<br>

>>>> >> >><br>

>>>> >> >> Could you please instruct me?<br>

>>>> >> >> --<br>

>>>> >> >> Tatsuo Ishii<br>

>>>> >> >> SRA OSS, Inc. Japan<br>

>>>> >> >> English: <a href="http://www.sraoss.co.jp/index_en.php" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>

>>>> >> >> Japanese: <a href="http://www.sraoss.co.jp" target="_blank">http://www.sraoss.co.jp</a><br>

>>>> >> >><br>

>>>> >> >> > Hello Tatsuo,<br>

>>>> >> >> ><br>

>>>> >> >> > Thank you for replying!<br>

>>>> >> >> ><br>

>>>> >> >> > I'm not sure what exactly is blocking, just by pgpool code<br>

>>>> analysis I<br>

>>>> >> >> > suspect it is the part where a connection is made to the db and<br>

>>>> it<br>

>>>> >> >> doesn't<br>

>>>> >> >> > seem to get interrupted by alarm. Tested thoroughly health check<br>

>>>> >> >> behaviour,<br>

>>>> >> >> > it works really well when host/ip is there and just<br>

>>>> backend/postgres<br>

>>>> >> is<br>

>>>> >> >> > down, but not when backend host/ip is down. I could see in log<br>

>>>> that<br>

>>>> >> >> initial<br>

>>>> >> >> > health check and each retry got delayed when host/ip is not<br>

>>>> reachable,<br>

>>>> >> >> > while when just backend is not listening (is down) on the<br>

>>>> reachable<br>

>>>> >> >> host/ip<br>

>>>> >> >> > then initial health check and all retries are exact to the<br>

>>>> settings in<br>

>>>> >> >> > pgpool.conf.<br>

>>>> >> >> ><br>

>>>> >> >> > PGCONNECT_TIMEOUT is listed as one of the libpq environment<br>

>>>> variables<br>

>>>> >> in<br>

>>>> >> >> > the docs (see<br>

>>>> >> >> <a href="http://www.postgresql.org/docs/9.1/static/libpq-envars.html" target="_blank">http://www.postgresql.org/docs/9.1/static/libpq-envars.html</a> )<br>

>>>> >> >> > There is equivalent parameter in libpq PGconnectdbParams ( see<br>

>>>> >> >> ><br>

>>>> >> >><br>

>>>> >><br>

>>>> <a href="http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT" target="_blank">http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT</a><br>


>>>> >> >> )<br>

>>>> >> >> > At the beginning of that same page there are some important<br>

>>>> infos on<br>

>>>> >> >> using<br>

>>>> >> >> > these functions.<br>

>>>> >> >> ><br>

>>>> >> >> > psql respects PGCONNECT_TIMEOUT.<br>

>>>> >> >> ><br>

>>>> >> >> > Regards,<br>

>>>> >> >> > Stevo.<br>

>>>> >> >> ><br>

>>>> >> >> > On Wed, Jan 11, 2012 at 12:13 AM, Tatsuo Ishii <<br>

>>>> <a href="mailto:ishii@postgresql.org" target="_blank">ishii@postgresql.org</a>><br>

>>>> >> >> wrote:<br>

>>>> >> >> ><br>

>>>> >> >> >> > Hello pgpool community,<br>

>>>> >> >> >> ><br>

>>>> >> >> >> > When system is configured for security reasons not to return<br>

>>>> >> >> destination<br>

>>>> >> >> >> > host unreachable messages, even though health_check_timeout is<br>

>>>> >> >> >> configured,<br>

>>>> >> >> >> > socket call will block and alarm will not get raised until TCP<br>

>>>> >> timeout<br>

>>>> >> >> >> > occurs.<br>

>>>> >> >> >><br>

>>>> >> >> >> Interesting. So are you saying that read(2) cannot be<br>

>>>> interrupted by<br>

>>>> >> >> >> alarm signal if the system is configured not to return<br>

>>>> destination<br>

>>>> >> >> >> host unreachable message? Could you please guide me where I can<br>

>>>> get<br>

>>>> >> >> >> such that info? (I'm not a network expert).<br>

>>>> >> >> >><br>

>>>> >> >> >> > Not a C programmer, found some info that select call could be<br>

>>>> >> replace<br>

>>>> >> >> >> with<br>

>>>> >> >> >> > select/pselect calls. Maybe it would be best if<br>

>>>> PGCONNECT_TIMEOUT<br>

>>>> >> >> value<br>

>>>> >> >> >> > could be used here for connection timeout. pgpool has libpq as<br>

>>>> >> >> >> dependency,<br>

>>>> >> >> >> > why isn't it using libpq for the healthcheck db connect<br>

>>>> calls, then<br>

>>>> >> >> >> > PGCONNECT_TIMEOUT would be applied?<br>

>>>> >> >> >><br>

>>>> >> >> >> I don't think libpq uses select/pselect for establishing<br>

>>>> connection,<br>

>>>> >> >> >> but using libpq instead of homebrew code seems to be an idea.<br>

>>>> Let me<br>

>>>> >> >> >> think about it.<br>

>>>> >> >> >><br>

>>>> >> >> >> One question. Are you sure that libpq can deal with the case<br>

>>>> (not to<br>

>>>> >> >> >> return destination host unreachable messages) by using<br>

>>>> >> >> >> PGCONNECT_TIMEOUT?<br>

>>>> >> >> >> --<br>

>>>> >> >> >> Tatsuo Ishii<br>

>>>> >> >> >> SRA OSS, Inc. Japan<br>

>>>> >> >> >> English: <a href="http://www.sraoss.co.jp/index_en.php" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>

>>>> >> >> >> Japanese: <a href="http://www.sraoss.co.jp" target="_blank">http://www.sraoss.co.jp</a><br>

>>>> >> >> >><br>

>>>> >> >><br>

>>>> >><br>

>>>><br>

>>><br>

>>><br>

>><br>

</div></div></blockquote></div><br>