<div dir="ltr"><div><div><div><div><div><div><div><div><div>Hi Yugo,<br></div>Sorry for the late reply as I was busy on other project.<br></div>Actually, i cannot reproduce the problem anymore after I take out the failback,py on both pgpool node. <br>
</div>Now I can fail over and recover the failed node at will.<br><br></div>However, when I moved on to the next test case, shutting down the Primary server completely (executing reboot on the Primary server), failover failed.<br>
</div>Looking at the log on the Standby, I see that the failover script is being called; however, the %H parameter is missing from the failover command.<br><br>Mar 29 22:52:02 server0 pgpool[35742]: execute command: /home/pgpool/failover.py -d 0 -h server0 -p 5432 -D /opt/postgres/9.2/data -m -1 -H -M 0 -P 0 -r -R<br>
Fri Mar 29 22:52:02 2013 failover DEBUG: --><br>Fri Mar 29 22:52:02 2013 failover DEBUG: Invalid node ID<br>Fri Mar 29 22:52:02 2013 failover DEBUG: <--<br><br></div>The options are equivalent to the parameters listed in pgpool.conf.<br>
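For reference, the failover_command in my pgpool.conf is along these lines (reconstructed from the executed command above, so the exact flag names expected by failover.py are only illustrative):<br><br>failover_command = '/home/pgpool/failover.py -d %d -h %h -p %p -D %D -m %m -H %H -M %M -P %P -r %r -R %R'<br><br>Note that %r (new master port) and %R (new master cluster directory) also come out empty in the log, which looks consistent with the new master node not being resolved.<br>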
</div>In this case, the -H value is missing; %H should be the hostname of the new master node.<br><br></div>Do you have any idea why?<br><br></div>Thanks~<br>Ning<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">
On Thu, Mar 21, 2013 at 4:43 AM, Yugo Nagata <span dir="ltr"><<a href="mailto:nagata@sraoss.co.jp" target="_blank">nagata@sraoss.co.jp</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi ning,<br>
<br>
The samples of pgpool.conf and scripts (failover.sh, recovery_1st_stage<br>
and pgpool_remote_start) are available in the following document.<br>
<br>
"pgpool-II Tutorial [watchdog in master-slave mode]".<br>
<a href="http://www.pgpool.net/pgpool-web/contrib_docs/watchdog_master_slave/en.html" target="_blank">http://www.pgpool.net/pgpool-web/contrib_docs/watchdog_master_slave/en.html</a><br>
<br>
Could you please try to reproduce the problem using these scripts?<br>
<br>
You would have to edit the scripts because some variables, such as the port<br>
number and install directory, are hard-coded in them.<br>
<div class="HOEnZb"><div class="h5"><br>
On Tue, 19 Mar 2013 12:57:57 -0500<br>
ning chan <<a href="mailto:ninchan8328@gmail.com">ninchan8328@gmail.com</a>> wrote:<br>
<br>
> Hi Yugo,<br>
><br>
> You are correct: failover.py is simply used to detect whether the current<br>
> node will be the NEW Primary, and it creates the trigger file if it is.<br>
> failback.py is used to fail back the failed server: pg_start_backup,<br>
> rsyncing the files from the Primary to local, touching recovery.conf,<br>
> then pg_stop_backup.<br>
> And pgpool_remote_start basically just starts up the remote DB.<br>
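><br>
> (A minimal sketch of the idea behind failover.py; the option handling and<br>
> trigger-file path below are only illustrative, since the real script uses<br>
> our in-house dbllib helpers:)<br>
><br>
> #!/usr/bin/env python<br>
> # failover.py: touch the trigger file if this host is the new primary<br>
> import argparse<br>
> import socket<br>
><br>
> # add_help=False because -h is the DB host option here, not help<br>
> parser = argparse.ArgumentParser(add_help=False)<br>
> parser.add_argument('-H', dest='new_master')   # hostname of the new master (pgpool %H)<br>
> parser.add_argument('-d', dest='failed_node')  # id of the failed node (pgpool %d)<br>
> args, _ = parser.parse_known_args()<br>
><br>
> # Create the trigger file only if pgpool chose this host as the new primary<br>
> if args.new_master == socket.gethostname():<br>
>     open('/tmp/pg_trigger_file', 'w').close()<br>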
><br>
> The additional start/stop of the DB engine in the above scripts may not be<br>
> necessary, but it shouldn't hurt at all, and they are database-engine<br>
> related, I would think.<br>
><br>
> Sorry, dbllib is an in-house library that I may not be able to share.<br>
><br>
> Meanwhile, do you have sample failover/failback scripts that you can share<br>
> so I can reproduce the problem for you? I will also try to look for some<br>
> online.<br>
><br>
> Thanks~<br>
> Ning<br>
><br>
><br>
> On Mon, Mar 18, 2013 at 6:13 AM, Yugo Nagata <<a href="mailto:nagata@sraoss.co.jp">nagata@sraoss.co.jp</a>> wrote:<br>
><br>
> > Hi ning,<br>
> ><br>
> > Sorry for the delay in response, but unfortunately I haven't been able to<br>
> > reproduce it.<br>
> ><br>
> > In failover.py/failback.py, the following error message occurs:<br>
> > ImportError: No module named dbllib<br>
> ><br>
> > I can't find the module dbllib in the yum repositories.<br>
> > What should I install on my machine, and how?<br>
> ><br>
> > BTW, could you please tell me what your scripts' purposes are?<br>
> > I guess failover.py is just for touching a trigger file.<br>
> > However, I cannot understand what failback.py is for.<br>
> ><br>
> > In addition, in your pgpool_remote_start, the backend DB is stopped before<br>
> > being started.<br>
> > However, pgpool_remote_start doesn't have to stop the backend DB because<br>
> > the DB should already be stopped after basebackup. Also, basebackup.sh<br>
> > doesn't have to stop & start the backend DB. Is there any special intent<br>
> > behind these stops & starts?<br>
> ><br>
> > Information about the scripts might help in solving the problem.<br>
> ><br>
> ><br>
> > On Fri, 8 Mar 2013 00:28:20 -0600<br>
> > ning chan <<a href="mailto:ninchan8328@gmail.com">ninchan8328@gmail.com</a>> wrote:<br>
> ><br>
> > > Hi Yugo,<br>
> > > Thanks for looking at the issue. Here are the exact steps I took to run<br>
> > > into the problem.<br>
> > > 1) make sure replication is set up and pgpool on both servers has the<br>
> > > backend status value set to 2<br>
> > > 2) shut down postgresql on the primary; this will promote the standby<br>
> > > (server1) to become the new primary<br>
> > > 3) execute pcp_recovery on server1, which will recover the failed node<br>
> > > (server0) and connect it to the new primary (server1); check the backend<br>
> > > status value (an example invocation follows this list)<br>
> > > 4) shut down postgresql on server1 (the new Primary); this should promote<br>
> > > server0 to become primary again<br>
> > > 5) execute pcp_recovery on server0, which will recover the failed node<br>
> > > (server1) and connect it to the new primary (server0 again); check the<br>
> > > backend status value<br>
> > > 6) go to server1, shut down pgpool, and start it up again; pgpool at this<br>
> > > point will not be able to start anymore, and a server reboot is required<br>
> > > to bring pgpool online.<br>
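> > ><br>
> > > For reference, the pcp_recovery invocations in steps 3 and 5 were along<br>
> > > these lines (timeout, PCP port, user and password here are placeholders;<br>
> > > the last argument is the id of the failed backend node):<br>
> > > /usr/local/bin/pcp_recovery_node 10 server1 9898 pgpool [passwd] 0<br>
> > > /usr/local/bin/pcp_recovery_node 10 server0 9898 pgpool [passwd] 1<br>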
> > ><br>
> > > I attached the db-server0 and db-server1 logs, to which I also redirected<br>
> > > all the commands (search for 'Issue command') I executed in the above<br>
> > > steps, so you should be able to follow them very easily.<br>
> > > I also attached my postgresql and pgpool conf files as well as my<br>
> > > basebackup.sh and remote start script, just in case you need them to<br>
> > > reproduce the problem.<br>
> > ><br>
> > > Thanks~<br>
> > > Ning<br>
> > ><br>
> > ><br>
> > > On Thu, Mar 7, 2013 at 6:01 AM, Yugo Nagata <<a href="mailto:nagata@sraoss.co.jp">nagata@sraoss.co.jp</a>> wrote:<br>
> > ><br>
> > > > Hi ning,<br>
> > > ><br>
> > > > I tried to reproduce the bind error by repeatedly starting/stopping<br>
> > > > pgpool with both watchdogs enabled, but I cannot see the error.<br>
> > > ><br>
> > > > Can you tell me a reliable way to reproduce it?<br>
> > > ><br>
> > > ><br>
> > > > On Wed, 6 Mar 2013 11:21:01 -0600<br>
> > > > ning chan <<a href="mailto:ninchan8328@gmail.com">ninchan8328@gmail.com</a>> wrote:<br>
> > > ><br>
> > > > > Hi Tatsuo,<br>
> > > > ><br>
> > > > > Do you need any more data for your investigation?<br>
> > > > ><br>
> > > > > Thanks~<br>
> > > > > Ning<br>
> > > > ><br>
> > > > ><br>
> > > > > On Mon, Mar 4, 2013 at 4:08 PM, ning chan <<a href="mailto:ninchan8328@gmail.com">ninchan8328@gmail.com</a>><br>
> > wrote:<br>
> > > > ><br>
> > > > > > Hi Tatsuo,<br>
> > > > > > I shut down one watchdog instead of both, and I can't reproduce the<br>
> > > > > > problem.<br>
> > > > > ><br>
> > > > > > Here is the details:<br>
> > > > > > server0 pgpool watchdog is disabled<br>
> > > > > > server1 pgpool watchdog is enabled, and it is the primary database<br>
> > > > > > for streaming replication; failover & failback work just fine, except<br>
> > > > > > that the virtual IP will not be migrated to the other pgpool server<br>
> > > > > > because the watchdog on server0 is not running.<br>
> > > > > ><br>
> > > > > > FYI: as I reported in the other email thread, running watchdog on<br>
> > > > > > both servers will not allow me to fail over & fail back more than<br>
> > > > > > once; I am still looking for the root cause.<br>
> > > > > ><br>
> > > > > > 1) both nodes show pool_nodes state 2<br>
> > > > > > 2) shut down the database on server1, which causes the DB to fail<br>
> > > > > > over to server0; server0 is now primary<br>
> > > > > > 3) execute pcp_recovery on server0 to bring the failed server1<br>
> > > > > > database back online and connect it to server0 as a standby;<br>
> > > > > > however, pool_nodes on server1 shows the following:<br>
> > > > > > [root@server1 data]# psql -c "show pool_nodes" -p 9999<br>
> > > > > > node_id | hostname | port | status | lb_weight | role<br>
> > > > > > ---------+----------+------+--------+-----------+---------<br>
> > > > > > 0 | server0 | 5432 | 2 | 0.500000 | primary<br>
> > > > > > 1 | server1 | 5432 | 3 | 0.500000 | standby<br>
> > > > > > (2 rows)<br>
> > > > > ><br>
> > > > > > As shown, server1's pgpool thinks it is in state 3.<br>
> > > > > > Replication, however, is working fine.<br>
> > > > > ><br>
> > > > > > 4) I have to execute pcp_attach_node on server1 to bring its<br>
> > > > > > pool_nodes state to 2; however, server0's pool_nodes info about<br>
> > > > > > server1 becomes 3. See below for both servers' output:<br>
> > > > > > [root@server1 data]# psql -c "show pool_nodes" -p 9999<br>
> > > > > > node_id | hostname | port | status | lb_weight | role<br>
> > > > > > ---------+----------+------+--------+-----------+---------<br>
> > > > > > 0 | server0 | 5432 | 2 | 0.500000 | primary<br>
> > > > > > 1 | server1 | 5432 | 2 | 0.500000 | standby<br>
> > > > > ><br>
> > > > > > [root@server0 ~]# psql -c "show pool_nodes" -p 9999<br>
> > > > > > node_id | hostname | port | status | lb_weight | role<br>
> > > > > > ---------+----------+------+--------+-----------+---------<br>
> > > > > > 0 | server0 | 5432 | 2 | 0.500000 | primary<br>
> > > > > > 1 | server1 | 5432 | 3 | 0.500000 | standby<br>
> > > > > ><br>
> > > > > ><br>
> > > > > > 5) executing the following command on server1 will bring the server1<br>
> > > > > > status to 2 on both nodes:<br>
> > > > > > /usr/local/bin/pcp_attach_node 10 server0 9898 pgpool [passwd] 1<br>
> > > > > ><br>
> > > > > > [root@server1 data]# psql -c "show pool_nodes" -p 9999<br>
> > > > > > node_id | hostname | port | status | lb_weight | role<br>
> > > > > > ---------+----------+------+--------+-----------+---------<br>
> > > > > > 0 | server0 | 5432 | 2 | 0.500000 | primary<br>
> > > > > > 1 | server1 | 5432 | 2 | 0.500000 | standby<br>
> > > > > ><br>
> > > > > > [root@server0 ~]# psql -c "show pool_nodes" -p 9999<br>
> > > > > > node_id | hostname | port | status | lb_weight | role<br>
> > > > > > ---------+----------+------+--------+-----------+---------<br>
> > > > > > 0 | server0 | 5432 | 2 | 0.500000 | primary<br>
> > > > > > 1 | server1 | 5432 | 2 | 0.500000 | standby<br>
> > > > > ><br>
> > > > > > Please advise the next step.<br>
> > > > > ><br>
> > > > > > Thanks~<br>
> > > > > > Ning<br>
> > > > > ><br>
> > > > > ><br>
> > > > > > On Sun, Mar 3, 2013 at 6:03 PM, Tatsuo Ishii <<a href="mailto:ishii@postgresql.org">ishii@postgresql.org</a><br>
> > ><br>
> > > > wrote:<br>
> > > > > ><br>
> > > > > >> > Mar 1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success<br>
> > > > > >><br>
> > > > > >> This error message seems pretty strange. ":" should be something like<br>
> > > > > >> "/tmp/.s.PGSQL.9898". Also it's weird that it says "failed. reason:<br>
> > > > > >> Success". To isolate the problem, can you please disable watchdog and<br>
> > > > > >> try again?<br>
> > > > > >> --<br>
> > > > > >> Tatsuo Ishii<br>
> > > > > >> SRA OSS, Inc. Japan<br>
> > > > > >> English: <a href="http://www.sraoss.co.jp/index_en.php" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>
> > > > > >> Japanese: <a href="http://www.sraoss.co.jp" target="_blank">http://www.sraoss.co.jp</a><br>
> > > > > >><br>
> > > > > >><br>
> > > > > >> > Hi All,<br>
> > > > > >> > After upgrading to pgpool-II 3.2.3, I tested my failover/failback<br>
> > > > > >> > setup and started/stopped pgpool multiple times, and I see one of<br>
> > > > > >> > the pgpool instances go into an unrecoverable state.<br>
> > > > > >> ><br>
> > > > > >> > Mar 1 10:45:25 server1 pgpool[3007]: received smart shutdown request<br>
> > > > > >> > Mar 1 10:45:25 server1 pgpool[3007]: watchdog_pid: 3010<br>
> > > > > >> > Mar 1 10:45:31 server1 pgpool[3338]: wd_chk_sticky: ifup[/sbin/ip] doesn't have sticky bit<br>
> > > > > >> > Mar 1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success<br>
> > > > > >> > Mar 1 10:45:31 server1 pgpool[3339]: unlink(/tmp/.s.PGSQL.9898) failed: No such file or directory<br>
> > > > > >> ><br>
> > > > > >> ><br>
> > > > > >> > netstat shows the following:<br>
> > > > > >> > [root@server1 ~]# netstat -na |egrep "9898|9999"<br>
> > > > > >> > tcp 0 0 0.0.0.0:9898 0.0.0.0:* LISTEN<br>
> > > > > >> > tcp 0 0 0.0.0.0:9999 0.0.0.0:* LISTEN<br>
> > > > > >> > tcp 0 0 172.16.6.154:46650 172.16.6.153:9999 TIME_WAIT<br>
> > > > > >> > tcp 9 0 172.16.6.154:9999 172.16.6.153:51868 CLOSE_WAIT<br>
> > > > > >> > tcp 9 0 172.16.6.154:9999 172.16.6.153:51906 CLOSE_WAIT<br>
> > > > > >> > tcp 0 0 172.16.6.154:9999 172.16.6.154:50624 TIME_WAIT<br>
> > > > > >> > tcp 9 0 172.16.6.154:9999 172.16.6.153:51946 CLOSE_WAIT<br>
> > > > > >> > unix 2 [ ACC ] STREAM LISTENING 18698 /tmp/.s.PGSQL.9898<br>
> > > > > >> > unix 2 [ ACC ] STREAM LISTENING 18685 /tmp/.s.PGSQL.9999<br>
> > > > > >> ><br>
> > > > > >> > Is this a known issue?<br>
> > > > > >> ><br>
> > > > > >> > I have to reboot the server in order to bring pgpool back online.<br>
> > > > > >> ><br>
> > > > > >> > My cluster has two servers (server0 & server1), each of which runs<br>
> > > > > >> > pgpool and PostgreSQL with a streaming replication setup.<br>
> > > > > >> ><br>
> > > > > >> > Thanks~<br>
> > > > > >> > Ning<br>
> > > > > >><br>
> > > > > ><br>
> > > > > ><br>
> > > ><br>
> > > ><br>
> > > > --<br>
> > > > Yugo Nagata <<a href="mailto:nagata@sraoss.co.jp">nagata@sraoss.co.jp</a>><br>
> > > ><br>
> ><br>
> ><br>
> > --<br>
> > Yugo Nagata <<a href="mailto:nagata@sraoss.co.jp">nagata@sraoss.co.jp</a>><br>
> ><br>
<br>
<br>
</div></div><span class="HOEnZb"><font color="#888888">--<br>
Yugo Nagata <<a href="mailto:nagata@sraoss.co.jp">nagata@sraoss.co.jp</a>><br>
</font></span></blockquote></div><br></div>