[pgpool-hackers: 224] Re: Unexpected behavior of pgpool-II

Asif Naeem anaeem.it at gmail.com
Mon Apr 29 18:05:16 JST 2013


Please find attached file, It contains more feedback from customer. Thanks.

Best Regards,
Asif Naeem

On Sun, Apr 28, 2013 at 3:25 PM, Asif Naeem <anaeem.it at gmail.com> wrote:

> Hi Tatsuo,
>
> Please find the attach file, it contains pgpool log information with
> details. Thanks.
>
> Best Regards,
> Asif Naeem
>
> On Mon, Apr 22, 2013 at 3:34 AM, Tatsuo Ishii <ishii at postgresql.org>wrote:
>
>> Can you please show pgpool log?
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>
>> > Hi,
>> >
>> > There is unexpected behavior in pgpool2 that seems to be a bug. Please
>> find
>> > the details as following i.e.
>> >
>> > Phenomenon
>> >
>> >> pgpool2 configuration option "follow_master_command" special character
>> %H
>> >> (Hostname of the new master node) is not working as expected
>> >> In master slave mode after failover, 2nd node become new master but got
>> >> unexpected hostname (via %H). Please find attached pgpool.conf for
>> details.
>> >
>> >
>> > Versions
>> >
>> >> OS version: RHEL - 64
>> >> pgpool version: pgpool-II version 3.2.1 (namameboshi)
>> >
>> >
>> > *Details/Observations* :-
>> >
>> >> At Start
>> >
>> >>  node_id | hostname | port | status | lb_weight |  role
>> >> ---------+----------+------+--------+-----------+---------
>> >>  0       | dmp02    | 5444 | 2      | 0.333333  | primary
>> >>  1       | dmp03    | 5444 | 2      | 0.333333  | standby
>> >>  2       | dmp04    | 5444 | 2      | 0.333333  | standby
>> >>
>> >
>> >
>> >> Failover arguments when the user stopped DMP02.
>> >
>> >> 2013îN 2åé 8ì˙ ã‡ójì˙ 12:44:12 JST
>> >
>> > FAILED_NODE_ID =0
>> >> FAILED_NODE_NAME=dmp02
>> >> FAILED_NODE_PORT=5444
>> >> FAILED_NODE_DATA=/opt/PostgresPlus/9.2AS/data
>> >> NEW_MASTER_ID =1
>> >> OLD_MASTER_ID =0
>> >> NEW_MASTER_NAME =dmp03
>> >> OLD_PRIMARY_ID =0
>> >> NEW_MASTER_DATA =/opt/PostgresPlus/9.2AS/data
>> >> 0
>> >>
>> >
>> >
>> >> Then, these arguments told user that the master node has been changed
>> >
>> >> 2013îN 2åé 8ì˙ ã‡ójì˙ 12:44:13 JST
>> >> FAILED_NODE_ID =0
>> >> FAILED_NODE_NAME=dmp02
>> >> FAILED_NODE_PORT=5444
>> >> FAILED_NODE_DATA=/opt/PostgresPlus/9.2AS/data
>> >> NEW_MASTER_ID =1
>> >> OLD_MASTER_ID =0
>> >> NEW_MASTER_NAME =dmp03
>> >> OLD_PRIMARY_ID =0
>> >> NEW_MASTER_DATA =/opt/PostgresPlus/9.2AS/data
>> >> 2013îN 2åé 8ì˙ ã‡ójì˙ 12:44:17 JST
>> >> FAILED_NODE_ID =2
>> >> FAILED_NODE_NAME=dmp04
>> >> FAILED_NODE_PORT=5444
>> >> FAILED_NODE_DATA=/opt/PostgresPlus/9.2AS/data
>> >> NEW_MASTER_ID =1
>> >> OLD_MASTER_ID =0
>> >> NEW_MASTER_NAME =dmp02  *<--- *** The master name is unexpected ****
>> >> OLD_PRIMARY_ID =0
>> >> NEW_MASTER_DATA =/opt/PostgresPlus/9.2AS/data
>> >>
>> >
>> >
>> >>  node_id | hostname | port | status | lb_weight |  role
>> >> ---------+----------+------+--------+-----------+---------
>> >>  0       | dmp02    | 5444 | 2      | 0.333333  | standby
>> >>  1       | dmp03    | 5444 | 2      | 0.333333  | primary
>> >>  2       | dmp04    | 5444 | 2      | 0.333333  | standby
>> >>
>> >
>> >
>> >>   pid  | usesysid |   usename    | application_name |  client_addr  |
>> >> client_hostname | client_port |    backend_start                 |
>> state
>> >>   | sent_location | write_location | flush_location | replay_location |
>> >> sync_priority | sync_state
>> >>
>> >>
>> -------+----------+--------------+------------------+---------------+-----------------+-------------+----------------------------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
>> >>  20641 |       10 | enterprisedb | dmp02            | 172.24.217.23 |
>> >>             |       16948 | 08-FEB-13 17:39:44.078173 +09:00 |
>> streaming |
>> >> 14/C50000E0   | 14/C50000E0    | 14/C50000E0    | 14/C50000E0     |
>> >>     0 | async
>> >> (1 çs)
>> >>
>> >
>> > There is one pg_stat_replication record because dmp04 is connecting to
>> dmp02
>> >
>> > This phenomenon is reproduced by user. What we need to do is just make
>> the
>> > scripts like follow_master.sh and launch pgpool process with -F [path to
>> > pcp.conf].
>> >
>> > 1. Before failover
>> >
>> >> -bash-3.2$ psql -h 10.1.1.187 -p 9999 -c "show pool_nodes"
>> >> node_id | hostname | port | status | lb_weight | role
>> >> ---------+----------+------+--------+-----------+---------
>> >> 0 | slave1 | 5444 | 2 | 0.333333 | primary
>> >> 1 | slave2 | 5444 | 2 | 0.333333 | standby
>> >> 2 | slave3 | 5444 | 2 | 0.333333 | standby
>> >> (3 行)
>> >
>> >
>> > 2. We stopped slave 1 node with "pg_ctl -D /data -m immediate stop".
>> >
>> > 3. After failover
>> >
>> >> -bash-3.2$ psql -h 10.1.1.187 -p 9999 -c "show pool_nodes"
>> >> node_id | hostname | port | status | lb_weight | role
>> >> ---------+----------+------+--------+-----------+---------
>> >> 0 | slave1 | 5444 | 2 | 0.333333 | standby
>> >> 1 | slave2 | 5444 | 2 | 0.333333 | primary
>> >> 2 | slave3 | 5444 | 2 | 0.333333 | standby
>> >> (3 行)
>> >
>> >
>> > 4. Then user got output like below after failover that is not expected
>> that
>> > is present at the end of follow_master.sh.log i.e.
>> >
>> >> NEW_MASTER_NAME =slave1
>> >
>> >
>> > [root at master1 ~]# cat /tmp/failover.sh.log
>> >
>> >> 2013年 3月 13日 水曜日 11:36:28 JST
>> >> FAILED_NODE_ID =0
>> >> FAILED_NODE_NAME=slave1
>> >> FAILED_NODE_PORT=5444
>> >> FAILED_NODE_DATA=/opt/PostgresPlus/9.2AS/data
>> >> NEW_MASTER_ID =1
>> >> OLD_MASTER_ID =0
>> >> NEW_MASTER_NAME =slave2
>> >> OLD_PRIMARY_ID =0
>> >> NEW_MASTER_DATA =/opt/PostgresPlus/9.2AS/data
>> >> 0
>> >>
>> > [root at master1 ~]# cat /tmp/follow_master.sh.log
>> >
>> >> 2013年 3月 13日 水曜日 11:36:30 JST
>> >> FAILED_NODE_ID =0
>> >> FAILED_NODE_NAME=slave1
>> >> FAILED_NODE_PORT=5444
>> >> FAILED_NODE_DATA=/opt/PostgresPlus/9.2AS/data
>> >> NEW_MASTER_ID =1
>> >> OLD_MASTER_ID =0
>> >> NEW_MASTER_NAME =slave2
>> >> OLD_PRIMARY_ID =0
>> >> NEW_MASTER_DATA =/opt/PostgresPlus/9.2AS/data
>> >> 2013年 3月 13日 水曜日 11:36:32 JST
>> >> FAILED_NODE_ID =2
>> >> FAILED_NODE_NAME=slave3
>> >> FAILED_NODE_PORT=5444
>> >> FAILED_NODE_DATA=/opt/PostgresPlus/9.2AS/data
>> >> NEW_MASTER_ID =1
>> >> OLD_MASTER_ID =0
>> >> NEW_MASTER_NAME =*slave1
>> >> *OLD_PRIMARY_ID =0
>> >> NEW_MASTER_DATA =/opt/PostgresPlus/9.2AS/data
>> >
>> >
>> > Please guide. Thanks.
>> >
>> > Best Regards,
>> > Asif Naeem
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.sraoss.jp/pipermail/pgpool-hackers/attachments/20130429/c83f182a/attachment.html>
-------------- next part --------------
Along with the issue mentioned, As we had looked at this issue again like mentioned below, kindly give us your idea about it.
According to our analysis, it looked like follow_master_command was issued twice on this environment.
So what we want confirm at this time is how the follow_master_command work?
Our recognition so far is that the failover_command is issued once when the pgpool detect a fail node,
and then follow_master_command is issued once for new master node.
However according to the log mentioned below showed us was that the follow_master_command was issued twice.

---
↓The part of the log mentioned below should be done by SRHS_failover.sh

+ PG_CTL=/opt/PostgresPlus/9.2AS/bin/pg_ctl
+ FAILED_NODE_ID=0
+ FAILED_NODE_NAME=slave1
+ FAILED_NODE_PORT=5444
+ FAILED_NODE_DATA=/opt/PostgresPlus/9.2AS/data
+ NEW_MASTER_ID=1
+ OLD_MASTER_ID=0
+ NEW_MASTER_NAME=slave2
+ OLD_PRIMARY_ID=0
+ NEW_MASTER_DATA=/opt/PostgresPlus/9.2AS/data
+ TRIGGER=/tmp/trigger_file1
+ IDENTITY_FILE=/root/.ssh/id_rsa
+ '[' 0 = 0 ']'
++ date
+ echo $'2013\345\271\264' $'4\346\234\210' $'25\346\227\245' $'\346\234\250\346\233\234\346\227\245' 21:06:25 JST
+ echo FAILED_NODE_ID =0
+ echo FAILED_NODE_NAME=slave1
+ echo FAILED_NODE_PORT=5444
+ echo FAILED_NODE_DATA=/opt/PostgresPlus/9.2AS/data
+ echo NEW_MASTER_ID =1
+ echo OLD_MASTER_ID =0
+ echo NEW_MASTER_NAME =slave2
+ echo OLD_PRIMARY_ID =0
+ echo NEW_MASTER_DATA =/opt/PostgresPlus/9.2AS/data
+ ssh -T enterprisedb at slave2 /opt/PostgresPlus/9.2AS/bin/pg_ctl promote -D /opt/PostgresPlus/9.2AS/data
サーバを昇進中です。
+ echo 0
+ exit 0

↓ Then, the remaining log mentioned below should be done by SRHS_follow_master.sh

+ LOGFILE=/var/log/pgpool/follow_master.log
+ PCP_ATTACH_NODE=/opt/PostgresPlus/9.2AS/bin/pcp_attach_node
+ FOLLOW_CMD=/opt/PostgresPlus/9.2AS/data/SRHS_follow_sub.sh
+ PG_CTL=/opt/PostgresPlus/9.2AS/bin/pg_ctl
+ IDENTITY_FILE=/root/.ssh/id_rsa
+ TIMEOUT=10
+ HOST=localhost
+ PORT=9898
+ USERID=enterprisedb
+ PASSWD=redhat
+ FAILED_NODE_ID=0
+ FAILED_NODE_NAME=slave1
+ FAILED_NODE_PORT=5444
+ FAILED_NODE_DATA=/opt/PostgresPlus/9.2AS/data
+ NEW_MASTER_ID=1
+ OLD_MASTER_ID=0
+ NEW_MASTER_NAME=slave2
+ OLD_PRIMARY_ID=0
+ NEW_MASTER_DATA=/opt/PostgresPlus/9.2AS/data
++ date
+ echo $'2013\345\271\264' $'4\346\234\210' $'25\346\227\245' $'\346\234\250\346\233\234\346\227\245' 21:06:26 JST
+ echo FAILED_NODE_ID =0
+ echo FAILED_NODE_NAME=slave1
+ echo FAILED_NODE_PORT=5444
+ echo FAILED_NODE_DATA=/opt/PostgresPlus/9.2AS/data
+ echo NEW_MASTER_ID =1
+ echo OLD_MASTER_ID =0
+ echo NEW_MASTER_NAME =slave2
+ echo OLD_PRIMARY_ID =0
+ echo NEW_MASTER_DATA =/opt/PostgresPlus/9.2AS/data
+ NEW_MASTER_HOST=slave2
+ SLAVE_HOST=slave1
+ SLAVE_BASEDIR=/opt/PostgresPlus/9.2AS/data
+ NODE_ID=0
++ date '+%Y/%m/%d %H:%M:%S'
+ echo 2013/04/25 21:06:26 START '(/opt/PostgresPlus/9.2AS/data/SRHS_follow_master.sh' 0 slave1 5444 /opt/PostgresPlus/9.2AS/data 1 0 slave2 0 '/opt/PostgresPlus/9.2AS/data)'
+ ssh -T enterprisedb at slave2 /opt/PostgresPlus/9.2AS/bin/pg_ctl promote -D /opt/PostgresPlus/9.2AS/data
pg_ctl: サーバを昇進できません。サーバはスタンバイモードではありません。
++ ssh -i /root/.ssh/id_rsa enterprisedb at slave1 /opt/PostgresPlus/9.2AS/data/SRHS_follow_sub.sh slave2 slave1 /opt/PostgresPlus/9.2AS/data
+ MSG='/opt/PostgresPlus/9.2AS/data/SRHS_follow_sub.sh: line 22: /var/log/pgpool/follow_master.log: そのようなファイルやディレクトリはありません
touch: cannot touch `SRHS_follow_sub.sh.lock'\'': 許可がありません
/opt/PostgresPlus/9.2AS/data/SRHS_follow_sub.sh: line 65: /var/log/pgpool/follow_master.log: そのようなファイルやディ レクトリはありません
サーバの起動完了を待っています....完了
サーバ起動完了
/opt/PostgresPlus/9.2AS/data/SRHS_follow_sub.sh: line 75: /var/log/pgpool/follow_master.log: そのようなファイルやディ レクトリはありません'
+ RC=0
+ '[' 0 '!=' 0 ']'
++ date '+%Y/%m/%d %H:%M:%S'
+ echo 2013/04/25 21:06:28 PostgreSQL 'Restarted(slave1)'
+ /opt/PostgresPlus/9.2AS/bin/pcp_attach_node 10 localhost 9898 enterprisedb redhat 0
+ RC=0
+ '[' 0 '!=' 0 ']'
++ date '+%Y/%m/%d %H:%M:%S'
+ echo 2013/04/25 21:06:28 END '(/opt/PostgresPlus/9.2AS/data/SRHS_follow_master.sh)'
+ exit 0

↓ Besides, the log mentioned below is looked like done by SRHS_follow_master.sh again.

+ LOGFILE=/var/log/pgpool/follow_master.log
+ PCP_ATTACH_NODE=/opt/PostgresPlus/9.2AS/bin/pcp_attach_node
+ FOLLOW_CMD=/opt/PostgresPlus/9.2AS/data/SRHS_follow_sub.sh
+ PG_CTL=/opt/PostgresPlus/9.2AS/bin/pg_ctl
+ IDENTITY_FILE=/root/.ssh/id_rsa
+ TIMEOUT=10
+ HOST=localhost
+ PORT=9898
+ USERID=enterprisedb
+ PASSWD=redhat
+ FAILED_NODE_ID=2
+ FAILED_NODE_NAME=slave3
+ FAILED_NODE_PORT=5444
+ FAILED_NODE_DATA=/opt/PostgresPlus/9.2AS/data
+ NEW_MASTER_ID=1
+ OLD_MASTER_ID=0
+ NEW_MASTER_NAME=slave1
+ OLD_PRIMARY_ID=0
+ NEW_MASTER_DATA=/opt/PostgresPlus/9.2AS/data
++ date
+ echo $'2013\345\271\264' $'4\346\234\210' $'25\346\227\245' $'\346\234\250\346\233\234\346\227\245' 21:06:28 JST
+ echo FAILED_NODE_ID =2
+ echo FAILED_NODE_NAME=slave3
+ echo FAILED_NODE_PORT=5444
+ echo FAILED_NODE_DATA=/opt/PostgresPlus/9.2AS/data
+ echo NEW_MASTER_ID =1
+ echo OLD_MASTER_ID =0
+ echo NEW_MASTER_NAME =slave1
+ echo OLD_PRIMARY_ID =0
+ echo NEW_MASTER_DATA =/opt/PostgresPlus/9.2AS/data
+ NEW_MASTER_HOST=slave1
+ SLAVE_HOST=slave3
+ SLAVE_BASEDIR=/opt/PostgresPlus/9.2AS/data
+ NODE_ID=2
++ date '+%Y/%m/%d %H:%M:%S'
+ echo 2013/04/25 21:06:28 START '(/opt/PostgresPlus/9.2AS/data/SRHS_follow_master.sh' 2 slave3 5444 /opt/PostgresPlus/9.2AS/data 1 0 slave1 0 '/opt/PostgresPlus/9.2AS/data)'
+ ssh -T enterprisedb at slave1 /opt/PostgresPlus/9.2AS/bin/pg_ctl promote -D /opt/PostgresPlus/9.2AS/data
サーバを昇進中です。
++ ssh -i /root/.ssh/id_rsa enterprisedb at slave3 /opt/PostgresPlus/9.2AS/data/SRHS_follow_sub.sh slave1 slave3 /opt/PostgresPlus/9.2AS/data
+ MSG='/opt/PostgresPlus/9.2AS/data/SRHS_follow_sub.sh: line 22: /var/log/pgpool/follow_master.log: そのようなファイルやディレクトリはありません
touch: cannot touch `SRHS_follow_sub.sh.lock'\'': 許可がありません
/opt/PostgresPlus/9.2AS/data/SRHS_follow_sub.sh: line 65: /var/log/pgpool/follow_master.log: そのようなファイルやディ レクトリはありません
サーバの起動完了を待っています....完了
サーバ起動完了
/opt/PostgresPlus/9.2AS/data/SRHS_follow_sub.sh: line 75: /var/log/pgpool/follow_master.log: そのようなファイルやディ レクトリはありません'
+ RC=0
+ '[' 0 '!=' 0 ']'
++ date '+%Y/%m/%d %H:%M:%S'
+ echo 2013/04/25 21:06:30 PostgreSQL 'Restarted(slave3)'
+ /opt/PostgresPlus/9.2AS/bin/pcp_attach_node 10 localhost 9898 enterprisedb redhat 2
+ RC=0
+ '[' 0 '!=' 0 ']'
++ date '+%Y/%m/%d %H:%M:%S'
+ echo 2013/04/25 21:06:30 END '(/opt/PostgresPlus/9.2AS/data/SRHS_follow_master.sh)'
+ exit 0
---


More information about the pgpool-hackers mailing list