[Pgpool-general] online recovery of primary is possible or not?
Tatsuo Ishii
ishii at sraoss.co.jp
Tue Nov 23 10:12:42 UTC 2010
Ok, now your problem is the standby does not become online.
Add:
log_min_messages = debug5
into your postgresql.conf may help you.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
> Because they give me problems, I do not remember exactly what
> problems, I made 1000 experiments!
>
> Anyway, I have restored pg_start_bacup and pg_stop_backupand without
> comments and relaunched everything.
>
> I have killed the primary and next I've tried to re-took it up as
> standby with:
>
> /usr/lib/pgpool-II/3.0.1/bin/pcp_recovery_node 20 localhost 9898 user
> password 0
>
> Here is my logs:
>
> PGpool:
> 2010-11-23 10:05:16 LOG: pid 12955: Backend status file
> /var/log/pgpool/pgpool_status does not exist
> 2010-11-23 10:05:17 LOG: pid 12955: pgpool-II successfully
> started. version 3.0.1 (umiyameboshi)
> 2010-11-23 10:05:47 ERROR: pid 12990: pool_read: EOF encountered with
> backend
> 2010-11-23 10:05:47 ERROR: pid 12990: do_query: error while reading
> message kind
> 2010-11-23 10:05:47 ERROR: pid 12990: check_replication_time_lag:
> SELECT pg_current_xlog_location() failed
> 2010-11-23 10:05:47 ERROR: pid 12955: connect_inet_domain_socket:
> connect() failed: Connection refused
> 2010-11-23 10:05:47 ERROR: pid 12955: health check failed. 0 th host
> 192.168.0.128 at port 5433 is down
> 2010-11-23 10:05:47 LOG: pid 12955: set 0 th backend down status
> 2010-11-23 10:05:47 LOG: pid 12955: starting degeneration. shutdown
> host 192.168.0.128(5433)
> 2010-11-23 10:05:47 LOG: pid 12955: execute command:
> /usr/lib/pgpool-II/3.0.1/bin/failover.sh 192.168.0.128 192.168.0.127
> /tmp/trigger_file0
> 2010-11-23 10:05:52 LOG: pid 12955: failover_handler: set new master
> node: 1
> 2010-11-23 10:05:52 LOG: pid 12955: failover done. shutdown host
> 192.168.0.128(5433)
> 2010-11-23 10:05:53 LOG: pid 12989: starting recovering node 0
> 2010-11-23 10:05:53 LOG: pid 12989: starting recovery command: "SELECT
> pgpool_recovery('basebackup.sh', '192.168.0.128',
> '/home/database/9.0.1/data')"
> 2010-11-23 10:06:22 LOG: pid 12989: 1st stage is done
> 2010-11-23 10:06:27 LOG: pid 12989: check_postmaster_started: try to
> connect to postmaster on hostname:192.168.0.128 database:postgres
> user:postgis (retry 0 times)
> 2010-11-23 10:06:27 LOG: pid 12989: check_postmaster_started: failed
> to connect to postmaster on hostname:192.168.0.128 database:postgres
> user:postgis
> 2010-11-23 10:06:30 LOG: pid 12989: check_postmaster_started: try to
> connect to postmaster on hostname:192.168.0.128 database:postgres
> user:postgis (retry 1 times)
> 2010-11-23 10:06:30 LOG: pid 12989: check_postmaster_started: failed
> to connect to postmaster on hostname:192.168.0.128 database:postgres
> user:postgis
> 2010-11-23 10:06:33 LOG: pid 12989: check_postmaster_started: try to
> connect to postmaster on hostname:192.168.0.128 database:postgres
> user:postgis (retry 2 times)
> ...
>
> Ex primary now standby:
> LOG: database system was interrupted; last known up at 2010-11-23
> 10:05:24 CET
> LOG: entering standby mode
> LOG: streaming replication successfully connected to primary
> LOG: redo starts at 3/3D000020
> LOG: consistent recovery state reached at 3/3E000000
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> LOG: received smart shutdown request
> FATAL: terminating walreceiver process due to administrator command
> LOG: shutting down
>
> Ex standby, now primary:
> LOG: trigger file found: /tmp/trigger_file0
> LOG: redo is not required
> LOG: selected new timeline ID: 31
> LOG: archive recovery complete
> LOG: database system is ready to accept connections
> + CLUSTER=/home/database/9.0.1/data
> + PSQL=/usr/lib/postgresql/9.0.1/bin/psql
> + PORT=5433
> + MASTER_DB_CLUSTER=/home/database/9.0.1/data
> + RECOVERY_NODE_HOSTNAME=192.168.0.128
> + RECOVERY_DB_CLUSTER=/home/database/9.0.1/data
> + find -type f -name recovery.conf
> + FIND_RECOVERY_CONF=
> + find -type f -name recovery.done
> + FIND_RECOVERY_DONE=./recovery.done
> + echo 192.168.0.128
> + cut -d . -f 4
> + RECOVERY_NODE_HOST=128
> + /sbin/ifconfig eth0
> + grep Bcast
> + cut -d : -f 2
> + awk { print $1 }
> + cut -d . -f 4
> + ACTUAL_NODE=127
> + [ 127 -eq 128 ]
> + ACTUAL_NODE_ID=1
> + RECOVERY_HOST=192.168.0.127
> + [ ! -z ]
> + [ ! -z ./recovery.done ]
> + /usr/lib/postgresql/9.0.1/bin/psql -p 5433 -c SELECT
> pg_start_backup('Streaming Replication', true) postgres
> pg_start_backup
> -----------------
> 3/3D000020
> (1 row)
>
> + RECOVERY_CONF=standby_mode = 'on'\nprimary_conninfo =
> 'host=192.168.0.127 port=5433 user=postgis
> password=gispost'\ntrigger_file = '/tmp/trigger_file1'\n
> + echo standby_mode = 'on'\nprimary_conninfo = 'host=192.168.0.127
> port=5433 user=postgis password=gispost'\ntrigger_file =
> '/tmp/trigger_file1'\n
> + rsync -C -a -c --delete --exclude postgresql.conf --exclude
> postmaster.pid --exclude postmaster.opts --exclude pg_log --exclude
> pg_hba.conf --exclude recovery.conf --exclude recovery.done --exclude
> remote.recovery.conf --exclude pg_xlog /home/database/9.0.1/data/
> 192.168.0.128:/home/database/9.0.1/data/
> + /usr/bin/ssh -T postgres at 192.168.0.128 /bin/mkdir -p
> /home/database/9.0.1/data/pg_xlog
> + /bin/chmod 700 /home/database/9.0.1/data/pg_xlog /bin/rm -f
> /home/database/9.0.1/data/recovery.done
> + scp /home/database/9.0.1/data/remote.recovery.conf
> postgres at 192.168.0.128:/home/database/9.0.1/data/recovery.conf
> + /bin/rm -f /home/database/9.0.1/data/recovery.done
> /home/database/9.0.1/data/remote.recovery.conf
> + /usr/lib/postgresql/9.0.1/bin/psql -p 5433 -c SELECT
> pg_stop_backup() postgres
> NOTICE: WAL archiving is not enabled; you must ensure that all
> required WAL segments are copied through other means to complete the
> backup
> pg_stop_backup
> ----------------
> 3/3D0000C4
> (1 row)
>
> + exit 0
> pgpool_remote_start 192.168.0.128 /home/database/9.0.1/data
> server starting
> LOG: replication connection authorized: user=postgis
> host=192.168.0.128 port=34861
>
>> It seems pg_start_bacup and pg_stop_backup are commented out in your
>> script. Why?
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>
>>> I have changed my basebackup.sh file a lot to get it work on 3
>>> different machines...
>>> Here it is:
>>>
>>> #!/bin/sh -x
>>> # It prints on postgresql log file
>>> # Execution example: ./basebackup.sh /home/database/9.0.1/data
>>> # 192.168.0.127 /home/database/9.0.1/data
>>>
>>> # At the beginning: 192.168.0.128 --> id 0 --> primary; 192.168.0.127
>>> # --> id 1 --> standby
>>>
>>> # Postgres data files on both machines (source and destination)
>>> CLUSTER="/home/database/9.0.1/data"
>>> PSQL="/usr/lib/postgresql/9.0.1/bin/psql"
>>> PORT="5433"
>>>
>>> MASTER_DB_CLUSTER=${1}
>>> RECOVERY_NODE_HOSTNAME=${2}
>>> RECOVERY_DB_CLUSTER=${3}
>>> FIND_RECOVERY_CONF=`find -type f -name recovery.conf`
>>> FIND_RECOVERY_DONE=`find -type f -name recovery.done`
>>>
>>> # Only last part of ip address (192.168.0.127 --> 127)
>>> RECOVERY_NODE_HOST=`echo "${RECOVERY_NODE_HOSTNAME}" | cut -d '.' -f
>>> 4`
>>> ACTUAL_NODE=`/sbin/ifconfig eth0 | grep 'Bcast' | cut -d ':' -f 2 |
>>> awk '{ print $1 }' | cut -d '.' -f 4`
>>>
>>> if [ ${ACTUAL_NODE} -eq "128" ] ; then
>>> ACTUAL_NODE_ID="0"
>>> RECOVERY_HOST="192.168.0.128"
>>> else
>>> ACTUAL_NODE_ID="1"
>>> RECOVERY_HOST="192.168.0.127"
>>> fi
>>>
>>> # If recovery.conf or recovery.done file is found, it means that the
>>> # primary has been crashed
>>> if [ ! -z ${FIND_RECOVERY_CONF} ] || [ ! -z ${FIND_RECOVERY_DONE} ] ;
>>> then
>>> #${PSQL} -p ${PORT} -c "SELECT pg_start_backup('Streaming Replication',
>>> #true)" postgres
>>>
>>> RECOVERY_CONF="standby_mode = 'on'\nprimary_conninfo =
>>> 'host=${RECOVERY_HOST} port=${PORT} user=postgis
>>> password=gispost'\ntrigger_file =
>>> '/tmp/trigger_file${ACTUAL_NODE_ID}'\n"
>>> echo "${RECOVERY_CONF}"> ${CLUSTER}/remote.recovery.conf
>>>
>>> rsync -C -a -c --delete --exclude postgresql.conf --exclude
>>> postmaster.pid \
>>> --exclude postmaster.opts --exclude pg_log --exclude pg_hba.conf\
>>> --exclude recovery.conf --exclude recovery.done --exclude
>>> --remote.recovery.conf \
>>> --exclude pg_xlog \
>>> ${CLUSTER}/ ${RECOVERY_NODE_HOSTNAME}:${CLUSTER}/&& \
>>>
>>> # Everything performed on the remote machine
>>> /usr/bin/ssh -T postgres@${RECOVERY_NODE_HOSTNAME} /bin/mkdir -p
>>> ${CLUSTER}/pg_xlog&& \
>>> /bin/chmod 700 ${CLUSTER}/pg_xlog \
>>> /bin/rm -f ${CLUSTER}/recovery.done
>>>
>>> scp ${CLUSTER}/remote.recovery.conf
>>> postgres@${RECOVERY_NODE_HOSTNAME}:${CLUSTER}/recovery.conf
>>>
>>> # Delete recovery.done and tmp.recovery.conf locally
>>> /bin/rm -f ${CLUSTER}/recovery.done ${CLUSTER}/remote.recovery.conf
>>>
>>> #${PSQL} -p ${PORT} -c "SELECT pg_stop_backup()" postgres
>>>
>>> exit 0
>>> # If recovery.done or recovery.conf file is not found, it means that the
>>> # standby has crashed
>>> else
>>> exit 0
>>> fi
>>>
>>> And this is my pgpool_remote_start:
>>>
>>> if [ ${#} -ne 2 ] ; then
>>> echo "pgpool_remote_start remote_host remote_datadir"
>>> exit 1
>>> fi
>>>
>>> DEST=${1}
>>> DESTDIR=${2}
>>> PGCTL="/usr/lib/postgresql/9.0.1/bin/pg_ctl"
>>> LOGFILE="/home/database/9.0.1/postgresql-9.0.1.log"
>>>
>>> echo "pgpool_remote_start ${DEST} ${DESTDIR}"
>>> ssh -T postgres@${DEST} ${PGCTL} -D ${DESTDIR} -l ${LOGFILE} start
>>>
>>> Where am I wrong?
>>>
>>> Thanks in advance,
>>>
>>> Alessandro
>>>
>>>>> I've read here:
>>>>> http://pgpool.projects.postgresql.org/contrib_docs/simple_sr_setting/index.html
>>>>> that online recovery with streaming replication is possible.
>>>>> But in the official doc it is said that only with standby it's
>>>>> possible to
>>>>> perform online recovery
>>>>> (http://pgpool.projects.postgresql.org/pgpool-II/doc/pgpool-en.html#master_slave_mode)...who
>>>>> is right?
>>>> Both are correct. In the former, here is the supposed scenario:
>>>>
>>>> 1) primary goes down
>>>> 2) standby takes over and becomes primary
>>>> 3) By using online recovery, the downed former primary could be online
>>>> as a new standby.
>>>>
>>>>> I've tried to configure pgpool to perform an online recovery anyway,
>>>>> as
>>>>> described in the first link, but without success. Unlike that tutorial
>>>>> I have
>>>>> three separated machines which are primary, standby and pgpool.
>>>>>
>>>>> Everything works fine but at the very last step, postgres on the
>>>>> ex-primary (now
>>>>> standby) says:
>>>>> LOG: entering standby mode
>>>>> LOG: streaming replication successfully connected to primary
>>>>> LOG: unexpected timeline ID 24 in log file 3, segment 60, offset 0
>>>>> FATAL: terminating walreceiver process due to administrator command
>>>>> LOG: unexpected timeline ID 24 in log file 3, segment 60, offset 0
>>>> As far as I can tell, it seems the base backup from primary to standby
>>>> was not done properly. What is your baseback.sh script exactly?
>>>> --
>>>> Tatsuo Ishii
>>>> SRA OSS, Inc. Japan
>>>> English: http://www.sraoss.co.jp/index_en.php
>>>> Japanese: http://www.sraoss.co.jp
>>>
>>> --
>>> Alessandro Candini
>>>
>>> MEEO S.r.l.
>>> Via Saragat 9
>>> I-44122 Ferrara, Italy
>>> Tel: +39 0532 1861501
>>> Fax: +39 0532 1861637
>>> http://www.meeo.it
>>>
>>> ========================================
>>> "ATTENZIONE:le informazioni contenute in questo messaggio sono
>>> da considerarsi confidenziali ed il loro utilizzo è riservato
>>> unicamente
>>> al destinatario sopra indicato. Chi dovesse ricevere questo messaggio
>>> per errore è tenuto ad informare il mittente ed a rimuoverlo
>>> definitivamente da ogni supporto elettronico o cartaceo."
>>>
>>> "WARNING:This message contains confidential and/or proprietary
>>> information which may be subject to privilege or immunity and which
>>> is intended for use of its addressee only. Should you receive this
>>> message in error, you are kindly requested to inform the sender and
>>> to definitively remove it from any paper or electronic format."
>>>
>
>
> --
> Alessandro Candini
>
> MEEO S.r.l.
> Via Saragat 9
> I-44122 Ferrara, Italy
> Tel: +39 0532 1861501
> Fax: +39 0532 1861637
> http://www.meeo.it
>
> ========================================
> "ATTENZIONE:le informazioni contenute in questo messaggio sono
> da considerarsi confidenziali ed il loro utilizzo è riservato
> unicamente
> al destinatario sopra indicato. Chi dovesse ricevere questo messaggio
> per errore è tenuto ad informare il mittente ed a rimuoverlo
> definitivamente da ogni supporto elettronico o cartaceo."
>
> "WARNING:This message contains confidential and/or proprietary
> information which may be subject to privilege or immunity and which
> is intended for use of its addressee only. Should you receive this
> message in error, you are kindly requested to inform the sender and
> to definitively remove it from any paper or electronic format."
>
More information about the Pgpool-general
mailing list