[Pgpool-general] online recovery of primary is possible or not?

Tue Nov 23 10:12:42 UTC 2010

Ok, now your problem is the standby does not become online.

Add:
log_min_messages = debug5
into your postgresql.conf may help you.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Because they give me problems, I do not remember exactly what
> problems, I made 1000 experiments!
> 
> Anyway, I have restored pg_start_bacup and pg_stop_backupand without
> comments and relaunched everything.
> 
> I have killed the primary and next I've tried to re-took it up as
> standby with:
> 
> /usr/lib/pgpool-II/3.0.1/bin/pcp_recovery_node 20 localhost 9898 user
> password 0
> 
> Here is my logs:
> 
> PGpool:
> 2010-11-23 10:05:16 LOG: pid 12955: Backend status file
> /var/log/pgpool/pgpool_status does not exist
> 2010-11-23 10:05:17 LOG: pid 12955: pgpool-II successfully
> started. version 3.0.1 (umiyameboshi)
> 2010-11-23 10:05:47 ERROR: pid 12990: pool_read: EOF encountered with
> backend
> 2010-11-23 10:05:47 ERROR: pid 12990: do_query: error while reading
> message kind
> 2010-11-23 10:05:47 ERROR: pid 12990: check_replication_time_lag:
> SELECT pg_current_xlog_location() failed
> 2010-11-23 10:05:47 ERROR: pid 12955: connect_inet_domain_socket:
> connect() failed: Connection refused
> 2010-11-23 10:05:47 ERROR: pid 12955: health check failed. 0 th host
> 192.168.0.128 at port 5433 is down
> 2010-11-23 10:05:47 LOG: pid 12955: set 0 th backend down status
> 2010-11-23 10:05:47 LOG: pid 12955: starting degeneration. shutdown
> host 192.168.0.128(5433)
> 2010-11-23 10:05:47 LOG: pid 12955: execute command:
> /usr/lib/pgpool-II/3.0.1/bin/failover.sh 192.168.0.128 192.168.0.127
> /tmp/trigger_file0
> 2010-11-23 10:05:52 LOG: pid 12955: failover_handler: set new master
> node: 1
> 2010-11-23 10:05:52 LOG: pid 12955: failover done. shutdown host
> 192.168.0.128(5433)
> 2010-11-23 10:05:53 LOG: pid 12989: starting recovering node 0
> 2010-11-23 10:05:53 LOG: pid 12989: starting recovery command: "SELECT
> pgpool_recovery('basebackup.sh', '192.168.0.128',
> '/home/database/9.0.1/data')"
> 2010-11-23 10:06:22 LOG: pid 12989: 1st stage is done
> 2010-11-23 10:06:27 LOG: pid 12989: check_postmaster_started: try to
> connect to postmaster on hostname:192.168.0.128 database:postgres
> user:postgis (retry 0 times)
> 2010-11-23 10:06:27 LOG: pid 12989: check_postmaster_started: failed
> to connect to postmaster on hostname:192.168.0.128 database:postgres
> user:postgis
> 2010-11-23 10:06:30 LOG: pid 12989: check_postmaster_started: try to
> connect to postmaster on hostname:192.168.0.128 database:postgres
> user:postgis (retry 1 times)
> 2010-11-23 10:06:30 LOG: pid 12989: check_postmaster_started: failed
> to connect to postmaster on hostname:192.168.0.128 database:postgres
> user:postgis
> 2010-11-23 10:06:33 LOG: pid 12989: check_postmaster_started: try to
> connect to postmaster on hostname:192.168.0.128 database:postgres
> user:postgis (retry 2 times)
> ...
> 
> Ex primary now standby:
> LOG: database system was interrupted; last known up at 2010-11-23
> 10:05:24 CET
> LOG: entering standby mode
> LOG: streaming replication successfully connected to primary
> LOG: redo starts at 3/3D000020
> LOG: consistent recovery state reached at 3/3E000000
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> LOG: received smart shutdown request
> FATAL: terminating walreceiver process due to administrator command
> LOG: shutting down
> 
> Ex standby, now primary:
> LOG: trigger file found: /tmp/trigger_file0
> LOG: redo is not required
> LOG: selected new timeline ID: 31
> LOG: archive recovery complete
> LOG: database system is ready to accept connections
> + CLUSTER=/home/database/9.0.1/data
> + PSQL=/usr/lib/postgresql/9.0.1/bin/psql
> + PORT=5433
> + MASTER_DB_CLUSTER=/home/database/9.0.1/data
> + RECOVERY_NODE_HOSTNAME=192.168.0.128
> + RECOVERY_DB_CLUSTER=/home/database/9.0.1/data
> + find -type f -name recovery.conf
> + FIND_RECOVERY_CONF=
> + find -type f -name recovery.done
> + FIND_RECOVERY_DONE=./recovery.done
> + echo 192.168.0.128
> + cut -d . -f 4
> + RECOVERY_NODE_HOST=128
> + /sbin/ifconfig eth0
> + grep Bcast
> + cut -d : -f 2
> + awk { print $1 }
> + cut -d . -f 4
> + ACTUAL_NODE=127
> + [ 127 -eq 128 ]
> + ACTUAL_NODE_ID=1
> + RECOVERY_HOST=192.168.0.127
> + [ ! -z ]
> + [ ! -z ./recovery.done ]
> + /usr/lib/postgresql/9.0.1/bin/psql -p 5433 -c SELECT
> pg_start_backup('Streaming Replication', true) postgres
> pg_start_backup
> -----------------
> 3/3D000020
> (1 row)
> 
> + RECOVERY_CONF=standby_mode = 'on'\nprimary_conninfo =
> 'host=192.168.0.127 port=5433 user=postgis
> password=gispost'\ntrigger_file = '/tmp/trigger_file1'\n
> + echo standby_mode = 'on'\nprimary_conninfo = 'host=192.168.0.127
> port=5433 user=postgis password=gispost'\ntrigger_file =
> '/tmp/trigger_file1'\n
> + rsync -C -a -c --delete --exclude postgresql.conf --exclude
> postmaster.pid --exclude postmaster.opts --exclude pg_log --exclude
> pg_hba.conf --exclude recovery.conf --exclude recovery.done --exclude
> remote.recovery.conf --exclude pg_xlog /home/database/9.0.1/data/
> 192.168.0.128:/home/database/9.0.1/data/
> + /usr/bin/ssh -T postgres at 192.168.0.128 /bin/mkdir -p
> /home/database/9.0.1/data/pg_xlog
> + /bin/chmod 700 /home/database/9.0.1/data/pg_xlog /bin/rm -f
> /home/database/9.0.1/data/recovery.done
> + scp /home/database/9.0.1/data/remote.recovery.conf
> postgres at 192.168.0.128:/home/database/9.0.1/data/recovery.conf
> + /bin/rm -f /home/database/9.0.1/data/recovery.done
> /home/database/9.0.1/data/remote.recovery.conf
> + /usr/lib/postgresql/9.0.1/bin/psql -p 5433 -c SELECT
> pg_stop_backup() postgres
> NOTICE: WAL archiving is not enabled; you must ensure that all
> required WAL segments are copied through other means to complete the
> backup
> pg_stop_backup
> ----------------
> 3/3D0000C4
> (1 row)
> 
> + exit 0
> pgpool_remote_start 192.168.0.128 /home/database/9.0.1/data
> server starting
> LOG: replication connection authorized: user=postgis
> host=192.168.0.128 port=34861
> 
>> It seems pg_start_bacup and pg_stop_backup are commented out in your
>> script. Why?
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>
>>> I have changed my basebackup.sh file a lot to get it work on 3
>>> different machines...
>>> Here it is:
>>>
>>> #!/bin/sh -x
>>> # It prints on postgresql log file
>>> # Execution example: ./basebackup.sh /home/database/9.0.1/data
>>> # 192.168.0.127 /home/database/9.0.1/data
>>>
>>> # At the beginning: 192.168.0.128 -->  id 0 -->  primary; 192.168.0.127
>>> # -->  id 1 -->  standby
>>>
>>> # Postgres data files on both machines (source and destination)
>>> CLUSTER="/home/database/9.0.1/data"
>>> PSQL="/usr/lib/postgresql/9.0.1/bin/psql"
>>> PORT="5433"
>>>
>>> MASTER_DB_CLUSTER=${1}
>>> RECOVERY_NODE_HOSTNAME=${2}
>>> RECOVERY_DB_CLUSTER=${3}
>>> FIND_RECOVERY_CONF=`find -type f -name recovery.conf`
>>> FIND_RECOVERY_DONE=`find -type f -name recovery.done`
>>>
>>> # Only last part of ip address (192.168.0.127 -->  127)
>>> RECOVERY_NODE_HOST=`echo "${RECOVERY_NODE_HOSTNAME}" | cut -d '.' -f
>>> 4`
>>> ACTUAL_NODE=`/sbin/ifconfig eth0 | grep 'Bcast' | cut -d ':' -f 2 |
>>> awk '{ print $1 }' | cut -d '.' -f 4`
>>>
>>> if [ ${ACTUAL_NODE} -eq "128" ] ; then
>>>      ACTUAL_NODE_ID="0"
>>>      RECOVERY_HOST="192.168.0.128"
>>> else
>>>      ACTUAL_NODE_ID="1"
>>>      RECOVERY_HOST="192.168.0.127"
>>> fi
>>>
>>> # If recovery.conf or recovery.done file is found, it means that the
>>> # primary has been crashed
>>> if [ ! -z ${FIND_RECOVERY_CONF} ] || [ ! -z ${FIND_RECOVERY_DONE} ] ;
>>> then
>>>      #${PSQL} -p ${PORT} -c "SELECT pg_start_backup('Streaming Replication',
>>>      #true)" postgres
>>>
>>>      RECOVERY_CONF="standby_mode = 'on'\nprimary_conninfo =
>>>      'host=${RECOVERY_HOST} port=${PORT} user=postgis
>>>      password=gispost'\ntrigger_file =
>>>      '/tmp/trigger_file${ACTUAL_NODE_ID}'\n"
>>>      echo "${RECOVERY_CONF}">  ${CLUSTER}/remote.recovery.conf
>>>
>>>      rsync -C -a -c --delete --exclude postgresql.conf --exclude
>>>      postmaster.pid \
>>>      --exclude postmaster.opts --exclude pg_log --exclude pg_hba.conf\
>>>      --exclude recovery.conf --exclude recovery.done --exclude
>>>      --remote.recovery.conf \
>>>      --exclude pg_xlog \
>>>      ${CLUSTER}/ ${RECOVERY_NODE_HOSTNAME}:${CLUSTER}/&&  \
>>>
>>>      # Everything performed on the remote machine
>>>      /usr/bin/ssh -T postgres@${RECOVERY_NODE_HOSTNAME} /bin/mkdir -p
>>>      ${CLUSTER}/pg_xlog&&  \
>>>      /bin/chmod 700 ${CLUSTER}/pg_xlog \
>>>      /bin/rm -f ${CLUSTER}/recovery.done
>>>
>>>      scp ${CLUSTER}/remote.recovery.conf
>>>      postgres@${RECOVERY_NODE_HOSTNAME}:${CLUSTER}/recovery.conf
>>>
>>>      # Delete recovery.done and tmp.recovery.conf locally
>>>      /bin/rm -f ${CLUSTER}/recovery.done ${CLUSTER}/remote.recovery.conf
>>>
>>>      #${PSQL} -p ${PORT} -c "SELECT pg_stop_backup()" postgres
>>>
>>>      exit 0
>>> # If recovery.done or recovery.conf file is not found, it means that the
>>> # standby has crashed
>>> else
>>>      exit 0
>>> fi
>>>
>>> And this is my pgpool_remote_start:
>>>
>>> if [ ${#} -ne 2 ] ; then
>>>      echo "pgpool_remote_start remote_host remote_datadir"
>>>      exit 1
>>> fi
>>>
>>> DEST=${1}
>>> DESTDIR=${2}
>>> PGCTL="/usr/lib/postgresql/9.0.1/bin/pg_ctl"
>>> LOGFILE="/home/database/9.0.1/postgresql-9.0.1.log"
>>>
>>> echo "pgpool_remote_start ${DEST} ${DESTDIR}"
>>> ssh -T postgres@${DEST} ${PGCTL} -D ${DESTDIR} -l ${LOGFILE} start
>>>
>>> Where am I wrong?
>>>
>>> Thanks in advance,
>>>
>>> Alessandro
>>>
>>>>> I've read here:
>>>>> http://pgpool.projects.postgresql.org/contrib_docs/simple_sr_setting/index.html
>>>>> that online recovery with streaming replication is possible.
>>>>> But in the official doc it is said that only with standby it's
>>>>> possible to
>>>>> perform online recovery
>>>>> (http://pgpool.projects.postgresql.org/pgpool-II/doc/pgpool-en.html#master_slave_mode)...who
>>>>> is right?
>>>> Both are correct. In the former, here is the supposed scenario:
>>>>
>>>> 1) primary goes down
>>>> 2) standby takes over and becomes primary
>>>> 3) By using online recovery, the downed former primary could be online
>>>>      as a new standby.
>>>>
>>>>> I've tried to configure pgpool to perform an online recovery anyway,
>>>>> as
>>>>> described in the first link, but without success. Unlike that tutorial
>>>>> I have
>>>>> three separated machines which are primary, standby and pgpool.
>>>>>
>>>>> Everything works fine but at the very last step, postgres on the
>>>>> ex-primary (now
>>>>> standby) says:
>>>>> LOG:  entering standby mode
>>>>> LOG:  streaming replication successfully connected to primary
>>>>> LOG:  unexpected timeline ID 24 in log file 3, segment 60, offset 0
>>>>> FATAL:  terminating walreceiver process due to administrator command
>>>>> LOG:  unexpected timeline ID 24 in log file 3, segment 60, offset 0
>>>> As far as I can tell, it seems the base backup from primary to standby
>>>> was not done properly. What is your baseback.sh script exactly?
>>>> --
>>>> Tatsuo Ishii
>>>> SRA OSS, Inc. Japan
>>>> English: http://www.sraoss.co.jp/index_en.php
>>>> Japanese: http://www.sraoss.co.jp
>>>
>>> -- 
>>> Alessandro Candini
>>>
>>> MEEO S.r.l.
>>> Via Saragat 9
>>> I-44122 Ferrara, Italy
>>> Tel: +39 0532 1861501
>>> Fax: +39 0532 1861637
>>> http://www.meeo.it
>>>
>>> ========================================
>>> "ATTENZIONE:le informazioni contenute in questo messaggio sono
>>> da considerarsi confidenziali ed il loro utilizzo è riservato
>>> unicamente
>>> al destinatario sopra indicato. Chi dovesse ricevere questo messaggio
>>> per errore è tenuto ad informare il mittente ed a rimuoverlo
>>> definitivamente da ogni supporto elettronico o cartaceo."
>>>
>>> "WARNING:This message contains confidential and/or proprietary
>>> information which may be subject to privilege or immunity and which
>>> is intended for use of its addressee only. Should you receive this
>>> message in error, you are kindly requested to inform the sender and
>>> to definitively remove it from any paper or electronic format."
>>>
> 
> 
> -- 
> Alessandro Candini
> 
> MEEO S.r.l.
> Via Saragat 9
> I-44122 Ferrara, Italy
> Tel: +39 0532 1861501
> Fax: +39 0532 1861637
> http://www.meeo.it
> 
> ========================================
> "ATTENZIONE:le informazioni contenute in questo messaggio sono
> da considerarsi confidenziali ed il loro utilizzo è riservato
> unicamente
> al destinatario sopra indicato. Chi dovesse ricevere questo messaggio
> per errore è tenuto ad informare il mittente ed a rimuoverlo
> definitivamente da ogni supporto elettronico o cartaceo."
> 
> "WARNING:This message contains confidential and/or proprietary
> information which may be subject to privilege or immunity and which
> is intended for use of its addressee only. Should you receive this
> message in error, you are kindly requested to inform the sender and
> to definitively remove it from any paper or electronic format."
>