[pgpool-general: 2680] Fwd: Re: Re: pcp_recovery_node failing in stage 1

Tue Apr 1 02:17:54 JST 2014

Hello Tatsuo,

Did the attached log provide any insight?

Thanks,
Sean

-------- Original Message --------
Subject: 	Re: [pgpool-general: 2639] Re: pcp_recovery_node failing in stage 1
Date: 	Fri, 21 Mar 2014 10:59:21 -0230
From: 	Sean Hogan <sean at compusult.net>
To: 	Tatsuo Ishii <ishii at postgresql.org>
CC: 	pgpool-general at pgpool.net

I agree, it makes no sense.  The strace is attached.

Sean
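
A trace of pgpool and all its children can be large, so it may help to narrow it down to the signal in question. The following is a minimal sketch, not taken from the thread. It assumes the trace was written to a file with strace's -o option and that signal 2 is SIGINT, as it is on Linux. The function name find_sigint and the file path in the comment are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list the lines of an strace log that mention SIGINT.
# Assumes the log was captured with something like
#   strace -f -tt -o /tmp/pgpool.strace pgpool -n
# (-f follows child processes, -tt adds timestamps, -o writes to a file;
#  the output path is arbitrary, and -n keeps pgpool in the foreground).
find_sigint() {
    trace_file=$1
    # kill()/tgkill() calls that name SIGINT point at the sender;
    # "--- SIGINT" lines mark delivery to the receiving process.
    grep -n 'SIGINT' "$trace_file"
}
```

Called as `find_sigint /tmp/pgpool.strace`, it prints each matching line with its line number, which gives a starting point for reading the surrounding calls.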

On 14-03-21 10:02 AM, Tatsuo Ishii wrote:
> Ridiculous. There's no code in pgpool which sends signal 2 to recovery
> command. Is it possible to start pgpool from strace and do the
> recovery so that we could find who sends the signal?
>
> strace -f pgpool start
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
>> The stage 1 script is not careful with exit codes, so it continues
>> after the failed rsync and eventually exits with success.  This tricks
>> pgpool into continuing with stage 2, but it's definitely the stage 1
>> command that is failing.
>>
>> Sean
>>
>> On 14-03-21 06:20 AM, Tatsuo Ishii wrote:
>>>> Sorry, the subject line should have said stage *1*.
>>> Really? From what I read from pgpool log:
>>>
>>> 2014-03-20 12:42:43 LOG:   pid 18259: 1st stage is done
>>> 2014-03-20 12:42:43 LOG:   pid 18259: starting 2nd stage
>>> 2014-03-20 12:42:47 LOG:   pid 18259: CHECKPOINT in the 2nd stage done
>>> 2014-03-20 12:42:47 LOG:   pid 18259: starting recovery command: "SELECT pgpool_recovery('pgpool_recovery_pitr.sh', 'psql02.compusult.net', '/var/lib/pgsql/9.2/data')"
>>> 2014-03-20 12:42:49 LOG:   pid 18259: check_postmaster_started: try to connect to postmaster on hostname:psql02.compusult.net database:postgres user:postgres (retry 0 times)
>>>
>>> I saw "1st stage is done", so I guess the first stage succeeded
>>> but the second stage failed. What does the second stage look
>>> like?
>>>
>>> Best regards,
>>> --
>>> Tatsuo Ishii
>>> SRA OSS, Inc. Japan
>>> English: http://www.sraoss.co.jp/index_en.php
>>> Japanese: http://www.sraoss.co.jp
>>>
>>>> On 14-03-20 12:48 PM, Sean Hogan wrote:
>>>>> Hi,
>>>>>
>>>>> In my setup at the moment I have a pair of version 3.3.2 pgpool
>>>>> instances with two backend PostgreSQL 9.2.4 servers, all running on
>>>>> CentOS 6.4.  The PostgreSQL data directories are quite large - 144GB.
>>>>> I have run into a situation where pcp_recovery_node consistently fails
>>>>> with a BackendError.
>>>>>
>>>>> The stage 1 recovery command is a script called do-base-backup.sh that
>>>>> runs an rsync as follows:
>>>>>
>>>>>       rsync -Cacvv --delete \
>>>>>               --exclude postmaster.pid --exclude postmaster.opts \
>>>>>               --exclude recovery.done \
>>>>>               --exclude pg_log/\* --exclude pg_xlog/\* \
>>>>>               $SOURCE/ $DESTINATION/ 2>&1 |
>>>>>       mailx -s "rsync verbose output" sean at compusult.net
>>>>>
>>>>> For some reason this rsync is failing after some minutes (typically 10
>>>>> to 12) with undocumented exit code 255.  The verbose rsync logging
>>>>> says this:
>>>>>
>>>>>       Killed by signal 2.
>>>>>       rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
>>>>>       rsync: connection unexpectedly closed (50735 bytes received so far) [sender]
>>>>>       rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
>>>>>
>>>>> Googling has not brought up anything helpful other than bugs with
>>>>> large files in older versions of rsync.  I'm fairly certain that is
>>>>> not the case here, especially because of the "Killed by signal 2",
>>>>> which is suggestive of some sort of timeout on the pgpool end.
>>>>>
>>>>> The specific command line I'm using to recover the second database
>>>>> node is:
>>>>>
>>>>>       sudo -u postgres /usr/local/bin/pcp_recovery_node 10000 psql01 9898 \
>>>>>               postgres XXXXXX 1
>>>>>
>>>>> With such a large timeout value I shouldn't be hitting a timeout
>>>>> there.
>>>>>
>>>>> The weird thing, which makes me point the finger at either pgpool or
>>>>> pcp_recovery_node, is that if I run do-base-backup.sh manually it
>>>>> works fine (and takes much much longer, as expected).
>>>>>
>>>>> Does pgpool have some internal limit on how long it will wait for the
>>>>> 1st stage command to run?  I've attached the log file but it isn't
>>>>> very informative.  (Note that the do-base-backup.sh script isn't
>>>>> communicating the rsync failure back to pgpool, so pgpool goes ahead
>>>>> and runs stage 2.  Of course, that fails because not everything has
>>>>> been synced.)
>>>>>
>>>>> Thanks,
>>>>> Sean
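
The exit-code problem mentioned in this thread (the stage 1 script reporting success after rsync has failed) follows from the pipeline: a shell pipeline returns the status of its last command, here mailx, unless told otherwise. Below is a minimal sketch of one way to pass the rsync status back to the caller, assuming a POSIX shell. The function name run_and_report and the temporary log file are illustrative and not part of the original do-base-backup.sh.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, keep its output, and return the
# command's own exit status rather than that of whatever consumes the output.
run_and_report() {
    log=$(mktemp) || return 1
    # Run the command with stdout and stderr captured in the log file.
    "$@" >"$log" 2>&1
    status=$?
    # Print the captured output; in the stage 1 script this is the point
    # where the log could be piped to mailx instead.
    cat "$log"
    rm -f "$log"
    return "$status"
}

# Usage sketch (commented out; SOURCE and DESTINATION as in the original script):
# run_and_report rsync -Cacvv --delete "$SOURCE/" "$DESTINATION/" || exit 1
```

If the script runs under bash, `set -o pipefail` or `${PIPESTATUS[0]}` are alternatives that keep the original pipe to mailx and still expose the rsync status.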

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool.log.gz
Type: application/x-gzip
Size: 114969 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20140331/2576a440/attachment.bin>