[pgpool-hackers: 3916] Re: ERROR: failed to process PCP request at the moment
Tatsuo Ishii
ishii at sraoss.co.jp
Tue Jun 8 19:26:58 JST 2021
> I created 4-node cluster using pgpool_setup then detached node 0 using pcp_detach_node:
>
> $ pgpool_setup -n 4
> $ ./startall
> $ pcp_detach_node -p 11001 0
>
> This left node 3 in down status, which is not the
> expected result. I found the following in pgpool.log:
>
> 2021-06-05 07:22:17: follow_child pid 6593: LOG: execute command: /home/t-ishii/work/Pgpool-II/current/x/etc/follow_primary.sh 3 /tmp 11005 /home/t-ishii/work/Pgpool-II/current/x/data3 1 0 /tmp 0 11002 /home/t-ishii/work/Pgpool-II/current/x/data0
> 2021-06-05 07:22:17: pcp_main pid 6848: LOG: forked new pcp worker, pid=7027 socket=6
> 2021-06-05 07:22:17: pcp_child pid 7027: ERROR: failed to process PCP request at the moment
> 2021-06-05 07:22:17: pcp_child pid 7027: DETAIL: failback is in progress
>
> Why does this happen?
>
> When the follow primary command needs to be executed, the pgpool main
> process forks off a follow_child process. That process then tries to
> run pcp_recovery_node, which sends a request to the pcp child
> process. Unfortunately the request is denied because failover/failback
> is in progress.
>
> static void
> pcp_process_command(char tos, char *buf, int buf_len)
> {
>     if (tos == 'O' || tos == 'T')
>     {
>         if (Req_info->switching)
>         {
>             if (Req_info->request_queue_tail != Req_info->request_queue_head)
>             {
>                 POOL_REQUEST_KIND reqkind;
>
>                 reqkind = Req_info->request[(Req_info->request_queue_head + 1) % MAX_REQUEST_QUEUE_SIZE].kind;
>
>                 if (reqkind == NODE_UP_REQUEST)
>                     ereport(ERROR,
>                             (errmsg("failed to process PCP request at the moment"),
>                              errdetail("failback is in progress")));
>
> In the code fragment, 'O' means an online recovery request and
> Req_info->switching indicates that failover/failback is ongoing. So
> pcp_recovery_node cannot be executed while failover/failback is
> ongoing. Since pcp_recovery_node issues a failback request at the end,
> it is not surprising that, if multiple nodes are to be recovered
> while executing the follow primary command, one pcp_recovery_node run
> is canceled by a failback request from another node's pcp_recovery_node.
>
> Recent commit:
> https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=455f00dd5f5b7b94bd91aa0b6b40aab21dceabb9
>
> lets the follow primary process take the follow primary lock. Also,
> during failover/failback the pgpool main process tries to take the
> lock in find_primary_node_repeatedly(), but it would be blocked since
> the follow primary process has already acquired the lock.
>
> How to solve this?
>
> The purpose of the follow primary lock is to prevent the follow
> primary command and the detaching of a false primary by the streaming
> replication check from running concurrently. We cannot throw it away.
> However, it is not always necessary for find_primary_node_repeatedly()
> to acquire the lock. If it does not try to acquire the lock,
> failover/failback will not be blocked and will finish soon, so the
> Req_info->switching flag will be promptly set to false.
>
> When a primary node is detached, the failover command is called and a
> new primary is selected. At this point find_primary_node_repeatedly()
> certainly needs to run to find the new primary. However, once the
> follow primary command starts, the primary will not change. So my idea
> is that find_primary_node_repeatedly() checks whether the follow
> primary command is running. If it is running, it just returns the
> current primary. Otherwise it acquires the lock.
>
> Attached is the patch to implement this. For this purpose, a new shared
> memory variable Req_info->follow_primary_ongoing was introduced. The
> flag is set/unset by the follow primary process.
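
For readers of the archive, the idea quoted above can be sketched roughly as follows. This is not the committed patch, only a minimal single-process model of the decision. The only name taken from the proposal is follow_primary_ongoing. The ReqInfo struct, the FindResult enum, search_primary() and find_primary_sketch() are hypothetical stand-ins, and the lock is modeled as a plain boolean that reports "would block" instead of actually waiting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared-memory request info. */
typedef struct {
    bool follow_primary_ongoing;   /* set/unset by the follow primary process */
    bool follow_primary_lock_held; /* simplified model of the follow primary lock */
    int  primary_node_id;          /* currently known primary */
} ReqInfo;

typedef enum {
    FIND_USED_CURRENT, /* follow primary running: current primary returned */
    FIND_SEARCHED,     /* lock taken and a search was performed */
    FIND_WOULD_BLOCK   /* lock held by someone else: real code would wait here */
} FindResult;

/* Hypothetical search; a real implementation would probe the backends. */
static int search_primary(const ReqInfo *req)
{
    return req->primary_node_id;
}

/* Sketch of the proposed check at the top of the primary search. */
static FindResult find_primary_sketch(ReqInfo *req, int *primary_out)
{
    assert(req != NULL && primary_out != NULL);

    /* Once the follow primary command has started, the primary will not
     * change, so skip the lock and return the current primary. */
    if (req->follow_primary_ongoing) {
        *primary_out = req->primary_node_id;
        return FIND_USED_CURRENT;
    }

    /* Otherwise the lock is needed before searching. */
    if (req->follow_primary_lock_held)
        return FIND_WOULD_BLOCK;

    req->follow_primary_lock_held = true;
    *primary_out = search_primary(req);
    req->follow_primary_lock_held = false;
    return FIND_SEARCHED;
}
```

In this model, a call made while the follow primary process holds the lock and has set the flag returns immediately with the current primary, which is what lets failover/failback finish and clear Req_info->switching without waiting for the lock.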
Fix committed with a regression test.
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=59fdb1b8d598e61c62053ad70f3e8e4140b453e7
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp