[pgpool-hackers: 3916] Re: ERROR: failed to process PCP request at the moment

Tue Jun 8 19:26:58 JST 2021

> I created a 4-node cluster using pgpool_setup, then detached node 0 using pcp_detach_node:
> 
> $ pgpool_setup -n 4
> $ ./startall
> $ pcp_detach_node -p 11001 0
> 
> This left node 3 in down status, which is not the expected
> result. I found the following in pgpool.log:
> 
> 2021-06-05 07:22:17: follow_child pid 6593: LOG:  execute command: /home/t-ishii/work/Pgpool-II/current/x/etc/follow_primary.sh 3 /tmp 11005 /home/t-ishii/work/Pgpool-II/current/x/data3 1 0 /tmp 0 11002 /home/t-ishii/work/Pgpool-II/current/x/data0
> 2021-06-05 07:22:17: pcp_main pid 6848: LOG:  forked new pcp worker, pid=7027 socket=6
> 2021-06-05 07:22:17: pcp_child pid 7027: ERROR:  failed to process PCP request at the moment
> 2021-06-05 07:22:17: pcp_child pid 7027: DETAIL:  failback is in progress
> 
> Why does this happen?
> 
> When the follow primary command needs to be executed, pgpool main
> forks off a follow_child process. That process then tries to run
> pcp_recovery_node, which sends a request to the pcp child
> process. Unfortunately the request is denied because failover/failback
> is in progress.
> 
> static void
> pcp_process_command(char tos, char *buf, int buf_len)
> {
> 	if (tos == 'O' || tos == 'T')
> 	{
> 		if (Req_info->switching)
> 		{
> 			if (Req_info->request_queue_tail != Req_info->request_queue_head)
> 			{
> 				POOL_REQUEST_KIND reqkind;
> 
> 				reqkind = Req_info->request[(Req_info->request_queue_head + 1) % MAX_REQUEST_QUEUE_SIZE].kind;
> 
> 				if (reqkind == NODE_UP_REQUEST)
> 					ereport(ERROR,
> 							(errmsg("failed to process PCP request at the moment"),
> 							 errdetail("failback is in progress")));
> 
> In the code fragment, 'O' denotes an online recovery request and
> Req_info->switching indicates that failover/failback is ongoing, so
> pcp_recovery_node cannot be executed while failover/failback is
> ongoing. Since pcp_recovery_node issues a failback request at the end,
> it is not surprising that, when multiple nodes are recovered by the
> follow primary command, one pcp_recovery_node run is rejected because
> of a failback request issued by another node's pcp_recovery_node.
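
To make the rejection path easier to follow, here is a minimal
standalone sketch of the check in the fragment above. It is not the
pgpool source: every Sketch*/SKETCH_* name is hypothetical, and the
SKETCH_PCP_REJECT_OTHER result is only my assumption about what happens
in the part of the function not shown here, based on the statement that
pcp_recovery_node cannot run while failover/failback is ongoing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_QUEUE_SIZE 10	/* stand-in for MAX_REQUEST_QUEUE_SIZE */

typedef enum
{
	SKETCH_NODE_DOWN_REQUEST,
	SKETCH_NODE_UP_REQUEST
} SketchRequestKind;

typedef enum
{
	SKETCH_PCP_OK,				/* request may proceed */
	SKETCH_PCP_REJECT_FAILBACK,	/* "failback is in progress" */
	SKETCH_PCP_REJECT_OTHER		/* assumption: other switching cases */
} SketchPcpResult;

/* Simplified model of the shared request area (hypothetical layout). */
typedef struct
{
	struct
	{
		SketchRequestKind kind;
	}			request[SKETCH_QUEUE_SIZE];
	int			request_queue_head;
	int			request_queue_tail;
	bool		switching;		/* failover/failback is ongoing */
} SketchReqInfo;

/* Mirrors the order of the checks in the quoted fragment. */
SketchPcpResult
sketch_pcp_check(const SketchReqInfo *req, char tos)
{
	assert(req != NULL);

	/* Only the 'O' and 'T' request types are subject to the check. */
	if (tos != 'O' && tos != 'T')
		return SKETCH_PCP_OK;

	if (!req->switching)
		return SKETCH_PCP_OK;

	/* Queue not empty: look at the request being processed. */
	if (req->request_queue_tail != req->request_queue_head)
	{
		int			idx = (req->request_queue_head + 1) % SKETCH_QUEUE_SIZE;

		if (req->request[idx].kind == SKETCH_NODE_UP_REQUEST)
			return SKETCH_PCP_REJECT_FAILBACK;
	}

	/* Assumption: any other case while switching is rejected as well. */
	return SKETCH_PCP_REJECT_OTHER;
}
```

With switching set and a node-up request at the head of the queue, an
'O' request gets SKETCH_PCP_REJECT_FAILBACK, which corresponds to the
error in the log above.
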
> 
> Recent commit:
> https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=455f00dd5f5b7b94bd91aa0b6b40aab21dceabb9
> 
> lets the follow primary process take the follow primary lock. During
> failover/failback the pgpool main process also tries to take the lock
> in find_primary_node_repeatedly(), but it would be blocked since the
> follow primary process has already acquired the lock.
> 
> How to solve this?
> 
> The purpose of the follow primary lock is to prevent the follow
> primary command and the detaching of a false primary by the streaming
> replication check from running concurrently. We cannot throw it
> away. However, find_primary_node_repeatedly() does not always need to
> acquire the lock. If it does not try to acquire the lock,
> failover/failback will not be blocked and will finish soon, so the
> Req_info->switching flag will promptly be set to false.
> 
> When a primary node is detached, the failover command is called and a
> new primary is selected. At this point find_primary_node_repeatedly()
> certainly needs to run to find the new primary. However, once the
> follow primary command starts, the primary will not change. So my
> idea is that find_primary_node_repeatedly() checks whether the follow
> primary command is running. If it is, it just returns the current
> primary. Otherwise it acquires the lock.
> 
> Attached is a patch that implements this. For this purpose, a new
> shared memory variable, Req_info->follow_primary_ongoing, is
> introduced. The flag is set/unset by the follow primary process.
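
For readers who do not want to open the patch, the idea can be sketched
roughly as below. This is an illustration only, not the committed code:
apart from the follow_primary_ongoing flag named above, all names
(SketchState, find_primary_sketch, sketch_probe_node2) are hypothetical,
the lock is reduced to a counter, and the actual primary search is
replaced by a callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the relevant shared state. */
typedef struct
{
	bool		follow_primary_ongoing;	/* set/unset by the follow primary
										 * process */
	int			current_primary;		/* primary node id known so far */
	int			lock_acquired_count;	/* counts lock acquisitions, for
										 * illustration only */
} SketchState;

/* Callback standing in for the real primary search. */
typedef int (*sketch_probe_fn) (void);

/* Example probe that always reports node 2 as the primary. */
int
sketch_probe_node2(void)
{
	return 2;
}

int
find_primary_sketch(SketchState *st, sketch_probe_fn probe)
{
	assert(st != NULL && probe != NULL);

	/*
	 * The follow primary command is running: the primary will not change,
	 * so return the current one without touching the lock.
	 */
	if (st->follow_primary_ongoing)
		return st->current_primary;

	st->lock_acquired_count++;	/* stands in for acquiring the lock */
	st->current_primary = probe();
	/* the lock would be released here */

	return st->current_primary;
}
```

Because the first branch never waits for the lock, failover/failback is
not blocked while the follow primary command is running.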

Fix committed with a regression test.
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=59fdb1b8d598e61c62053ad70f3e8e4140b453e7

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp