[pgpool-general: 9263] Re: Segmentation fault during shutdown

Fri Nov 8 20:22:58 JST 2024

> Hi Emond,
> Thank you for the report. I will look into this.
> Best reagards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>> Hi,
>> Unfortunately, it seems the patch did not fix this issue. Yesterday we
 had
>> a segmentation fault at this same point again. The top of the backtrace
 now
>> is:
>> #0  close_all_backend_connections () at
 protocol/pool_connection_pool.c:1082
>> #1  0x0000563a51f9280f in proc_exit_prepare (code=-1) at
>> ../../src/utils/error/elog.c:2707
>> #2  0x00007f0926782da7 in __funcs_on_exit () at src/exit/atexit.c:34
>> #3  0x00007f092677a08f in exit (code=code at entry=0) at src/exit/exit.c:29
>> #4  0x0000563a51f4e4e2 in child_exit (code=0) at protocol/child.c:1378
>> #5  die (sig=3) at protocol/child.c:1174
>> #6  <signal handler called>
>> As you can see, it now crashes at line 1082 in pool_connection_pool.c,
>> which looks like this in our patched version:
>> 1074  for (i = 0; i < pool_config->max_pool; i++, p++)
>> 1075  {
>> 1076      int backend_id = in_use_backend_id(p);
>> 1077
>> 1078      if (backend_id < 0)
>> 1079          continue;
>> 1080      if (CONNECTION_SLOT(p, backend_id) == NULL)
>> 1081          continue;
>> 1082      if (CONNECTION_SLOT(p, backend_id)->sp == NULL)
>> 1083          continue;
>> 1084      if (CONNECTION_SLOT(p, backend_id)->sp->user == NULL)
>> 1085          continue;
>> 1086      pool_send_frontend_exits(p);
>> 1087  }
>> At the moment of the crash, a lot is happening at the same time. We are
>> reducing a cluster back to a single node. The crash happens at the very
>> last moment, when only the final remaining node is still up and running,
>> but it still is running with cluster configuration (with a watchdog and 2
>> backends, the local one up, the remote one down). Our configuration
>> management then restarts the database (to force a configuration change on
>> postgresql). Looking at the logs, this shutdown is noticed by pgpool, but
>> the watchdog does not hold a quorum, so it cannot initiate a failover
>> (also, there's no backend to failover to). Then, within a second, pgpool
>> itself is also shutdown. This is when the process segfaults. Something
 that
>> does seem interesting is that the pid (183) that segfaults, seems to be
>> started during the failover process. pgpool is simultaneously killing all
>> connection pids and starting this one. Also, this pid is killed within a
>> single ms of being started (see timestamp 2024-11-07T00:48:06.906935
>> and 2024-11-07T00:48:06.907304 in the logs). I hope this helps in
 tracking
>> this issue down.
The crash is inside close_all_backend_connections(), which is called
when pgpool child process is about to exit. The function should never
crash if the connection pool has been already initialized. The
initialization is done in pool_init_cp(). The function is called at
the very early stage of pgpool child process starts up. It allocates
memory by using palloc then clear the memory using memset().
>> connection pids and starting this one. Also, this pid is killed within a
>> single ms of being started (see timestamp 2024-11-07T00:48:06.906935
>> and 2024-11-07T00:48:06.907304 in the logs). I hope this helps in
 tracking
So my guess is, the process (pid 183) was killed in the middle of
memset() and the connection pool object was left with uninitialized
data.
The purpose of close_all_backend_connections() is sending an 'X'
(terminate) message to backend telling that pgpool is about to
disconnect. This is a polite way from frontend/backend protocol's
point of view, but it's hard to make it work in the situation you
reported because we need to establish a way to not call
close_all_backend_connections() if the initialization is not
done. Alternative way is, to not call close_all_backend_connections()
at all at the process exit. This will solve the problem in simple and
reliable way. Disconnecting connection to backend without sending 'X'
message is not prohibited by the protocol, I believe. Also as far as I
know PostgreSQL can handle the situation (disconnected without a prior
'X' message from frontend).
Patch attached.
Best reagards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
Mime-Version: 1.0
Content-Type: Text/X-Patch; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="fix_segfault.patch"

diff --git a/src/protocol/child.c b/src/protocol/child.c
index c12a5a2c1..4e2a3443b 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -1350,9 +1350,12 @@ child_will_go_down(int code, Datum arg)
 		memcached_disconnect();
 	}
 
-	/* let backend know now we are exiting */
-	if (pool_connection_pool)
-		close_all_backend_connections();
+	/*
+	 * We used to call close_all_backend_connections() here so that we send
+	 * 'X' (terminate) message to backend. However it was possible that the
+	 * function is called while initializing the connection pool object, which
+	 * leads to crash. So we stopped to call close_all_backend_connections().
+	 */
 }
 
 /*