[pgpool-general: 9261] Re: Segmentation fault during shutdown
Emond Papegaaij
emond.papegaaij at gmail.com
Thu Nov 7 23:28:08 JST 2024
Hi,
Unfortunately, it seems the patch did not fix this issue. Yesterday we had
a segmentation fault at this same point again. The top of the backtrace now
is:
#0  close_all_backend_connections () at protocol/pool_connection_pool.c:1082
#1  0x0000563a51f9280f in proc_exit_prepare (code=-1) at ../../src/utils/error/elog.c:2707
#2  0x00007f0926782da7 in __funcs_on_exit () at src/exit/atexit.c:34
#3  0x00007f092677a08f in exit (code=code at entry=0) at src/exit/exit.c:29
#4  0x0000563a51f4e4e2 in child_exit (code=0) at protocol/child.c:1378
#5  die (sig=3) at protocol/child.c:1174
#6  <signal handler called>
As you can see, it now crashes at line 1082 in pool_connection_pool.c,
which looks like this in our patched version:
1074     for (i = 0; i < pool_config->max_pool; i++, p++)
1075     {
1076         int backend_id = in_use_backend_id(p);
1077
1078         if (backend_id < 0)
1079             continue;
1080         if (CONNECTION_SLOT(p, backend_id) == NULL)
1081             continue;
1082         if (CONNECTION_SLOT(p, backend_id)->sp == NULL)
1083             continue;
1084         if (CONNECTION_SLOT(p, backend_id)->sp->user == NULL)
1085             continue;
1086         pool_send_frontend_exits(p);
1087     }
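For the dereference at line 1082 to fault even though the pointer passed
the NULL check at line 1080, CONNECTION_SLOT(p, backend_id) must be
non-NULL yet invalid, e.g. pointing into pool memory that was never (or
only partially) initialized. A stand-alone sketch of that failure mode
(all types and names below are hypothetical stand-ins, not pgpool's real
structures; the program deliberately segfaults):

#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the pooled-connection structures. */
typedef struct { char *user; } StartupPacket;
typedef struct { StartupPacket *sp; } ConnSlot;
typedef struct { ConnSlot *slots[2]; } PoolEntry;

int main(void)
{
    PoolEntry p;

    /* Simulate an entry whose memory was never zeroed: the slot
     * pointers are junk, and junk is almost never NULL. */
    memset(&p, 0x5a, sizeof(p));

    ConnSlot *slot = p.slots[0];
    if (slot == NULL)       /* the NULL check (line 1080) passes */
        return 0;
    if (slot->sp == NULL)   /* faults here, mirroring line 1082 */
        return 0;
    return 1;
}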
At the moment of the crash, a lot is happening at the same time. We are
reducing a cluster back to a single node. The crash happens at the very
last moment, when only the final remaining node is still up and running,
but it is still running with the cluster configuration (a watchdog and 2
backends: the local one up, the remote one down). Our configuration
management then restarts the database (to force a configuration change on
PostgreSQL). Looking at the logs, this shutdown is noticed by pgpool, but
the watchdog does not hold a quorum, so it cannot initiate a failover
(also, there's no backend to fail over to). Then, within a second, pgpool
itself is also shut down. This is when the process segfaults. What does
seem interesting is that the pid (183) that segfaults appears to have been
started during the failover process: pgpool is simultaneously killing all
connection pids and starting this one. Also, this pid is killed within a
single ms of being started (see timestamps 2024-11-07T00:48:06.906935
and 2024-11-07T00:48:06.907304 in the logs). I hope this helps in tracking
this issue down.
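
If the earlier backtrace is representative (the signal arrived while
pool_init_cp() was still inside memset), my guess is that exit() from the
signal handler runs the on_exit callback against a pool that is only
partially initialized. A stand-alone sketch of that ordering (all names
are hypothetical; the explicit raise() just makes the timing
deterministic):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char pool[1024];                  /* stand-in for the pool */
static volatile sig_atomic_t pool_ready = 0;

/* Stand-in for close_all_backend_connections(), registered with
 * atexit(): it walks the pool, which is only safe once
 * initialization has finished. */
static void close_pool_on_exit(void)
{
    if (!pool_ready)
        fprintf(stderr, "exit handler ran on a half-initialized pool\n");
}

/* Stand-in for die(): calling exit() from a handler runs the
 * atexit/on_exit callbacks immediately, whatever the interrupted
 * code was doing. */
static void die(int sig)
{
    (void) sig;
    exit(0);
}

int main(void)
{
    atexit(close_pool_on_exit);
    signal(SIGQUIT, die);

    raise(SIGQUIT);                      /* "killed within 1 ms"... */
    memset(pool, 0, sizeof(pool));       /* ...before init finishes */
    pool_ready = 1;
    return 0;
}

If that is what happens, a ready flag checked by the exit handler (or
blocking signals until pool initialization completes) would close the
window; but that is only a guess based on the timestamps above.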
Best regards,
Emond
On Wed, Sep 18, 2024 at 04:17, Tatsuo Ishii <ishii at postgresql.org> wrote:
> Okay.
> Please let us know if you notice something.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese: http://www.sraoss.co.jp
>
> > Hi,
> >
> > Thanks for the patch. I've added it to our build. This crash is quite
> > rare, so I guess the only way of knowing if this fixed the error is by
> > observing the build for the next couple of months.
> >
> > Best regards,
> > Emond
> >
> > On Tue, Sep 17, 2024 at 08:37, Tatsuo Ishii <ishii at postgresql.org> wrote:
> >
> >> > Thanks for the report.
> >> >
> >> > Yes, it seems the crash happened when close_all_backend_connections()
> >> > was called by on_exit, which is called when the process exits. I will
> >> > look into this.
> >>
> >> close_all_backend_connections() is responsible for closing pooled
> >> connections to the backends. In that code the MAIN_CONNECTION() macro
> >> is used. The pool could contain connections pointing to a backend
> >> that was valid at some point but is in the down state at present. So
> >> instead of MAIN_CONNECTION, we should use in_use_backend_id() here.
> >> The attached patch does this. I hope it fixes your problem.
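> >>
> >> A minimal sketch of the difference (a simplification for illustration;
> >> the real pgpool definitions may differ): MAIN_CONNECTION() indexes the
> >> pool entry with the current main node id, which can name a backend
> >> that has since gone down, whereas in_use_backend_id() looks for a slot
> >> that actually holds a connection and reports -1 when there is none:
> >>
> >> /* Sketch only: return the id of the first backend that still has
> >>  * a pooled connection in this pool entry, or -1 if there is none. */
> >> static int
> >> in_use_backend_id(POOL_CONNECTION_POOL *p)
> >> {
> >>     int i;
> >>
> >>     for (i = 0; i < NUM_BACKENDS; i++)
> >>         if (p->slots[i] != NULL)
> >>             return i;
> >>     return -1;
> >> }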
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS K.K.
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese: http://www.sraoss.co.jp
> >>
> >> >> Hi,
> >> >>
> >> >> One of our test runs this weekend hit another segmentation fault.
> >> >> This crash seems to happen when pgpool is shut down, at the end of
> >> >> the testcase that reverts a cluster back to a single-node setup. At
> >> >> that moment, 172.29.30.2 is already shut down and removed from the
> >> >> cluster, and 172.29.30.3 is shut down. The configuration is updated
> >> >> and pgpool on 172.29.30.1 is restarted. The crash seems to happen at
> >> >> the moment pgpool on 172.29.30.1 is shut down to be restarted. I've
> >> >> got the feeling that the simultaneous loss of .3 and the shutdown is
> >> >> causing this crash.
> >> >>
> >> >> Below is the backtrace. Please note we've switched from Debian to
> >> >> Alpine-based images.
> >> >> #0  0x000055fe4225f0ab in close_all_backend_connections () at protocol/pool_connection_pool.c:1078
> >> >> #1  0x000055fe422917ef in proc_exit_prepare (code=-1) at ../../src/utils/error/elog.c:2707
> >> >> #2  0x00007ff1af359da7 in __funcs_on_exit () at src/exit/atexit.c:34
> >> >> #3  0x00007ff1af35108f in exit (code=code at entry=0) at src/exit/exit.c:29
> >> >> #4  0x000055fe4224d4d2 in child_exit (code=0) at protocol/child.c:1378
> >> >> #5  die (sig=3) at protocol/child.c:1174
> >> >> #6  <signal handler called>
> >> >> #7  memset () at src/string/x86_64/memset.s:55
> >> >> #8  0x000055fe4225d2ed in memset (__n=<optimized out>, __c=0, __d=<optimized out>) at /usr/include/fortify/string.h:75
> >> >> #9  pool_init_cp () at protocol/pool_connection_pool.c:83
> >> >> #10 0x000055fe4224f5f0 in do_child (fds=fds at entry=0x7ff1a6aabae0) at protocol/child.c:222
> >> >> #11 0x000055fe42223ebe in fork_a_child (fds=0x7ff1a6aabae0, id=11) at main/pgpool_main.c:863
> >> >> #12 0x000055fe42229d90 in exec_child_restart (node_id=0, failover_context=0x7ffcd98e8c50) at main/pgpool_main.c:4684
> >> >> #13 failover () at main/pgpool_main.c:1739
> >> >> #14 0x000055fe42228cd9 in sigusr1_interrupt_processor () at main/pgpool_main.c:1507
> >> >> #15 0x000055fe4222900f in check_requests () at main/pgpool_main.c:4934
> >> >> #16 0x000055fe4222ce53 in PgpoolMain (discard_status=discard_status at entry=0 '\000', clear_memcache_oidmaps=clear_memcache_oidmaps at entry=0 '\000') at main/pgpool_main.c:649
> >> >> #17 0x000055fe42222713 in main (argc=<optimized out>, argv=<optimized out>) at main/main.c:365
> >> >>
> >> >> Best regards,
> >> >> Emond
> >> > _______________________________________________
> >> > pgpool-general mailing list
> >> > pgpool-general at pgpool.net
> >> > http://www.pgpool.net/mailman/listinfo/pgpool-general
> >>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool-segfault.log.gz
Type: application/gzip
Size: 407062 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20241107/aafc8793/attachment-0001.gz>