[pgpool-general: 9262] Re: Segmentation fault during shutdown

Tatsuo Ishii ishii at postgresql.org
Fri Nov 8 09:28:12 JST 2024


Hi Emond,

Thank you for the report. I will look into this.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp

> Hi,
> 
> Unfortunately, it seems the patch did not fix this issue. Yesterday we had
> a segmentation fault at this same point again. The top of the backtrace now
> is:
> #0  close_all_backend_connections () at protocol/pool_connection_pool.c:1082
> #1  0x0000563a51f9280f in proc_exit_prepare (code=-1) at
> ../../src/utils/error/elog.c:2707
> #2  0x00007f0926782da7 in __funcs_on_exit () at src/exit/atexit.c:34
> #3  0x00007f092677a08f in exit (code=code@entry=0) at src/exit/exit.c:29
> #4  0x0000563a51f4e4e2 in child_exit (code=0) at protocol/child.c:1378
> #5  die (sig=3) at protocol/child.c:1174
> #6  <signal handler called>
> 
> As you can see, it now crashes at line 1082 in pool_connection_pool.c,
> which looks like this in our patched version:
> 1074  for (i = 0; i < pool_config->max_pool; i++, p++)
> 1075  {
> 1076      int backend_id = in_use_backend_id(p);
> 1077
> 1078      if (backend_id < 0)
> 1079          continue;
> 1080      if (CONNECTION_SLOT(p, backend_id) == NULL)
> 1081          continue;
> 1082      if (CONNECTION_SLOT(p, backend_id)->sp == NULL)
> 1083          continue;
> 1084      if (CONNECTION_SLOT(p, backend_id)->sp->user == NULL)
> 1085          continue;
> 1086      pool_send_frontend_exits(p);
> 1087  }
> 
> At the moment of the crash, a lot is happening at the same time. We are
> reducing a cluster back to a single node. The crash happens at the very
> last moment, when only the final remaining node is still up and running,
> but it is still running with the cluster configuration (with a watchdog
> and 2 backends, the local one up, the remote one down). Our configuration
> management then restarts the database (to force a configuration change on
> PostgreSQL). Looking at the logs, this shutdown is noticed by pgpool, but
> the watchdog does not hold a quorum, so it cannot initiate a failover
> (also, there is no backend to fail over to). Then, within a second, pgpool
> itself is also shut down. This is when the process segfaults. Something
> that does seem interesting is that the pid (183) that segfaults seems to
> have been started during the failover process: pgpool is simultaneously
> killing all connection pids and starting this one. Also, this pid is
> killed within a single millisecond of being started (see timestamps
> 2024-11-07T00:48:06.906935 and 2024-11-07T00:48:06.907304 in the logs). I
> hope this helps in tracking this issue down.
> 
> Best regards,
> Emond
> 
> On Wed, Sep 18, 2024 at 04:17, Tatsuo Ishii <ishii at postgresql.org> wrote:
> 
>> Okay.
>> Please let us know if you notice something.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese: http://www.sraoss.co.jp
>>
>> > Hi,
>> >
>> > Thanks for the patch. I've added it to our build. This crash is quite
>> > rare, so I guess the only way of knowing if this fixed the error is by
>> > observing the build for the next couple of months.
>> >
>> > Best regards,
>> > Emond
>> >
>> > On Tue, Sep 17, 2024 at 08:37, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> >
>> >> > Thanks for the report.
>> >> >
>> >> > Yes, it seems the crash happened when close_all_backend_connections()
>> >> > was called by on_exit which is called when process exits. I will look
>> >> > into this.
>> >>
>> >> close_all_backend_connections() is responsible for closing pooled
>> >> connections to the backends.  In that code the MAIN_CONNECTION() macro
>> >> is used. The pool could contain connections pointing to a backend that
>> >> was valid at some point but is in the down state at present. So instead
>> >> of MAIN_CONNECTION, we should use in_use_backend_id() here. The
>> >> attached patch does this. I hope the patch fixes your problem.
>> >>
>> >> Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS K.K.
>> >> English: http://www.sraoss.co.jp/index_en/
>> >> Japanese: http://www.sraoss.co.jp
>> >>
>> >> >> Hi,
>> >> >>
>> >> >> One of our test runs this weekend hit another segmentation fault.
>> >> >> This crash seems to happen when pgpool is shut down, at the end of
>> >> >> the testcase that reverts a cluster back to a single-node setup. At
>> >> >> that moment, 172.29.30.2 is already shut down and removed from the
>> >> >> cluster, and 172.29.30.3 is shut down. The configuration is updated
>> >> >> and pgpool on 172.29.30.1 is restarted. The crash seems to happen at
>> >> >> the moment pgpool on 172.29.30.1 is shut down to be restarted. I've
>> >> >> got the feeling that the simultaneous loss of .3 and the shutdown is
>> >> >> causing this crash.
>> >> >>
>> >> >> Below is the backtrace. Please note we've switched from Debian- to
>> >> >> Alpine-based images.
>> >> >> #0  0x000055fe4225f0ab in close_all_backend_connections () at
>> >> >> protocol/pool_connection_pool.c:1078
>> >> >> #1  0x000055fe422917ef in proc_exit_prepare (code=-1) at
>> >> >> ../../src/utils/error/elog.c:2707
>> >> >> #2  0x00007ff1af359da7 in __funcs_on_exit () at src/exit/atexit.c:34
>> >> >> #3  0x00007ff1af35108f in exit (code=code@entry=0) at src/exit/exit.c:29
>> >> >> #4  0x000055fe4224d4d2 in child_exit (code=0) at protocol/child.c:1378
>> >> >> #5  die (sig=3) at protocol/child.c:1174
>> >> >> #6  <signal handler called>
>> >> >> #7  memset () at src/string/x86_64/memset.s:55
>> >> >> #8  0x000055fe4225d2ed in memset (__n=<optimized out>, __c=0,
>> >> >> __d=<optimized out>) at /usr/include/fortify/string.h:75
>> >> >> #9  pool_init_cp () at protocol/pool_connection_pool.c:83
>> >> >> #10 0x000055fe4224f5f0 in do_child (fds=fds@entry=0x7ff1a6aabae0) at
>> >> >> protocol/child.c:222
>> >> >> #11 0x000055fe42223ebe in fork_a_child (fds=0x7ff1a6aabae0, id=11) at
>> >> >> main/pgpool_main.c:863
>> >> >> #12 0x000055fe42229d90 in exec_child_restart (node_id=0,
>> >> >> failover_context=0x7ffcd98e8c50) at main/pgpool_main.c:4684
>> >> >> #13 failover () at main/pgpool_main.c:1739
>> >> >> #14 0x000055fe42228cd9 in sigusr1_interrupt_processor () at
>> >> >> main/pgpool_main.c:1507
>> >> >> #15 0x000055fe4222900f in check_requests () at main/pgpool_main.c:4934
>> >> >> #16 0x000055fe4222ce53 in PgpoolMain (discard_status=discard_status@entry=0
>> >> >> '\000', clear_memcache_oidmaps=clear_memcache_oidmaps@entry=0 '\000') at
>> >> >> main/pgpool_main.c:649
>> >> >> #17 0x000055fe42222713 in main (argc=<optimized out>, argv=<optimized
>> >> >> out>) at main/main.c:365
>> >> >>
>> >> >> Best regards,
>> >> >> Emond
>> >> > _______________________________________________
>> >> > pgpool-general mailing list
>> >> > pgpool-general at pgpool.net
>> >> > http://www.pgpool.net/mailman/listinfo/pgpool-general
>> >>
>>
