<div dir="ltr">Ah, no problem. Here it is.<div><br></div><div>Ps. I just noticed that the commit to apply your latest patch (fix_segfault2.patch) never got pushed to our repository. So we are using 4.5.2 with only v1-0001-Fix-segfault-in-a-child-process.patch applied. I've just pushed the second patch and I'll keep you updated on any changes in the builds.</div><div><br></div><div>Best regards,</div><div>Emond</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Op do 20 jun 2024 om 13:19 schreef Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>>:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

I meant the pgpool executable file. I just need /dest/usr/sbin/pgpool.

> Hi,
>
> Probably, if you could explain what a 'load module' is, and how to get it
> from a build. I run gdb inside a docker container that's also used to build
> pgpool from source. I start it like this:
> gdb /dest/usr/sbin/pgpool <path to coredump>
>
> Would it be enough if I create a tarball of the entire /dest directory (it's
> the destination for make install)?
>
> Best regards,
> Emond
>
> On Thu, 20 Jun 2024 at 11:46, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>
>> Thanks for the info. Is it possible for you to share the pgpool load module?
>> With it, I could inspect variable values in the core myself using gdb.
>>
>> > Hi,
>> >
>> > I've managed to get some more logging for 2 crashes. The first, in build
>> > 14132 (with coredump
>> > core.pgpool.208.d678e2ac33af434981c93683b6c42020.102976.1718650919000000.bz2):
>> > #0 pool_read_message_length (cp=0x7f53440bb548) at
>> > protocol/pool_proto_modules.c:4483
>> > #1 0x000056386aa1f530 in pool_do_auth (frontend=0x56386b6c4788,
>> > cp=0x7f53440bb548) at auth/pool_auth.c:384
>> > #2 0x000056386aa33ea1 in connect_backend (sp=0x56386b6c66d8,
>> > frontend=0x56386b6c4788) at protocol/child.c:1102
>> > #3 0x000056386aa36042 in get_backend_connection (frontend=0x56386b6c4788)
>> > at protocol/child.c:2111
>> > #4 0x000056386aa31fd5 in do_child (fds=0x56386b67a320) at
>> > protocol/child.c:416
>> > #5 0x000056386a9f7a4c in fork_a_child (fds=0x56386b67a320, id=30) at
>> > main/pgpool_main.c:863
>> > #6 0x000056386a9fa9ed in reaper () at main/pgpool_main.c:2004
>> > #7 0x000056386aa0340e in check_requests () at main/pgpool_main.c:4953
>> > #8 0x000056386a9f732b in PgpoolMain (discard_status=0 '\000',
>> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
>> > #9 0x000056386a9f49e6 in main (argc=2, argv=0x7ffeeb474048) at
>> > main/main.c:365
>> >
>> > The second, in build 14136:
>> > It seems the location of the crash has moved a bit since my previous
>> > report, due to the higher log level. It now crashes in a log statement.
>> > The coredump that goes with this crash is:
>> > core.pgpool.208.6a941167afab4df68e5f050ef29c7d1b.5129.1718739212000000.bz2.
>> > #0 0x0000561e716f1fa7 in pool_create_cp () at
>> > protocol/pool_connection_pool.c:293
>> > #1 0x0000561e716ddbb5 in connect_backend (sp=0x561e726d06b8,
>> > frontend=0x561e726ce768) at protocol/child.c:1051
>> > #2 0x0000561e716e0042 in get_backend_connection (frontend=0x561e726ce768)
>> > at protocol/child.c:2111
>> > #3 0x0000561e716dbfd5 in do_child (fds=0x561e72684320) at
>> > protocol/child.c:416
>> > #4 0x0000561e716a1a4c in fork_a_child (fds=0x561e72684320, id=1) at
>> > main/pgpool_main.c:863
>> > #5 0x0000561e716a0e30 in PgpoolMain (discard_status=0 '\000',
>> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>> > #6 0x0000561e7169e9e6 in main (argc=2, argv=0x7ffcf8eef198) at
>> > main/main.c:365
>> >
>> > Best regards,
>> > Emond
>> >
>> > On Thu, 20 Jun 2024 at 09:05, Emond Papegaaij <emond.papegaaij@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> I'm very sorry. Since we've increased the logging of the container, we
>> >> sometimes seem to get some gaps in the logs during reboots. I guess the
>> >> virtual machine is having trouble getting all logs out to disk before it
>> >> reboots. The attached logfile does contain the logs for the segmentation
>> >> fault for this crash.
>> >>
>> >> Best regards,
>> >> Emond
>> >>
>> >> On Wed, 19 Jun 2024 at 14:40, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>> >>
>> >>> Sorry for the delay. I looked into the pgpool-upgrade-test-crash.log.bz2,
>> >>> but failed to find the "segmentation fault" string.
>> >>>
>> >>> Best regards,
>> >>> --
>> >>> Tatsuo Ishii
>> >>> SRA OSS LLC
>> >>> English: http://www.sraoss.co.jp/index_en/
>> >>> Japanese: http://www.sraoss.co.jp
>> >>>
>> >>> > Hi,
>> >>> >
>> >>> > Thanks for the patch. I'll add it to the build of our pgpool containers.
>> >>> >
>> >>> > Since I added the previous patch, I've got the feeling that the number
>> >>> > of segmentation faults has been reduced. It's hard to say, though,
>> >>> > whether the problem is really fixed, because some of these crashes
>> >>> > happen very infrequently.
>> >>> >
>> >>> > Yesterday, we did see 2 crashes, both with the same backtrace, which I
>> >>> > haven't seen before:
>> >>> > #0 pool_do_auth (frontend=0x55bc42c93788, cp=0x7ff7ca34f6b8) at
>> >>> > auth/pool_auth.c:349
>> >>> > #1 0x000055bc41856ea1 in connect_backend (sp=0x55bc42c956d8,
>> >>> > frontend=0x55bc42c93788) at protocol/child.c:1102
>> >>> > #2 0x000055bc41859042 in get_backend_connection (frontend=0x55bc42c93788)
>> >>> > at protocol/child.c:2111
>> >>> > #3 0x000055bc41854fd5 in do_child (fds=0x55bc42c49320) at
>> >>> > protocol/child.c:416
>> >>> > #4 0x000055bc4181aa4c in fork_a_child (fds=0x55bc42c49320, id=5) at
>> >>> > main/pgpool_main.c:863
>> >>> > #5 0x000055bc418256f7 in exec_child_restart
>> >>> > (failover_context=0x7ffdf033c6b0, node_id=0) at main/pgpool_main.c:4684
>> >>> > #6 0x000055bc4181d1dc in failover () at main/pgpool_main.c:1739
>> >>> > #7 0x000055bc4181c79e in sigusr1_interrupt_processor () at
>> >>> > main/pgpool_main.c:1507
>> >>> > #8 0x000055bc418263c3 in check_requests () at main/pgpool_main.c:4934
>> >>> > #9 0x000055bc4181a32b in PgpoolMain (discard_status=0 '\000',
>> >>> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
>> >>> > #10 0x000055bc418179e6 in main (argc=2, argv=0x7ffdf0349878) at
>> >>> > main/main.c:365
>> >>> >
>> >>> > Looking at the backtrace, this happens right after the place where the
>> >>> > first patch made a change. The slots in 'cp' are {0x0, 0x7ff7ca3479b8,
>> >>> > 0x7ff7ca347988, 0x0 <repeats 125 times>}. I've attached the logs and
>> >>> > the coredump. This crash happens during a different test (in both
>> >>> > cases): it crashes in the test for our normal upgrade procedure. The
>> >>> > crash happens when the cluster is in a fully consistent configuration:
>> >>> > all 3 nodes are up and healthy, and node 0 is both the watchdog leader
>> >>> > and running the primary database. Because node 0 is the first to be
>> >>> > upgraded, we need to initiate a failover. For this, we first restart
>> >>> > pgpool to force it to drop its leader status and wait until a new
>> >>> > leader has been elected. We then stop the database to trigger the
>> >>> > failover. The crash seems to happen when the database starts up again.
>> >>> > In the attached logs, the crash was at node 0, but in the other
>> >>> > failure, the crash was at the exact same moment in the test, but at
>> >>> > node 2. I'm also seeing some suspicious backend status reports from
>> >>> > pcp_node_info at that time:
>> >>> >
>> >>> > Node 0 status: 172.29.30.1 5432 2 0.333333 up up standby standby 0 none
>> >>> > none 2024-06-13 12:05:00 at Thu Jun 13 12:05:00 CEST 2024
>> >>> > Node 0 status: 172.29.30.1 5432 3 0.333333 down up standby standby
>> >>> > 130337 streaming async 2024-06-13 12:05:01 at Thu Jun 13 12:05:01 CEST
>> >>> >
>> >>> > Ps. I had to resend the mail with the attachments bzip2ed because it
>> >>> > was hitting the size limit.
>> >>> >
>> >>> > Best regards,
>> >>> > Emond
>> >>> >
>> >>> > On Fri, 14 Jun 2024 at 02:28, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>> >>> >
>> >>> >> > For the second crash (in pool_connection_pool.c:326), all slots are
>> >>> >> > null:
>> >>> >> > (gdb) info locals
>> >>> >> > i = 0
>> >>> >> > freed = 0
>> >>> >> > closetime = 0
>> >>> >> > oldestp = 0x7f474f36e168
>> >>> >> > ret = 0x0
>> >>> >> > info = 0x55b511283469 <opt_sort>
>> >>> >> > p = 0x7f474f36e168
>> >>> >> >
>> >>> >> > (gdb) p *p
>> >>> >> > $7 = {
>> >>> >> >   info = 0x7f47470d5c08,
>> >>> >> >   slots = {0x0 <repeats 128 times>}
>> >>> >> > }
>> >>> >> >
>> >>> >> > (gdb) p p->slots[0]
>> >>> >> > $10 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> >>> >> > (gdb) p p->slots[1]
>> >>> >> > $11 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> >>> >> > (gdb) p p->slots[2]
>> >>> >> > $12 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> >>> >>
>> >>> >> A possible explanation is that Req_info->main_node_id, which is the
>> >>> >> smallest backend node id that is still alive, was -1. From the log
>> >>> >> file just before pid 30 segfaults:
>> >>> >>
>> >>> >> 2024-05-31T07:16:45.908900+02:00 2024-05-31 07:16:45: pid 1: LOG:
>> >>> >> backend:0 is set to down status
>> >>> >> 2024-05-31T07:16:45.908939+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
>> >>> >> backend:0 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
>> >>> >> 2024-05-31T07:16:45.908987+02:00 2024-05-31 07:16:45: pid 1: LOG:
>> >>> >> backend:1 is set to down status
>> >>> >> 2024-05-31T07:16:45.909017+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
>> >>> >> backend:1 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5e07f2"
>> >>> >> 2024-05-31T07:16:45.909044+02:00 2024-05-31 07:16:45: pid 1: LOG:
>> >>> >> backend:2 is set to down status
>> >>> >> 2024-05-31T07:16:45.909071+02:00 2024-05-31 07:16:45: pid 1: DETAIL:
>> >>> >> backend:2 is DOWN on cluster leader "172.29.30.2:5432 Linux 216dfd5
>> >>> >>
>> >>> >> That means all backends were down, and Req_info->main_node_id should
>> >>> >> have been -1 at the time. Usually, if all backends are down, pgpool
>> >>> >> will not accept connections from the frontend. I think that when the
>> >>> >> check was performed, not all backends were down yet; I guess they all
>> >>> >> went down right after it. If so, at line 270 of pool_connection_pool.c:
>> >>> >>
>> >>> >> 270         if (MAIN_CONNECTION(p) == NULL)
>> >>> >>
>> >>> >> MAIN_CONNECTION (that is, p->slots[-1], because main_node_id == -1)
>> >>> >> looked into garbage and the condition above was not met. The code then
>> >>> >> proceeded to line 326:
>> >>> >>
>> >>> >> 326             pool_free_startup_packet(CONNECTION_SLOT(p, i)->sp);
>> >>> >>
>> >>> >> Here CONNECTION_SLOT(p, i) is NULL and it segfaulted.
>> >>> >>
>> >>> >> I think using the MAIN_CONNECTION macro to manage the connection pool
>> >>> >> is incorrect: which backend is alive in each connection pool slot can
>> >>> >> differ from the currently alive backends. So I have written a function,
>> >>> >> in_use_backend_id(), for searching the live backend ids in a connection
>> >>> >> slot. Attached is a patch in this direction. I hope the patch fixes the
>> >>> >> second segfault case.
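>> >>> >>
>> >>> >> To illustrate the idea, a helper of that kind could scan the slots of
>> >>> >> the pool entry itself instead of trusting the cluster-wide
>> >>> >> main_node_id. This is only a rough sketch, not the actual patch; it
>> >>> >> assumes the POOL_CONNECTION_POOL layout visible in the gdb output
>> >>> >> above and the MAX_NUM_BACKENDS constant, and uses an illustrative
>> >>> >> function name:
>> >>> >>
>> >>> >> /* Sketch only: return the smallest backend id that has a live slot
>> >>> >>  * in this particular connection pool entry, or -1 if there is none. */
>> >>> >> static int
>> >>> >> first_live_slot_id(POOL_CONNECTION_POOL *p)
>> >>> >> {
>> >>> >>     int     i;
>> >>> >>
>> >>> >>     for (i = 0; i < MAX_NUM_BACKENDS; i++)
>> >>> >>     {
>> >>> >>         if (p->slots[i] != NULL)
>> >>> >>             return i;
>> >>> >>     }
>> >>> >>     return -1;
>> >>> >> }
>> >>> >>
>> >>> >> With something like that, the check around line 270 can bail out when
>> >>> >> no live slot is found, rather than dereferencing p->slots[main_node_id]
>> >>> >> while main_node_id may be -1.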
>> >>> >>
>> >>> >> > I hope this helps.
>> >>> >> >
>> >>> >> > Best regards,
>> >>> >> > Emond
>> >>> >> >
>> >>> >> > On Tue, 4 Jun 2024 at 04:56, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>> >>> >> >
>> >>> >> >> > No worries. I hope you had a good trip. Last night we triggered
>> >>> >> >> > the last crash again. Is there anything we can do to make it
>> >>> >> >> > easier for you to find the cause?
>> >>> >> >>
>> >>> >> >> It would be helpful if you could share some variable values in the
>> >>> >> >> core file. Since I don't have the pgpool load module when you got
>> >>> >> >> the core, I cannot inspect the variables using the core you provided.
>> >>> >> >>
>> >>> >> >> > #0 connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at
>> >>> >> >> > protocol/child.c:1076
>> >>> >> >> > #1 0x000055803ce3d02a in get_backend_connection (frontend=0x55803eb08768)
>> >>> >> >> > at protocol/child.c:2112
>> >>> >> >> > #2 0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at
>> >>> >> >> > protocol/child.c:416
>> >>> >> >> > #3 0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at
>> >>> >> >> > main/pgpool_main.c:863
>> >>> >> >> > #4 0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000',
>> >>> >> >> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>> >>> >> >> > #5 0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at
>> >>> >> >> > main/main.c:365
>> >>> >> >>
>> >>> >> >> Around protocol/child.c:1076:
>> >>> >> >>
>> >>> >> >>     /* set DB node id */
>> >>> >> >>     pool_set_db_node_id(CONNECTION(backend, i), i);
>> >>> >> >>
>> >>> >> >> I want to see the values in the "backend" struct. Since the
>> >>> >> >> CONNECTION macro is used here, you have to do something like the
>> >>> >> >> following in a gdb session:
>> >>> >> >>
>> >>> >> >> p *backend->slots[0]
>> >>> >> >> p *backend->slots[1]
>> >>> >> >> p *backend->slots[2]
>> >>> >> >>
>> >>> >> >> Best regards,
>> >>> >> >> --
>> >>> >> >> Tatsuo Ishii
>> >>> >> >> SRA OSS LLC
>> >>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >>> >> >> Japanese: http://www.sraoss.co.jp
>> >>> >> >>
>> >>> >>
>> >>>
>> >>
>>