<div dir="ltr">Hi,<div><br></div><div>Probably, if you could explain what a 'load module' is, and how to get it from a build. I run gdb inside a docker container that's also used to build pgpool from source. I start it like this:</div><div>gdb /dest/usr/sbin/pgpool <path to coredump></div><div><br></div><div>Would it be enough if I create a tarball of the entire /dest directory (its the destination for make install)?</div><div><br></div><div>Best regards,</div><div>Emond</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Op do 20 jun 2024 om 11:46 schreef Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>>:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Thank for the info. Is it possible for you to share the pgpool load module?<br>

With it, I could inspect variable values in the core myself using gdb.<br>

<br>

> Hi,<br>

> <br>

> I've managed to get some more logging for 2 crashes. The first, in build<br>

> 14132 (with coredump<br>

> core.pgpool.208.d678e2ac33af434981c93683b6c42020.102976.1718650919000000.bz2):<br>

> #0  pool_read_message_length (cp=0x7f53440bb548) at<br>

> protocol/pool_proto_modules.c:4483<br>

> #1  0x000056386aa1f530 in pool_do_auth (frontend=0x56386b6c4788,<br>

> cp=0x7f53440bb548) at auth/pool_auth.c:384<br>

> #2  0x000056386aa33ea1 in connect_backend (sp=0x56386b6c66d8,<br>

> frontend=0x56386b6c4788) at protocol/child.c:1102<br>

> #3  0x000056386aa36042 in get_backend_connection (frontend=0x56386b6c4788)<br>

> at protocol/child.c:2111<br>

> #4  0x000056386aa31fd5 in do_child (fds=0x56386b67a320) at<br>

> protocol/child.c:416<br>

> #5  0x000056386a9f7a4c in fork_a_child (fds=0x56386b67a320, id=30) at<br>

> main/pgpool_main.c:863<br>

> #6  0x000056386a9fa9ed in reaper () at main/pgpool_main.c:2004<br>

> #7  0x000056386aa0340e in check_requests () at main/pgpool_main.c:4953<br>

> #8  0x000056386a9f732b in PgpoolMain (discard_status=0 '\000',<br>

> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649<br>

> #9  0x000056386a9f49e6 in main (argc=2, argv=0x7ffeeb474048) at<br>

> main/main.c:365<br>

> <br>

> The second, in build 14136:<br>

> It seems the location of the crash moved a bit, since my previous report<br>

> due to the higher loglevel. It now crashes in a log statement. The coredump<br>

> that goes with this crash<br>

> is: core.pgpool.208.6a941167afab4df68e5f050ef29c7d1b.5129.1718739212000000.bz2.<br>

> #0  0x0000561e716f1fa7 in pool_create_cp () at<br>

> protocol/pool_connection_pool.c:293<br>

> #1  0x0000561e716ddbb5 in connect_backend (sp=0x561e726d06b8,<br>

> frontend=0x561e726ce768) at protocol/child.c:1051<br>

> #2  0x0000561e716e0042 in get_backend_connection (frontend=0x561e726ce768)<br>

> at protocol/child.c:2111<br>

> #3  0x0000561e716dbfd5 in do_child (fds=0x561e72684320) at<br>

> protocol/child.c:416<br>

> #4  0x0000561e716a1a4c in fork_a_child (fds=0x561e72684320, id=1) at<br>

> main/pgpool_main.c:863<br>

> #5  0x0000561e716a0e30 in PgpoolMain (discard_status=0 '\000',<br>

> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561<br>

> #6  0x0000561e7169e9e6 in main (argc=2, argv=0x7ffcf8eef198) at<br>

> main/main.c:365<br>

> <br>

> Best regards,<br>

> Emond<br>

> <br>

> Op do 20 jun 2024 om 09:05 schreef Emond Papegaaij <<br>

> <a href="mailto:emond.papegaaij@gmail.com" target="_blank">emond.papegaaij@gmail.com</a>>:<br>

> <br>

>> Hi,<br>

>><br>

>> I'm very sorry. Since we've increased the logging of the container, we<br>

>> sometimes seem to get some gaps in the logs during reboots. I guess the<br>

>> virtual machine is having trouble getting all logs out to disk before it<br>

>> reboots. The attached logfile does contain the logs for the segmentation<br>

>> fault for this crash.<br>

>><br>

>> Best regards,<br>

>> Emond<br>

>><br>

>> Op wo 19 jun 2024 om 14:40 schreef Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp" target="_blank">ishii@sraoss.co.jp</a>>:<br>

>><br>

>>> Sorry for delay. I looked into the pgpool-upgrade-test-crash.log.bz2,<br>

>>> but failed to find "segmentation fault" string.<br>

>>><br>

>>> Best reagards,<br>

>>> --<br>

>>> Tatsuo Ishii<br>

>>> SRA OSS LLC<br>

>>> English: <a href="http://www.sraoss.co.jp/index_en/" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en/</a><br>

>>> Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>

>>><br>

>>> > Hi,<br>

>>> ><br>

>>> > Thanks for the patch. I'll add it to the build of our pgpool containers.<br>

>>> ><br>

>>> > Since I've added the previous patch, I've got the feeling that the<br>

>>> number<br>

>>> > of segmentation faults has been reduced. It's hard to say though if the<br>

>>> > problem is really fixed, because some of these crashes happen very<br>

>>> > infrequently.<br>

>>> ><br>

>>> > Yesterday, we did see 2 crashes, both with the same backtrace, which I<br>

>>> > haven't seen before:<br>

>>> > #0  pool_do_auth (frontend=0x55bc42c93788, cp=0x7ff7ca34f6b8) at<br>

>>> > auth/pool_auth.c:349<br>

>>> > #1  0x000055bc41856ea1 in connect_backend (sp=0x55bc42c956d8,<br>

>>> > frontend=0x55bc42c93788) at protocol/child.c:1102<br>

>>> > #2  0x000055bc41859042 in get_backend_connection<br>

>>> (frontend=0x55bc42c93788)<br>

>>> > at protocol/child.c:2111<br>

>>> > #3  0x000055bc41854fd5 in do_child (fds=0x55bc42c49320) at<br>

>>> > protocol/child.c:416<br>

>>> > #4  0x000055bc4181aa4c in fork_a_child (fds=0x55bc42c49320, id=5) at<br>

>>> > main/pgpool_main.c:863<br>

>>> > #5  0x000055bc418256f7 in exec_child_restart<br>

>>> > (failover_context=0x7ffdf033c6b0, node_id=0) at main/pgpool_main.c:4684<br>

>>> > #6  0x000055bc4181d1dc in failover () at main/pgpool_main.c:1739<br>

>>> > #7  0x000055bc4181c79e in sigusr1_interrupt_processor () at<br>

>>> > main/pgpool_main.c:1507<br>

>>> > #8  0x000055bc418263c3 in check_requests () at main/pgpool_main.c:4934<br>

>>> > #9  0x000055bc4181a32b in PgpoolMain (discard_status=0 '\000',<br>

>>> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649<br>

>>> > #10 0x000055bc418179e6 in main (argc=2, argv=0x7ffdf0349878) at<br>

>>> > main/main.c:365<br>

>>> ><br>

>>> > Looking at the backtrace, this happens right after the place where the<br>

>>> > first patch made a change. The slots in 'cp' are {0x0, 0x7ff7ca3479b8,<br>

>>> > 0x7ff7ca347988, 0x0 <repeats 125 times>}. I've attached the logs and the<br>

>>> > coredump. This crash happens during a different test (in both cases). It<br>

>>> > crashes in the test for our normal upgrade procedure. The crash happens<br>

>>> > when the cluster is in a fully consistent configuration, all 3 nodes<br>

>>> are up<br>

>>> > and healthy. node 0 is both watchdog leader and running the primary<br>

>>> > database. Because node 0 is the first to be upgraded, we need to<br>

>>> initiate a<br>

>>> > failover. For this, we first restart pgpool to force it to drop its<br>

>>> leader<br>

>>> > status and wait until a new leader has been elected. We then stop the<br>

>>> > database to trigger the failover. The crash seems to happen when the<br>

>>> > database started up again. In the attached logs, the crash was at node<br>

>>> 0,<br>

>>> > but in the other failure, the crash was at the exact same moment in the<br>

>>> > test, but at node 2. I'm also seeing some suspicious backend status<br>

>>> reports<br>

>>> > from pcp_node_info at that time:<br>

>>> ><br>

>>> > Node 0 status: 172.29.30.1 5432 2 0.333333 up up standby standby 0 none<br>

>>> > none 2024-06-13 12:05:00 at Thu Jun 13 12:05:00 CEST 2024<br>

>>> > Node 0 status: 172.29.30.1 5432 3 0.333333 down up standby standby<br>

>>> 130337<br>

>>> > streaming async 2024-06-13 12:05:01 at Thu Jun 13 12:05:01 CEST<br>

>>> ><br>

>>> > ps. I had to resend the mail with the attachments bzip2ed because it was<br>

>>> > hitting the size limit.<br>

>>> ><br>

>>> > Best regards,<br>

>>> > Emond<br>

>>> ><br>

>>> > Op vr 14 jun 2024 om 02:28 schreef Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp" target="_blank">ishii@sraoss.co.jp</a>>:<br>

>>> ><br>

>>> >> > For the second crash (in pool_connection_pool.c:326), all slots are<br>

>>> null:<br>

>>> >> > (gdb) info locals<br>

>>> >> > i = 0<br>

>>> >> > freed = 0<br>

>>> >> > closetime = 0<br>

>>> >> > oldestp = 0x7f474f36e168<br>

>>> >> > ret = 0x0<br>

>>> >> > info = 0x55b511283469 <opt_sort><br>

>>> >> > p = 0x7f474f36e168<br>

>>> >> ><br>

>>> >> > (gdb) p *p<br>

>>> >> > $7 = {<br>

>>> >> >   info = 0x7f47470d5c08,<br>

>>> >> >   slots = {0x0 <repeats 128 times>}<br>

>>> >> > }<br>

>>> >> ><br>

>>> >> > (gdb) p p->slots[0]<br>

>>> >> > $10 = (POOL_CONNECTION_POOL_SLOT *) 0x0<br>

>>> >> > (gdb) p p->slots[1]<br>

>>> >> > $11 = (POOL_CONNECTION_POOL_SLOT *) 0x0<br>

>>> >> > (gdb) p p->slots[2]<br>

>>> >> > $12 = (POOL_CONNECTION_POOL_SLOT *) 0x0<br>

>>> >><br>

>>> >> Possible explanation is, Req_info->main_node_id, which is the smallest<br>

>>> >> backend node id number being alive, was -1. From the log file just<br>

>>> >> before pid 30 segfaults:<br>

>>> >><br>

>>> >> 2024-05-31T07:16:45.908900+02:00 2024-05-31 07:16:45: pid 1: LOG:<br>

>>> >> backend:0 is set to down status<br>

>>> >> 2024-05-31T07:16:45.908939+02:00 2024-05-31 07:16:45: pid 1: DETAIL:<br>

>>> >> backend:0 is DOWN on cluster leader "<a href="http://172.29.30.2:5432" rel="noreferrer" target="_blank">172.29.30.2:5432</a> Linux<br>

>>> 216dfd5e07f2"<br>

>>> >> 2024-05-31T07:16:45.908987+02:00 2024-05-31 07:16:45: pid 1: LOG:<br>

>>> >> backend:1 is set to down status<br>

>>> >> 2024-05-31T07:16:45.909017+02:00 2024-05-31 07:16:45: pid 1: DETAIL:<br>

>>> >> backend:1 is DOWN on cluster leader "<a href="http://172.29.30.2:5432" rel="noreferrer" target="_blank">172.29.30.2:5432</a> Linux<br>

>>> 216dfd5e07f2"<br>

>>> >> 2024-05-31T07:16:45.909044+02:00 2024-05-31 07:16:45: pid 1: LOG:<br>

>>> >> backend:2 is set to down status<br>

>>> >> 2024-05-31T07:16:45.909071+02:00 2024-05-31 07:16:45: pid 1: DETAIL:<br>

>>> >> backend:2 is DOWN on cluster leader "<a href="http://172.29.30.2:5432" rel="noreferrer" target="_blank">172.29.30.2:5432</a> Linux 216dfd5<br>

>>> >><br>

>>> >> That means all backends were down, and Req_info->main_node_id should<br>

>>> >> have been -1 at the time. Usually if all backend are down, pgpool will<br>

>>> >> not accept connections from frontend. I think when the check was<br>

>>> >> peformed, not all backend are down. Right after this all backend went<br>

>>> >> down I guess. If so, at line 270 of pool_connection_pool.c:<br>

>>> >><br>

>>> >> 270                 if (MAIN_CONNECTION(p) == NULL)<br>

>>> >><br>

>>> >> MAIN_CONNECTION (that is p->slots[-1] because main_node_id == -1)<br>

>>> >> looked into garbage and above condition was not met. Then the code<br>

>>> >> proceeded to line 326:<br>

>>> >><br>

>>> >>  326<br>

>>>  pool_free_startup_packet(CONNECTION_SLOT(p,<br>

>>> >> i)->sp);<br>

>>> >><br>

>>> >> Here CONNECTION_SLOT(p, i) is NULL and segfaulted.<br>

>>> >><br>

>>> >> I think using MAIN_CONNECTION macro to take care of connection<br>

>>> >> poolings is incorrect. Which backend is alive in the each connection<br>

>>> >> pool slot could be different from current alive backend. So I invented<br>

>>> >> a function in_use_backend_id() for seaching live backend ids in a<br>

>>> >> connection slot. Attached is the patch in this direction. I hope the<br>

>>> >> patch to fix the second segfault case.<br>

>>> >><br>

>>> >> > I hope this helps.<br>

>>> >> ><br>

>>> >> > Best regards,<br>

>>> >> > Emond<br>

>>> >> ><br>

>>> >> > Op di 4 jun 2024 om 04:56 schreef Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp" target="_blank">ishii@sraoss.co.jp</a>>:<br>

>>> >> ><br>

>>> >> >> > No worries. I hope you had a good trip. Last night we triggered<br>

>>> the<br>

>>> >> last<br>

>>> >> >> > crash again. Is there anything we can do to make it easier for<br>

>>> you to<br>

>>> >> >> find<br>

>>> >> >> > the cause?<br>

>>> >> >><br>

>>> >> >> It would be helpful if you could share some variable values in the<br>

>>> >> >> core file. Since I don't have the pgpool load module when you got<br>

>>> the<br>

>>> >> >> core, I cannot inspect the variables using the core you provided.<br>

>>> >> >><br>

>>> >> >> > #0  connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768)<br>

>>> at<br>

>>> >> >> > protocol/child.c:1076<br>

>>> >> >> > #1  0x000055803ce3d02a in get_backend_connection<br>

>>> >> >> (frontend=0x55803eb08768)<br>

>>> >> >> > at protocol/child.c:2112<br>

>>> >> >> > #2  0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at<br>

>>> >> >> > protocol/child.c:416<br>

>>> >> >> > #3  0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90,<br>

>>> id=13) at<br>

>>> >> >> > main/pgpool_main.c:863<br>

>>> >> >> > #4  0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000',<br>

>>> >> >> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561<br>

>>> >> >> > #5  0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at<br>

>>> >> >> > main/main.c:365<br>

>>> >> >><br>

>>> >> >> Around protocol/child.c:1076:<br>

>>> >> >>                                 /* set DB node id */<br>

>>> >> >><br>

>>>  pool_set_db_node_id(CONNECTION(backend,<br>

>>> >> >> i), i);<br>

>>> >> >><br>

>>> >> >> I want to see the values in the "backend" struct. Since CONNECTION<br>

>>> >> >> macro is used here, you have to do something like in a gdb session.<br>

>>> >> >><br>

>>> >> >> p *backend->slots[0]<br>

>>> >> >> p *backend->slots[1]<br>

>>> >> >> p *backend->slots[2]<br>

>>> >> >><br>

>>> >> >> Best reagards,<br>

>>> >> >> --<br>

>>> >> >> Tatsuo Ishii<br>

>>> >> >> SRA OSS LLC<br>

>>> >> >> English: <a href="http://www.sraoss.co.jp/index_en/" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en/</a><br>

>>> >> >> Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>

>>> >> >><br>

>>> >><br>

>>><br>

>><br>

</blockquote></div>