<div dir="ltr">Hi,<div><br></div><div>I'm very sorry. Since we've increased the logging of the container, we sometimes seem to get gaps in the logs during reboots. I guess the virtual machine is having trouble getting all logs out to disk before it reboots. The attached logfile does contain the logs for the segmentation fault for this crash.</div><div><br></div><div>Best regards,</div><div>Emond</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 19 Jun 2024 at 14:40, Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Sorry for the delay. I looked into the pgpool-upgrade-test-crash.log.bz2,<br>
but failed to find the string "segmentation fault".<br>
<br>
Best regards,<br>
--<br>
Tatsuo Ishii<br>
SRA OSS LLC<br>
English: <a href="http://www.sraoss.co.jp/index_en/" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en/</a><br>
Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>
<br>
> Hi,<br>
> <br>
> Thanks for the patch. I'll add it to the build of our pgpool containers.<br>
> <br>
> Since I've added the previous patch, I've got the feeling that the number<br>
> of segmentation faults has been reduced. It's hard to say though if the<br>
> problem is really fixed, because some of these crashes happen very<br>
> infrequently.<br>
> <br>
> Yesterday, we did see 2 crashes, both with the same backtrace, which I<br>
> haven't seen before:<br>
> #0 pool_do_auth (frontend=0x55bc42c93788, cp=0x7ff7ca34f6b8) at<br>
> auth/pool_auth.c:349<br>
> #1 0x000055bc41856ea1 in connect_backend (sp=0x55bc42c956d8,<br>
> frontend=0x55bc42c93788) at protocol/child.c:1102<br>
> #2 0x000055bc41859042 in get_backend_connection (frontend=0x55bc42c93788)<br>
> at protocol/child.c:2111<br>
> #3 0x000055bc41854fd5 in do_child (fds=0x55bc42c49320) at<br>
> protocol/child.c:416<br>
> #4 0x000055bc4181aa4c in fork_a_child (fds=0x55bc42c49320, id=5) at<br>
> main/pgpool_main.c:863<br>
> #5 0x000055bc418256f7 in exec_child_restart<br>
> (failover_context=0x7ffdf033c6b0, node_id=0) at main/pgpool_main.c:4684<br>
> #6 0x000055bc4181d1dc in failover () at main/pgpool_main.c:1739<br>
> #7 0x000055bc4181c79e in sigusr1_interrupt_processor () at<br>
> main/pgpool_main.c:1507<br>
> #8 0x000055bc418263c3 in check_requests () at main/pgpool_main.c:4934<br>
> #9 0x000055bc4181a32b in PgpoolMain (discard_status=0 '\000',<br>
> clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649<br>
> #10 0x000055bc418179e6 in main (argc=2, argv=0x7ffdf0349878) at<br>
> main/main.c:365<br>
> <br>
> Looking at the backtrace, this happens right after the place where the<br>
> first patch made a change. The slots in 'cp' are {0x0, 0x7ff7ca3479b8,<br>
> 0x7ff7ca347988, 0x0 <repeats 125 times>}. I've attached the logs and the<br>
> coredump. This crash happens during a different test (in both cases). It<br>
> crashes in the test for our normal upgrade procedure. The crash happens<br>
> when the cluster is in a fully consistent configuration, all 3 nodes are up<br>
> and healthy. node 0 is both watchdog leader and running the primary<br>
> database. Because node 0 is the first to be upgraded, we need to initiate a<br>
> failover. For this, we first restart pgpool to force it to drop its leader<br>
> status and wait until a new leader has been elected. We then stop the<br>
> database to trigger the failover. The crash seems to happen when the<br>
> database starts up again. In the attached logs, the crash was at node 0,<br>
> but in the other failure, the crash was at the exact same moment in the<br>
> test, but at node 2. I'm also seeing some suspicious backend status reports<br>
> from pcp_node_info at that time:<br>
> <br>
> Node 0 status: 172.29.30.1 5432 2 0.333333 up up standby standby 0 none<br>
> none 2024-06-13 12:05:00 at Thu Jun 13 12:05:00 CEST 2024<br>
> Node 0 status: 172.29.30.1 5432 3 0.333333 down up standby standby 130337<br>
> streaming async 2024-06-13 12:05:01 at Thu Jun 13 12:05:01 CEST<br>
> <br>
> ps. I had to resend the mail with the attachments bzip2ed because it was<br>
> hitting the size limit.<br>
> <br>
> Best regards,<br>
> Emond<br>
> <br>
> On Fri, 14 Jun 2024 at 02:28, Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp" target="_blank">ishii@sraoss.co.jp</a>> wrote:<br>
> <br>
>> > For the second crash (in pool_connection_pool.c:326), all slots are null:<br>
>> > (gdb) info locals<br>
>> > i = 0<br>
>> > freed = 0<br>
>> > closetime = 0<br>
>> > oldestp = 0x7f474f36e168<br>
>> > ret = 0x0<br>
>> > info = 0x55b511283469 <opt_sort><br>
>> > p = 0x7f474f36e168<br>
>> ><br>
>> > (gdb) p *p<br>
>> > $7 = {<br>
>> > info = 0x7f47470d5c08,<br>
>> > slots = {0x0 <repeats 128 times>}<br>
>> > }<br>
>> ><br>
>> > (gdb) p p->slots[0]<br>
>> > $10 = (POOL_CONNECTION_POOL_SLOT *) 0x0<br>
>> > (gdb) p p->slots[1]<br>
>> > $11 = (POOL_CONNECTION_POOL_SLOT *) 0x0<br>
>> > (gdb) p p->slots[2]<br>
>> > $12 = (POOL_CONNECTION_POOL_SLOT *) 0x0<br>
>><br>
>> A possible explanation is that Req_info->main_node_id, which is the<br>
>> smallest node id among the live backends, was -1. From the log file just<br>
>> before pid 30 segfaults:<br>
>><br>
>> 2024-05-31T07:16:45.908900+02:00 2024-05-31 07:16:45: pid 1: LOG:<br>
>> backend:0 is set to down status<br>
>> 2024-05-31T07:16:45.908939+02:00 2024-05-31 07:16:45: pid 1: DETAIL:<br>
>> backend:0 is DOWN on cluster leader "<a href="http://172.29.30.2:5432" rel="noreferrer" target="_blank">172.29.30.2:5432</a> Linux 216dfd5e07f2"<br>
>> 2024-05-31T07:16:45.908987+02:00 2024-05-31 07:16:45: pid 1: LOG:<br>
>> backend:1 is set to down status<br>
>> 2024-05-31T07:16:45.909017+02:00 2024-05-31 07:16:45: pid 1: DETAIL:<br>
>> backend:1 is DOWN on cluster leader "<a href="http://172.29.30.2:5432" rel="noreferrer" target="_blank">172.29.30.2:5432</a> Linux 216dfd5e07f2"<br>
>> 2024-05-31T07:16:45.909044+02:00 2024-05-31 07:16:45: pid 1: LOG:<br>
>> backend:2 is set to down status<br>
>> 2024-05-31T07:16:45.909071+02:00 2024-05-31 07:16:45: pid 1: DETAIL:<br>
>> backend:2 is DOWN on cluster leader "<a href="http://172.29.30.2:5432" rel="noreferrer" target="_blank">172.29.30.2:5432</a> Linux 216dfd5<br>
>><br>
>> That means all backends were down, and Req_info->main_node_id should<br>
>> have been -1 at the time. Usually, if all backends are down, pgpool will<br>
>> not accept connections from frontends. I think that when the check was<br>
>> performed, not all backends were down yet, and that they all went down<br>
>> right after it. If so, at line 270 of pool_connection_pool.c:<br>
>><br>
>> 270 if (MAIN_CONNECTION(p) == NULL)<br>
>><br>
>> MAIN_CONNECTION (that is, p->slots[-1], because main_node_id == -1)<br>
>> read garbage from memory just before the slots array, so the above<br>
>> condition was not met. Then the code proceeded to line 326:<br>
>><br>
>> 326 pool_free_startup_packet(CONNECTION_SLOT(p, i)->sp);<br>
>><br>
>> Here CONNECTION_SLOT(p, i) is NULL, so dereferencing it segfaulted.<br>
>><br>
>> I think using the MAIN_CONNECTION macro to manage the connection<br>
>> pool is incorrect. The backend that is alive in each connection<br>
>> pool slot could be different from the currently alive backend. So I wrote<br>
>> a function, in_use_backend_id(), that searches for live backend ids in a<br>
>> connection slot. Attached is a patch in this direction. I hope the<br>
>> patch fixes the second segfault case.<br>
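<br>
As a rough illustration of the idea described above (this is not the attached patch), a minimal sketch of scanning a pool entry for a populated slot instead of indexing it with main_node_id. The type names, MAX_BACKENDS and both helper functions are hypothetical stand-ins; only the count of 128 slots is taken from the gdb output earlier in the thread.<br>

```c
#include <stddef.h>

#define MAX_BACKENDS 128 /* assumption: matches the 128 slots shown by gdb */

/* Simplified stand-ins for pgpool's types; the real structs have more fields. */
typedef struct { void *sp; } Slot;
typedef struct { Slot *slots[MAX_BACKENDS]; } Pool;

/*
 * Hypothetical helper: return the lowest backend id whose slot in this
 * pool entry is populated, or -1 if no slot is populated.
 * Unlike p->slots[main_node_id], this never indexes the array with -1.
 */
static int first_used_backend_id(const Pool *p)
{
    for (int i = 0; i < MAX_BACKENDS; i++)
    {
        if (p->slots[i] != NULL)
            return i;
    }
    return -1;
}

/* Hypothetical helper: a pool entry is empty when no slot is populated. */
static int pool_entry_is_empty(const Pool *p)
{
    return first_used_backend_id(p) < 0;
}
```

With a check like pool_entry_is_empty() in place of the comparison at line 270, an entry whose slots are all NULL would be treated as empty regardless of the current value of main_node_id.<br>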
>><br>
>> > I hope this helps.<br>
>> ><br>
>> > Best regards,<br>
>> > Emond<br>
>> ><br>
>> > On Tue, 4 Jun 2024 at 04:56, Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp" target="_blank">ishii@sraoss.co.jp</a>> wrote:<br>
>> ><br>
>> >> > No worries. I hope you had a good trip. Last night we triggered the last<br>
>> >> > crash again. Is there anything we can do to make it easier for you to find<br>
>> >> > the cause?<br>
>> >><br>
>> >> It would be helpful if you could share some variable values from the<br>
>> >> core file. Since I don't have the pgpool load module that produced the<br>
>> >> core, I cannot inspect the variables using the core you provided.<br>
>> >><br>
>> >> > #0 connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at<br>
>> >> > protocol/child.c:1076<br>
>> >> > #1 0x000055803ce3d02a in get_backend_connection (frontend=0x55803eb08768)<br>
>> >> > at protocol/child.c:2112<br>
>> >> > #2 0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at<br>
>> >> > protocol/child.c:416<br>
>> >> > #3 0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at<br>
>> >> > main/pgpool_main.c:863<br>
>> >> > #4 0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000',<br>
>> >> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561<br>
>> >> > #5 0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at<br>
>> >> > main/main.c:365<br>
>> >><br>
>> >> Around protocol/child.c:1076:<br>
>> >> /* set DB node id */<br>
>> >> pool_set_db_node_id(CONNECTION(backend, i), i);<br>
>> >><br>
>> >> I want to see the values in the "backend" struct. Since the CONNECTION<br>
>> >> macro is used here, you have to do something like this in a gdb session:<br>
>> >><br>
>> >> p *backend->slots[0]<br>
>> >> p *backend->slots[1]<br>
>> >> p *backend->slots[2]<br>
>> >><br>
>> >> Best regards,<br>
>> >> --<br>
>> >> Tatsuo Ishii<br>
>> >> SRA OSS LLC<br>
>> >> English: <a href="http://www.sraoss.co.jp/index_en/" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en/</a><br>
>> >> Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>
>> >><br>
>><br>
</blockquote></div>