[pgpool-general: 9144] Re: Another segmentation fault

Emond Papegaaij emond.papegaaij at gmail.com
Fri Jun 21 15:30:13 JST 2024


Hi,

Thanks for the very detailed analysis. These kinds of concurrency issues
are very hard to get right. The tests we run are very good at triggering
these kinds of race conditions, because many steps are executed one after
another in very rapid succession. I've added the patch to our build.

You are right that in 14136 several processes segfaulted. I'm not sure if
the coredump was for pid 15 (you might be able to find the correct pid
using the timestamp from the coredump), but I've noticed that in cases
where several processes segfault within just a few seconds, the crashes
are almost always the same. Therefore I didn't include the other
coredumps, as they would give you the same backtrace and variables.

Best regards,
Emond

On Fri, Jun 21, 2024 at 07:19, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> >> Ah, no problem. Here it is.
> >
> > Thanks. I will run gdb against the core files using the load module.
>
> Here is what I learned from the cores and log files so far.
>
> > I've managed to get some more logging for 2 crashes. The first, in build
> > 14132 (with coredump
> > core.pgpool.208.d678e2ac33af434981c93683b6c42020.102976.1718650919000000.bz2):
> > #0  pool_read_message_length (cp=0x7f53440bb548) at protocol/pool_proto_modules.c:4483
> > #1  0x000056386aa1f530 in pool_do_auth (frontend=0x56386b6c4788, cp=0x7f53440bb548) at auth/pool_auth.c:384
> > #2  0x000056386aa33ea1 in connect_backend (sp=0x56386b6c66d8, frontend=0x56386b6c4788) at protocol/child.c:1102
> > #3  0x000056386aa36042 in get_backend_connection (frontend=0x56386b6c4788) at protocol/child.c:2111
> > #4  0x000056386aa31fd5 in do_child (fds=0x56386b67a320) at protocol/child.c:416
> > #5  0x000056386a9f7a4c in fork_a_child (fds=0x56386b67a320, id=30) at main/pgpool_main.c:863
> > #6  0x000056386a9fa9ed in reaper () at main/pgpool_main.c:2004
> > #7  0x000056386aa0340e in check_requests () at main/pgpool_main.c:4953
> > #8  0x000056386a9f732b in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
> > #9  0x000056386a9f49e6 in main (argc=2, argv=0x7ffeeb474048) at main/main.c:365
>
> First of all, I identified the segfaulted pid as 144, since there was
> only one line containing "segmentation fault". Using gdb I got the
> following:
>
> (gdb) p *cp
> $2 = {info = 0x7f533c074c08, slots = {0x0, 0x7f53440b1880, 0x7f53440b1838,
> 0x0 <repeats 125 times>}}
> (gdb) p *Req_info
> [snip]
>   request_queue_tail = 5, main_node_id = 0, primary_node_id = 1,
> conn_counter = 3, switching = 0 '\000',
> [snip]
> (gdb) p my_main_node_id
> $4 = 1
> (gdb) p pool_config->backend_desc->backend_info[0]
> $5 = {backend_hostname = "172.29.30.1", '\000' <repeats 242 times>,
> backend_port = 5432, backend_status = CON_UP,
> [snip]
> (gdb) p pool_config->backend_desc->backend_info[1]
> $6 = {backend_hostname = "172.29.30.2", '\000' <repeats 242 times>,
> backend_port = 5432, backend_status = CON_UP,
> [snip]
> (gdb) p pool_config->backend_desc->backend_info[2]
> $7 = {backend_hostname = "172.29.30.3", '\000' <repeats 242 times>,
> backend_port = 5432, backend_status = CON_UP,
> [snip]
>
> From these:
> 1) Backends 0, 1, and 2 were up.
> 2) However, process 144 believed that the first live backend id was 1
> (my_main_node_id was 1).
> 3) Thus slots[0] was empty.
>
> At the place where it segfaulted:
>
>         pool_read(CONNECTION(cp, MAIN_NODE_ID), &length0, sizeof(length0));
>
> MAIN_NODE_ID returned 0 since main_node_id was 0, but slots[0] was NULL.
> Notice that my_main_node_id != main_node_id; usually they should be the
> same. I think the reason they differed is that, after my_main_node_id
> was copied from main_node_id, main_node_id was changed by a failback
> event. We can actually see this in the log file (quoted below, after a
> short illustrative sketch).
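>
> For illustration only, here is a minimal, self-contained model of the
> race. The types, macros, and the way shared state is read here are
> assumptions made for the sketch; they are not the real pgpool-II
> definitions.
>
> /* Simplified model of the race observed above; not the real pgpool code. */
> #include <stdio.h>
>
> #define MAX_NUM_BACKENDS 128
>
> typedef struct { int fd; } SLOT;                  /* stand-in for a backend slot    */
> typedef struct { SLOT *slots[MAX_NUM_BACKENDS]; } POOL;
>
> /* Shared state, updated by the main process on failover/failback. */
> static volatile int shared_main_node_id = 1;      /* node 0 was down at child start */
>
> /* Per-child snapshot, taken when the backend connections were created. */
> static int my_main_node_id = 1;
>
> /* Stand-in for MAIN_NODE_ID: it reads the shared value, which can change
>  * underneath the child process at any time. */
> static int main_node_id(void) { return shared_main_node_id; }
>
> int main(void)
> {
>     POOL cp = {0};
>     SLOT s1, s2;
>
>     /* The child skipped slot 0 (it believed node 0 was down) and only
>      * filled slots 1 and 2, matching "slots = {0x0, 0x7f5344...}" above. */
>     cp.slots[1] = &s1;
>     cp.slots[2] = &s2;
>
>     /* Meanwhile the main process fails node 0 back: "set new main node: 0". */
>     shared_main_node_id = 0;
>
>     /* pool_read(CONNECTION(cp, MAIN_NODE_ID), ...) now indexes slot 0,
>      * which this child never created: a NULL dereference. */
>     SLOT *slot = cp.slots[main_node_id()];
>     if (slot == NULL)
>         printf("would crash here: slots[%d] is NULL, my_main_node_id = %d\n",
>                main_node_id(), my_main_node_id);
>     return 0;
> }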
>
> (main_node_id was copied to my_main_node_id. Also at that point node 0 was
> down)
> 2024-06-17T19:01:58.473379+00:00 2024-06-17 21:01:58: pid 144: DEBUG:
> initializing backend status
>
> (tried to create a connection to backend 0)
> 2024-06-17T19:01:58.473425+00:00 2024-06-17 21:01:58: pid 144: DEBUG:
> creating new connection to backend
> 2024-06-17T19:01:58.473458+00:00 2024-06-17 21:01:58: pid 144: DETAIL:
> connecting 0 backend
> 2024-06-17T19:01:58.473499+00:00 2024-06-17 21:01:58: pid 144: DEBUG:
> creating new connection to backend
>
> (but it was skipped because pid 144 believed that node 0 was down)
> 2024-06-17T19:01:58.473534+00:00 2024-06-17 21:01:58: pid 144: DETAIL:
> skipping backend slot 0 because backend_status = 3
>
> (tried to create a connection to backend 1)
> 2024-06-17T19:01:58.473566+00:00 2024-06-17 21:01:58: pid 144: DEBUG:
> creating new connection to backend
> 2024-06-17T19:01:58.473600+00:00 2024-06-17 21:01:58: pid 144: DETAIL:
> connecting 1 backend
>
> (tried to create a connection to backend 2)
> 2024-06-17T19:01:58.473632+00:00 2024-06-17 21:01:58: pid 144: DEBUG:
> creating new connection to backend
> 2024-06-17T19:01:58.473661+00:00 2024-06-17 21:01:58: pid 144: DETAIL:
> connecting 2 backend
>
> (node 0 was failed back)
> 2024-06-17T19:01:58.949957+00:00 2024-06-17 21:01:58: pid 1: LOG:
> failover: set new main node: 0
>
> Probably at around the same time, pid 144 tried to execute:
>         pool_read(CONNECTION(cp, MAIN_NODE_ID), &length0, sizeof(length0));
>
> and segfaulted because MAIN_NODE_ID returned 0 while slots[0] was NULL.
> To fix this, I think MAIN_NODE_ID (actually pool_virtual_main_db_node_id())
> should be changed so that it returns my_main_node_id instead of
> REAL_MAIN_NODE_ID. The attached patch (fix_main_node_id.patch) does this.
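>
> To make the idea concrete, here is a rough standalone sketch of the
> direction of the fix. The stubs below merely stand in for pgpool state
> (my_main_node_id, Req_info->main_node_id); the attached
> fix_main_node_id.patch is the authoritative change.
>
> /* Sketch only -- fix_main_node_id.patch is the real change. */
> static int my_main_node_id = 1;           /* per-child snapshot               */
> static int shared_main_node_id = 0;       /* stand-in for Req_info value that */
>                                           /* failover/failback can change     */
> #define REAL_MAIN_NODE_ID (shared_main_node_id)
>
> /* Idea of the fix: prefer the child's stable snapshot over the shared
>  * value, so a concurrent failback cannot redirect MAIN_NODE_ID to a
>  * slot the child never created. */
> static int
> pool_virtual_main_db_node_id(void)
> {
>     if (my_main_node_id >= 0)
>         return my_main_node_id;
>     return REAL_MAIN_NODE_ID;
> }
>
> int main(void) { return pool_virtual_main_db_node_id() == 1 ? 0 : 1; }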
>
> > The second, in build 14136:
>
> I examined 14136.log.bz2 and suppose the pid is 15 (there are multiple
> segfaults in the log and I am not sure 15 is the right pid; anyway...).
>
> > core.pgpool.208.6a941167afab4df68e5f050ef29c7d1b.5129.1718739212000000.bz2.
> > #0  0x0000561e716f1fa7 in pool_create_cp () at protocol/pool_connection_pool.c:293
> > #1  0x0000561e716ddbb5 in connect_backend (sp=0x561e726d06b8, frontend=0x561e726ce768) at protocol/child.c:1051
> > #2  0x0000561e716e0042 in get_backend_connection (frontend=0x561e726ce768) at protocol/child.c:2111
> > #3  0x0000561e716dbfd5 in do_child (fds=0x561e72684320) at protocol/child.c:416
> > #4  0x0000561e716a1a4c in fork_a_child (fds=0x561e72684320, id=1) at main/pgpool_main.c:863
> > #5  0x0000561e716a0e30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
> > #6  0x0000561e7169e9e6 in main (argc=2, argv=0x7ffcf8eef198) at main/main.c:365
>
> In this case it seems all backends were down:
> gdb ~/pgpool ~/core.pgpool.208.6a941167afab4df68e5f050ef29c7d1b.5129.1718739212000000
> [snip]
> (gdb) p *Req_info
> [snip]
>   main_node_id = -1, primary_node_id = 0, conn_counter = 10,
> [snip]
> (gdb) p pool_config->backend_desc->backend_info[0]
> $4 = {backend_hostname = "172.29.30.1", '\000' <repeats 242 times>,
>   backend_port = 5432, backend_status = CON_DOWN,
> [snip]
> (gdb) p pool_config->backend_desc->backend_info[1]
> $5 = {backend_hostname = "172.29.30.2", '\000' <repeats 242 times>,
>   backend_port = 5432, backend_status = CON_DOWN,
> [snip]
> (gdb) p pool_config->backend_desc->backend_info[2]
> $6 = {backend_hostname = "172.29.30.3", '\000' <repeats 242 times>,
>   backend_port = 5432, backend_status = CON_DOWN,
> [snip]
>
> From the place where the segfault happened:
>
>         ereport(DEBUG1,
>                 (errmsg("creating connection pool"),
>                  errdetail("user: %s database: %s closetime: %ld",
>                            MAIN_CONNECTION(p)->sp->user,
>                            MAIN_CONNECTION(p)->sp->database,
>                            MAIN_CONNECTION(p)->closetime)));
>
> Apparently MAIN_CONNECTION is the problem here, since main_node_id = -1,
> which indicates there is no usable backend. Usually, at the very
> beginning of a pgpool child process, it checks whether all backends are
> down and, if so, refuses to accept client connections. But this does not
> prevent segfaults after the check. I think my fix_segfault2.patch should
> fix the problem since it does not use the MAIN_CONNECTION macro here. I
> am going to push the patch soon.
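>
> Again for illustration only, a standalone sketch of the kind of guard
> that avoids this crash; the types here are simplified stand-ins, and
> fix_segfault2.patch is the actual fix.
>
> /* Sketch of guarding against main_node_id == -1 before dereferencing a
>  * backend slot; simplified stand-ins, not the real pgpool structures. */
> #include <stdio.h>
>
> #define MAX_NUM_BACKENDS 128
>
> typedef struct { const char *user; const char *database; } StartupPacket;
> typedef struct { StartupPacket *sp; long closetime; } SLOT;
> typedef struct { SLOT *slots[MAX_NUM_BACKENDS]; } POOL;
>
> int main(void)
> {
>     POOL p = {0};
>     int main_node_id = -1;      /* all backends down, as in the second core */
>
>     /* Instead of MAIN_CONNECTION(p)->sp->user, check for a valid node id
>      * and a non-NULL slot before touching it. */
>     if (main_node_id < 0 || p.slots[main_node_id] == NULL)
>         fprintf(stderr, "no usable backend; skipping the debug message\n");
>     else
>         printf("user: %s database: %s closetime: %ld\n",
>                p.slots[main_node_id]->sp->user,
>                p.slots[main_node_id]->sp->database,
>                p.slots[main_node_id]->closetime);
>     return 0;
> }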
>
>