[pgpool-general: 9143] Re: Another segmentation fault

Tatsuo Ishii ishii at sraoss.co.jp
Fri Jun 21 14:19:52 JST 2024


>> Ah, no problem. Here it is.
> 
> Thanks. I will run gdb against the core files using the load module.

Here is what I learned from the cores and log files so far.

>>> >> > I've managed to get some more logging for 2 crashes. The first, in build
>>> >> > 14132 (with coredump
>>> >> > core.pgpool.208.d678e2ac33af434981c93683b6c42020.102976.1718650919000000.bz2):
>>> >> > #0  pool_read_message_length (cp=0x7f53440bb548) at protocol/pool_proto_modules.c:4483
>>> >> > #1  0x000056386aa1f530 in pool_do_auth (frontend=0x56386b6c4788, cp=0x7f53440bb548) at auth/pool_auth.c:384
>>> >> > #2  0x000056386aa33ea1 in connect_backend (sp=0x56386b6c66d8, frontend=0x56386b6c4788) at protocol/child.c:1102
>>> >> > #3  0x000056386aa36042 in get_backend_connection (frontend=0x56386b6c4788) at protocol/child.c:2111
>>> >> > #4  0x000056386aa31fd5 in do_child (fds=0x56386b67a320) at protocol/child.c:416
>>> >> > #5  0x000056386a9f7a4c in fork_a_child (fds=0x56386b67a320, id=30) at main/pgpool_main.c:863
>>> >> > #6  0x000056386a9fa9ed in reaper () at main/pgpool_main.c:2004
>>> >> > #7  0x000056386aa0340e in check_requests () at main/pgpool_main.c:4953
>>> >> > #8  0x000056386a9f732b in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:649
>>> >> > #9  0x000056386a9f49e6 in main (argc=2, argv=0x7ffeeb474048) at main/main.c:365

First of all, I identified the segfaulted process as pid 144, since there
was only one line containing "segmentation fault" in the log. Using gdb I got the following:

(gdb) p *cp
$2 = {info = 0x7f533c074c08, slots = {0x0, 0x7f53440b1880, 0x7f53440b1838, 0x0 <repeats 125 times>}}
(gdb) p *Req_info
[snip]
  request_queue_tail = 5, main_node_id = 0, primary_node_id = 1, conn_counter = 3, switching = 0 '\000', 
[snip]
(gdb) p my_main_node_id
$4 = 1
(gdb) p pool_config->backend_desc->backend_info[0]
$5 = {backend_hostname = "172.29.30.1", '\000' <repeats 242 times>, backend_port = 5432, backend_status = CON_UP, 
[snip]
(gdb) p pool_config->backend_desc->backend_info[1]
$6 = {backend_hostname = "172.29.30.2", '\000' <repeats 242 times>, backend_port = 5432, backend_status = CON_UP, 
[snip]
(gdb) p pool_config->backend_desc->backend_info[2]
$7 = {backend_hostname = "172.29.30.3", '\000' <repeats 242 times>, backend_port = 5432, backend_status = CON_UP, 
[snip]

From these we can see:
1) Backends 0, 1, and 2 were up.
2) However, process 144 believed that the first live backend id was 1 (my_main_node_id was 1).
3) Thus slot[0] was empty (see the stand-alone sketch below).
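
To make 3) concrete, here is a minimal stand-alone illustration (not
pgpool source; the types and names are stand-ins) of why any access
that resolves the main node to 0 has to crash in this process:

/* null_slot_demo.c -- illustration only, not pgpool code.
 * pid 144 skipped backend 0 while connecting, so its slots[0] stayed
 * NULL (see the gdb output above: slots = {0x0, 0x7f53440b1880, ...}).
 * Indexing the slots with node id 0 therefore dereferences NULL.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int fd; } SLOT;   /* stand-in for a per-backend pool slot */

int main(void)
{
    SLOT *slots[3] = { NULL, calloc(1, sizeof(SLOT)), calloc(1, sizeof(SLOT)) };
    int   main_node_id = 0;        /* the shared value after the failback  */

    if (slots[main_node_id] == NULL)
        printf("slots[%d] is NULL: dereferencing it is the crash in pid 144\n",
               main_node_id);

    free(slots[1]);
    free(slots[2]);
    return 0;
}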

At the place where it segfaulted:

	pool_read(CONNECTION(cp, MAIN_NODE_ID), &length0, sizeof(length0));

MAIN_NODE_ID returned 0 since main_node_id was 0. Notice that
my_main_node_id != main_node_id; usually they should be the same. I
think the reason they differed is that after my_main_node_id was
copied from main_node_id, main_node_id was changed by a failback
event. Indeed, we can see this in the log file:

(main_node_id was copied to my_main_node_id. Also at that point node 0 was down)
2024-06-17T19:01:58.473379+00:00 2024-06-17 21:01:58: pid 144: DEBUG:  initializing backend status

(tried to create a connection to backend 0)
2024-06-17T19:01:58.473425+00:00 2024-06-17 21:01:58: pid 144: DEBUG:  creating new connection to backend
2024-06-17T19:01:58.473458+00:00 2024-06-17 21:01:58: pid 144: DETAIL:  connecting 0 backend
2024-06-17T19:01:58.473499+00:00 2024-06-17 21:01:58: pid 144: DEBUG:  creating new connection to backend

(but it was skipped because pid 144 believed that node 0 was down)
2024-06-17T19:01:58.473534+00:00 2024-06-17 21:01:58: pid 144: DETAIL:  skipping backend slot 0 because backend_status = 3

(tried to create a connection to backend 1)
2024-06-17T19:01:58.473566+00:00 2024-06-17 21:01:58: pid 144: DEBUG:  creating new connection to backend
2024-06-17T19:01:58.473600+00:00 2024-06-17 21:01:58: pid 144: DETAIL:  connecting 1 backend

(tried to create a connection to backend 2)
2024-06-17T19:01:58.473632+00:00 2024-06-17 21:01:58: pid 144: DEBUG:  creating new connection to backend
2024-06-17T19:01:58.473661+00:00 2024-06-17 21:01:58: pid 144: DETAIL:  connecting 2 backend

(node 0 was failed back)
2024-06-17T19:01:58.949957+00:00 2024-06-17 21:01:58: pid 1: LOG:  failover: set new main node: 0

Probably at that very moment, pid 144 tried to execute:
	pool_read(CONNECTION(cp, MAIN_NODE_ID), &length0, sizeof(length0));

and segfaulted, because MAIN_NODE_ID returned 0 while slot 0 was
empty. To fix this, I think MAIN_NODE_ID (actually
pool_virtual_main_db_node_id()) should be changed so that it returns
my_main_node_id instead of REAL_MAIN_NODE_ID. The attached patch
(fix_main_node_id.patch) does this.
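
For clarity, the direction of the change can be sketched as below.
This is only an illustration of the idea, not the attached patch;
everything except my_main_node_id and REAL_MAIN_NODE_ID is a stand-in
name:

/* fix_idea_demo.c -- sketch of the idea behind fix_main_node_id.patch,
 * not the patch itself.  The id used to pick the "main" slot should be
 * the child's own snapshot (my_main_node_id), not the shared value
 * that another process can change mid-session on failback.
 */
#include <stdio.h>

static int shared_main_node_id = 1;  /* plays the role of Req_info->main_node_id */
static int my_main_node_id     = 1;  /* per-child copy taken while connecting    */

#define REAL_MAIN_NODE_ID (shared_main_node_id)

/* before: the virtual main node id effectively came from the shared value */
static int virtual_main_node_id_before(void) { return REAL_MAIN_NODE_ID; }

/* after: return the snapshot the child actually connected with */
static int virtual_main_node_id_after(void) { return my_main_node_id; }

int main(void)
{
    shared_main_node_id = 0;         /* the failback arrives from another process */

    printf("before the patch: node %d (a slot this child never opened)\n",
           virtual_main_node_id_before());
    printf("after the patch:  node %d (the slot it really holds)\n",
           virtual_main_node_id_after());
    return 0;
}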

>>> >> > The second, in build 14136:

I examined 14136.log.bz2 and suppose the pid is 15 (there are multiple
segfaults in the log and I am not sure 15 is the right pid, but anyway...)
>>> >> > core.pgpool.208.6a941167afab4df68e5f050ef29c7d1b.5129.1718739212000000.bz2.
>>> >> > #0  0x0000561e716f1fa7 in pool_create_cp () at protocol/pool_connection_pool.c:293
>>> >> > #1  0x0000561e716ddbb5 in connect_backend (sp=0x561e726d06b8, frontend=0x561e726ce768) at protocol/child.c:1051
>>> >> > #2  0x0000561e716e0042 in get_backend_connection (frontend=0x561e726ce768) at protocol/child.c:2111
>>> >> > #3  0x0000561e716dbfd5 in do_child (fds=0x561e72684320) at protocol/child.c:416
>>> >> > #4  0x0000561e716a1a4c in fork_a_child (fds=0x561e72684320, id=1) at main/pgpool_main.c:863
>>> >> > #5  0x0000561e716a0e30 in PgpoolMain (discard_status=0 '\000', clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>>> >> > #6  0x0000561e7169e9e6 in main (argc=2, argv=0x7ffcf8eef198) at main/main.c:365

In this case it seems all backends were down:
gdb ~/pgpool ~/core.pgpool.208.6a941167afab4df68e5f050ef29c7d1b.5129.1718739212000000
[snip]
(gdb) p *Req_info
[snip]
  main_node_id = -1, primary_node_id = 0, conn_counter = 10, 
[snip]
(gdb) p pool_config->backend_desc->backend_info[0]
$4 = {backend_hostname = "172.29.30.1", '\000' <repeats 242 times>, 
  backend_port = 5432, backend_status = CON_DOWN, 
[snip]
(gdb) p pool_config->backend_desc->backend_info[1]
$5 = {backend_hostname = "172.29.30.2", '\000' <repeats 242 times>, 
  backend_port = 5432, backend_status = CON_DOWN, 
[snip]
(gdb) p pool_config->backend_desc->backend_info[2]
$6 = {backend_hostname = "172.29.30.3", '\000' <repeats 242 times>, 
  backend_port = 5432, backend_status = CON_DOWN, 
[snip]

From the place where the segfault happened:

		ereport(DEBUG1,
				(errmsg("creating connection pool"),
				 errdetail("user: %s database: %s closetime: %ld",
						   MAIN_CONNECTION(p)->sp->user,
						   MAIN_CONNECTION(p)->sp->database,
						   MAIN_CONNECTION(p)->closetime)));

apparently MAIN_CONNECTION is the problem here, since main_node_id = -1,
which indicates there is no usable backend. Usually, at the very
beginning of a pgpool child process, it checks whether all backends are
down and, if so, refuses to accept client connections. But this does
not prevent segfaults after that check. I think my fix_segfault2.patch
should fix the problem, since it stops using the MAIN_CONNECTION macro
here. I am going to push the patch soon.
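
The shape of the fix, as a stand-alone sketch (again, only an
illustration of avoiding MAIN_CONNECTION, not the patch itself; the
types and the helper name are made up):

/* avoid_main_connection_demo.c -- illustration only, not the patch.
 * With every backend down main_node_id is -1, so anything built on
 * MAIN_CONNECTION(p) ends up indexing slots[-1].  Looking up a live
 * slot explicitly, and skipping the detail when there is none, avoids
 * the bad index.
 */
#include <stdio.h>
#include <stddef.h>

#define NUM_SLOTS 3                          /* stands in for the backend count */

typedef struct { const char *user; } SLOT;   /* stand-in for a pool slot        */
typedef struct { SLOT *slots[NUM_SLOTS]; } POOL;

static SLOT *first_live_slot(POOL *p)
{
    for (int i = 0; i < NUM_SLOTS; i++)
        if (p->slots[i] != NULL)
            return p->slots[i];
    return NULL;
}

int main(void)
{
    POOL  p = {{ NULL, NULL, NULL }};        /* all backends down */
    SLOT *s = first_live_slot(&p);

    if (s != NULL)
        printf("creating connection pool (user: %s)\n", s->user);
    else
        printf("creating connection pool (no live backend slot)\n");
    return 0;
}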

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix_main_node_id.patch
Type: text/x-patch
Size: 859 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240621/2a6f36e5/attachment.bin>

