[pgpool-general: 9127] Re: Another segmentation fault

Tatsuo Ishii ishii at sraoss.co.jp
Tue Jun 11 20:53:35 JST 2024


> Hi,
> 
>> Hi,
>> 
>> For the first backtrace (in protocol/child.c:1076), it tries to access slot
>> 2, which is null:
>> (gdb) info locals
>> frontend_auth_cxt = 0x55803d0ff040 <errordata+192>
>> oldContext = 0x55803eacd070
>> save_exception_stack = 0x7ffc8cdd0fb0
>> save_context_stack = 0x0
>> local_sigjmp_buf = {{__jmpbuf = {0, -4847230081711116359, 0,
>> 140722671836608, 94009268320888, 140437064355872, -4847230081734185031,
>> -1710719621514694727}, __mask_was_saved = 0, __saved_mask = {__val =
>> {12884901889, 94009295947704, 94009295947728, 0, 140722671782208,
>> 94009266127820, 60129542144, 140722671782336,
>>         94009266128191, 0, 0, 94009266967008, 60129543624, 140722671782304,
>> 94009265945097, 94009266967017}}}}
>> backend = 0x7fba0c407168
>> topmem_sp = 0x7fba0c4037a0
>> topmem_sp_set = 1 '\001'
>> i = 2
>> 
>> (gdb) p *backend
>> $5 = {
>>   info = 0x7fba04272c08,
>>   slots = {0x7fba0c403a00, 0x7fba0c403a30, 0x0 <repeats 126 times>}
>> }
>> (gdb) p *backend->slots[0]
>> $6 = {
>>   sp = 0x7fba0c4037a0,
>>   pid = 574235237,
>>   key = 1951614039,
>>   con = 0x7fba0c406548,
>>   closetime = 0
>> }
>> (gdb) p *backend->slots[1]
>> $7 = {
>>   sp = 0x7fba0c4037a0,
>>   pid = 21,
>>   key = 1024,
>>   con = 0x7fba0c40b9a8,
>>   closetime = 0
>> }
>> (gdb) p *backend->slots[2]
>> Cannot access memory at address 0x0
> 
> Thank you for the info.  This is really weird. By the time execution
> reaches this point, backend->slots[2] must be filled like slots[0]
> and slots[1], since VALID_BACKEND(2) returns true, which means the
> connection to backend 2 is valid. I need more time to investigate
> this.

>>> > #0  connect_backend (sp=0x55803eb0a6b8, frontend=0x55803eb08768) at
>>> > protocol/child.c:1076
>>> > #1  0x000055803ce3d02a in get_backend_connection
>>> (frontend=0x55803eb08768)
>>> > at protocol/child.c:2112
>>> > #2  0x000055803ce38fd5 in do_child (fds=0x55803eabea90) at
>>> > protocol/child.c:416
>>> > #3  0x000055803cdfea4c in fork_a_child (fds=0x55803eabea90, id=13) at
>>> > main/pgpool_main.c:863
>>> > #4  0x000055803cdfde30 in PgpoolMain (discard_status=0 '\000',
>>> > clear_memcache_oidmaps=0 '\000') at main/pgpool_main.c:561
>>> > #5  0x000055803cdfb9e6 in main (argc=2, argv=0x7ffc8cdddda8) at
>>> > main/main.c:365

I think I found a possible code path for the first case (the crash at
child.c:1076). Unfortunately I couldn't find a test case to reproduce
it reliably, so my analysis is purely theoretical. The scenario also
depends heavily on timing and is probably quite rare. If you only see
the crash very seldom, that may be why.

Anyway, here is my thinking: the problem is in new_connection(), which
is called in the code path
connect_backend()->pool_create_cp()->new_connection().  In
new_connection(), there is this code fragment:

		/*
		 * Make sure that the global backend status in the shared memory
		 * agrees the local status checked by VALID_BACKEND. It is possible
		 * that the local status is up, while the global status has been
		 * changed to down by failover.
		 */
A-->		if (BACKEND_INFO(i).backend_status != CON_UP &&
			BACKEND_INFO(i).backend_status != CON_CONNECT_WAIT)
		{
			ereport(DEBUG1,
					(errmsg("creating new connection to backend"),
					 errdetail("skipping backend slot %d because global backend_status = %d",
							   i, BACKEND_INFO(i).backend_status)));

			/* sync local status with global status */
B-->			*(my_backend_status[i]) = BACKEND_INFO(i).backend_status;
			continue;
		}

Here BACKEND_INFO(i).backend_status is the backend status in *shared
memory* and can be changed by another process. Pgpool checks whether
the backend status in shared memory has changed (A). If it has,
Pgpool copies the new status into my_backend_status[i], which is in
process-local memory (B). My guess is that pgpool took this code path
for backend id 2 (i == 2), but BACKEND_INFO(i).backend_status had been
changed to CON_UP or CON_CONNECT_WAIT by another process between A and
B. Later on, at child.c:1076:

	pool_set_db_node_id(CONNECTION(backend, i), i);

was executed because VALID_BACKEND, which refers to
*(my_backend_status[i]), returned true (the status is CON_UP or
CON_CONNECT_WAIT).  It then crashes because CONNECTION(backend, i) was
never set in new_connection().

To deal with the race condition, I think
BACKEND_INFO(i).backend_status needs to be copied into a local
variable and that copy evaluated, rather than reading it directly from
shared memory each time.
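
Below is a minimal sketch of that idea (not the attached patch itself;
using BACKEND_STATUS as the type of the snapshot is my assumption).
The shared memory value is read exactly once, so the value tested at A
and the value stored at B cannot differ:

		/*
		 * Take a single snapshot of the shared memory status so that
		 * the value we test (A) is the same value we copy into the
		 * local status (B), even if another process changes the shared
		 * status concurrently.
		 */
		BACKEND_STATUS status = BACKEND_INFO(i).backend_status;

		if (status != CON_UP && status != CON_CONNECT_WAIT)
		{
			ereport(DEBUG1,
					(errmsg("creating new connection to backend"),
					 errdetail("skipping backend slot %d because global backend_status = %d",
							   i, status)));

			/* sync local status with the snapshot we just evaluated */
			*(my_backend_status[i]) = status;
			continue;
		}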

Also, at line 1072 in child.c there is:

	if (VALID_BACKEND(i))

It would be better to additionally check that CONNECTION_SLOT(i) is
not NULL.
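
A minimal sketch of that guard (assuming CONNECTION_SLOT takes the
backend pool and the slot index, like CONNECTION does; the attached
patch may differ in detail):

	if (VALID_BACKEND(i) && CONNECTION_SLOT(backend, i) != NULL)
	{
		/* the slot was really created, so dereferencing it is safe */
		pool_set_db_node_id(CONNECTION(backend, i), i);
	}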

Attached is a patch implementing the above.

>> For the second crash (in pool_connection_pool.c:326), all slots are null:
>> (gdb) info locals
>> i = 0
>> freed = 0
>> closetime = 0
>> oldestp = 0x7f474f36e168
>> ret = 0x0
>> info = 0x55b511283469 <opt_sort>
>> p = 0x7f474f36e168
>> 
>> (gdb) p *p
>> $7 = {
>>   info = 0x7f47470d5c08,
>>   slots = {0x0 <repeats 128 times>}
>> }
>> 
>> (gdb) p p->slots[0]
>> $10 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> (gdb) p p->slots[1]
>> $11 = (POOL_CONNECTION_POOL_SLOT *) 0x0
>> (gdb) p p->slots[2]
>> $12 = (POOL_CONNECTION_POOL_SLOT *) 0x0
> 
> This is strange too. How come p->slots[0-2] are 0 if VALID_BACKEND
> returns true?  I will look into this.

I will continue to work on the second crash case.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: v1-0001-Fix-segfault-in-a-child-process.patch
Type: text/x-patch
Size: 3738 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240611/5663f53c/attachment.bin>

