[pgpool-general: 9072] Re: Segmentation after switchover
Emond Papegaaij
emond.papegaaij at gmail.com
Thu Apr 4 16:10:58 JST 2024
On Thu, 4 Apr 2024 at 03:22, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> > I dove into the code, and I think I've found the cause of the error. Just
> > prior to crashing, it reports "find_primary_node:
> > make_persistent_db_connection_noerror failed on node 0". This must come
> > from pgpool_main.c:2782. This means that slots[0] is NULL. Then, at
> > pgpool_main.c:2791 it enters verify_backend_node_status with this slots
> > array. At lines 2569-2579 it loops over these slots,
> > calling get_server_version for every slot, including slots[0], which is
> > NULL. This crashes when get_server_version calls get_query_result, which
> > tries to dereference slots[0]->con. At pgpool_main.c:2456 there is an
> > explicit check for NULL; that check is missing in the other for loop, and
> > it is also missing at line 2609.
>
> But there's a check at line 2604 of pgpool_main.c:
>
> if (pool_node_status[j] == POOL_NODE_STATUS_STANDBY)
>
> If pool_node_status[j] is POOL_NODE_STATUS_STANDBY, the target node
> (0) must have been alive in the past. I suspect node 0 went down after
> pool_node_status[j] was updated. I should have checked slot
> availability before calling get_query_result at 2609.
>
I wasn't sure about line 2609, but adding a check does make sense. The loop
at lines 2569-2579 is definitely broken, and that is where the segfault
currently happens. I've attached a patch (against 4.5.1) that should
address this issue.
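
The kind of NULL guard discussed above can be sketched in isolation as
follows. This is only an illustration, not the attached patch: the
Connection and ConnectionSlot types and both helper functions are
hypothetical, simplified stand-ins, and the real get_server_version in
pgpool does considerably more than read a cached integer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for pgpool's slot types. */
typedef struct { int server_version; } Connection;
typedef struct { Connection *con; } ConnectionSlot;

/* Return the cached version for a node, or -1 when the slot is unusable. */
static int slot_server_version(ConnectionSlot **slots, int node_id)
{
    assert(slots != NULL && node_id >= 0);

    /* A failed connection attempt leaves slots[node_id] == NULL,
     * so check before dereferencing ->con. */
    if (slots[node_id] == NULL || slots[node_id]->con == NULL)
        return -1;
    return slots[node_id]->con->server_version;
}

/* Count the nodes whose slot can safely be queried. */
static int count_usable_slots(ConnectionSlot **slots, int num_nodes)
{
    int usable = 0;

    for (int i = 0; i < num_nodes; i++)
        if (slot_server_version(slots, i) >= 0)
            usable++;
    return usable;
}
```

With a slots array such as { NULL, &s, NULL }, the loop skips nodes 0 and 2
instead of crashing on them.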
> As for the crash in health_check.c, I think I have found the cause. The
> connection info is cached in HealthCheckMemoryContext, which is
> pointed to by "slot" (a static variable). When an error occurred,
> ereport(ERROR) jumps to line 159. Then the code proceeds to the for
> loop starting at line 171. At line 174
> MemoryContextResetAndDeleteChildren(HealthCheckMemoryContext) is
> called and the connection info is discarded [1]. Problem is, the value
> of "slot" remains, which means that slot points to freed memory. We
> should have cleared slot there.
>
> Same issue is found in pool_worker_child.c.
>
> Attached is the patch for the above.
>
Great. I've added your patch to our build, together with the patch attached
to this mail, and I'm rerunning the tests. I did have to alter the patch for
pool_worker_child.c a bit to make it apply on 4.5.1.
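
The dangling-pointer problem described in the quoted text can be reproduced
in miniature. The sketch below is an assumption-laden toy, not pgpool code:
context_alloc and context_reset are hypothetical stand-ins for a memory
context that owns a single allocation, and recover_from_error stands in for
the error-recovery path that resets the context.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the cached connection info. */
typedef struct { int fd; } ConnectionSlot;

/* Toy "memory context" that owns at most one allocation. */
static void *context_block = NULL;

static void *context_alloc(size_t size)
{
    free(context_block);            /* free(NULL) is a no-op */
    context_block = malloc(size);
    return context_block;
}

static void context_reset(void)
{
    free(context_block);
    context_block = NULL;
}

/* Static cache pointing into the context, like the static "slot". */
static ConnectionSlot *slot = NULL;

static ConnectionSlot *get_cached_slot(void)
{
    if (slot == NULL)
    {
        slot = context_alloc(sizeof *slot);
        assert(slot != NULL);
        slot->fd = 42;              /* pretend we opened a connection */
    }
    return slot;
}

/* Error-recovery path: resetting the context frees what slot points to. */
static void recover_from_error(void)
{
    context_reset();
    slot = NULL;    /* without this line, slot would point to freed memory */
}
```

After recover_from_error() the next get_cached_slot() call allocates a fresh
slot instead of reusing the freed one.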
Best regards,
Emond
-------------- next part --------------
A non-text attachment was scrubbed...
Name: segfault2.patch
Type: text/x-patch
Size: 403 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240404/7d389373/attachment.bin>