[pgpool-hackers: 3259] Re: Segfault in a race condition
Tatsuo Ishii
ishii at sraoss.co.jp
Thu Feb 28 08:14:58 JST 2019
> On Wed, 27 Feb 2019 08:25:51 +0900 (JST)
> Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>
>> > Hi,
>> >
>> > I found another race condition in 3.6.15 causing a segfault, which is
>> > reported by our customer.
>> >
>> > On Tue, 08 Jan 2019 17:04:00 +0900 (JST)
>> > Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >
>> >> I found a segfault could happen in a race condition:
>> >>
>> >> 1) frontend tries to connect to Pgpool-II
>> >>
>> >> 2) there's no existing connection cache
>> >>
>> >> 3) try to create new backend connections by calling connect_backend()
>> >>
>> >> 4) inside connect_backend(), pool_create_cp() gets called
>> >>
>> >> 5) pool_create_cp() calls new_connection()
>> >>
>> >> 6) failover occurs and the global backend status is set to down, but
>> >> the pgpool main does not send kill signal to the child process yet
>> >>
>> >> 7) inside new_connection() after checking VALID_BACKEND, it checks the
>> >> global backend status and finds it is set to down status, so that
>> >> it returns without creating new connection slot
>> >>
>> >> 8) connect_backend() continues and accesses the downed connection slot
>> >> because local status says it's alive, which results in a segfault.
>> >
>> > The situation is almost the same to above except that the segfault
>> > occurs in pool_do_auth(). (See backtrace and log below)
>> >
>> > I guess pool_do_auth() was called before Req_info->master_node_id was updated
>> > in failover(), so MASTER_CONNECTION(cp) was referring to the downed connection
>> > and MASTER_CONNECTION(cp)->sp caused the segfault.
>>
>> The situation is different in that the segfault explained in
>> [pgpool-hackers: 3214] was caused by the local node status being too old
>> (the global status was up-to-date), while in this case the global
>> status is not yet updated. So we cannot employ the same fix as before.
>
> Yes, I agree. The problem is the same in the sense that there is an inconsistency
> between the global and private status, but the fix has to be different.
>
>> I think a possible fix would be checking whether Req_info->switching is
>> true before referencing the MASTER_CONNECTION macro. If it is true, refuse
>> to accept the new connection.
>>
>> What do you think?
>
> This way can avoid the segfault, but I think a better way is to raise
> an error only when Req_info->master_node_id (or primary_node_id) changes
> due to a failover while accepting a new connection. Attached is a patch
> taking this approach, and I confirmed that the segfault disappears after
> applying it, in native replication mode.
Why is this way better than checking Req_info->switching?
> # I can reproduce the segfault by executing pcp_detach_node and pcp_attach_node
> # for node 0 repeatedly while running pgbench with the -C option.
> # I used child_max_connections = 5 to make pgpool create new connections frequently.
>
> However, in streaming-replication mode, I found another segfault using the
> same reproduction method! The location was MASTER_CONNECTION or TSTATE, and
> the segfault occurred after pcp_attach_node instead of pcp_detach_node.
>
> My guess is as below:
>
> 1. When a new connection was accepted, the backend of node 0 was down, so slots[0] was NULL.
> 2. While processing a query, failback happened and node 0 became the new primary.
> 3. Then, accessing slots[MASTER_NODE_ID (=0)] caused the segfault.
>
> I don't have a good idea for fixing this yet... Just a report.
Thanks. Looks to be a hard problem...
> The backtrace is here:
> =======
> (gdb) bt
> #0 0x00005651d9733872 in backend_cleanup (backend=<optimized out>, frontend_invalid=frontend_invalid@entry=0 '\000', frontend=0x5651d9a7b960 <child_frontend>) at protocol/child.c:468
> #1 0x00005651d97366bd in do_child (fds=fds@entry=0x5651dac73340) at protocol/child.c:417
> #2 0x00005651d97105a7 in fork_a_child (fds=0x5651dac73340, id=1) at main/pgpool_main.c:659
> #3 0x00005651d97110dd in reaper () at main/pgpool_main.c:2690
> #4 0x00005651d9717f8f in PgpoolMain (discard_status=<optimized out>, clear_memcache_oidmaps=<optimized out>) at main/pgpool_main.c:451
> #5 0x00005651d970eb32 in main (argc=<optimized out>, argv=0x7ffd862f9da8) at main/main.c:349
> (gdb) l
> 463 bool cache_connection = false;
> 464
> 465 if (backend == NULL)
> 466 return false;
> 467
> 468 sp = MASTER_CONNECTION(backend)->sp;
> 469
> 470 /*
> 471 * cach connection if connection cache configuration parameter is enabled
> 472 * and frontend connection is not invalid
> ======
>
> ... and another case:
> ===================
> (gdb) bt
> #0 0x000055e08d299097 in ReadyForQuery (frontend=frontend@entry=0x55e08e961508, backend=backend@entry=0x7fe0fefdda18, send_ready=send_ready@entry=1 '\001',
> cache_commit=cache_commit@entry=1 '\001') at protocol/pool_proto_modules.c:1909
> #1 0x000055e08d29ae70 in ProcessBackendResponse (frontend=frontend@entry=0x55e08e961508, backend=backend@entry=0x7fe0fefdda18, state=state@entry=0x7ffe62767fdc,
> num_fields=num_fields@entry=0x7ffe62767fda) at protocol/pool_proto_modules.c:2904
> #2 0x000055e08d28cef9 in pool_process_query (frontend=0x55e08e961508, backend=0x7fe0fefdda18, reset_request=reset_request@entry=0) at protocol/pool_process_query.c:321
> #3 0x000055e08d2876aa in do_child (fds=fds@entry=0x55e08e960340) at protocol/child.c:414
> #4 0x000055e08d2615a7 in fork_a_child (fds=0x55e08e960340, id=5) at main/pgpool_main.c:659
> #5 0x000055e08d2620dd in reaper () at main/pgpool_main.c:2690
> #6 0x000055e08d268f8f in PgpoolMain (discard_status=<optimized out>, clear_memcache_oidmaps=<optimized out>) at main/pgpool_main.c:451
> #7 0x000055e08d25fb32 in main (argc=<optimized out>, argv=0x7ffe62775508) at main/main.c:349
> (gdb) l
> 1904 return POOL_END;
> 1905
> 1906 /*
> 1907 * Set transaction state for each node
> 1908 */
> 1909 state = TSTATE(backend,
> 1910 MASTER_SLAVE ? PRIMARY_NODE_ID : REAL_MASTER_NODE_ID);
> 1911
> 1912 for (i = 0; i < NUM_BACKENDS; i++)
> 1913 {
> =========
>
>
>
>
>> > Here is the backtrace from core:
>> > =================================
>> > Core was generated by `pgpool: accept connection '.
>> > Program terminated with signal 11, Segmentation fault.
>> > #0 0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18)
>> > at auth/pool_auth.c:77
>> > 77 protoMajor = MASTER_CONNECTION(cp)->sp->major;
>> > Missing separate debuginfos, use: debuginfo-install libmemcached-0.31-1.1.el6.x86_64
>> > (gdb) bt
>> > #0 0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18)
>> > at auth/pool_auth.c:77
>> > #1 0x000000000042377f in connect_backend (sp=0x167ae78, frontend=0x1678f28)
>> > at protocol/child.c:954
>> > #2 0x0000000000423fdd in get_backend_connection (frontend=0x1678f28)
>> > at protocol/child.c:2396
>> > #3 0x0000000000424b94 in do_child (fds=0x16584f0) at protocol/child.c:337
>> > #4 0x000000000040682d in fork_a_child (fds=0x16584f0, id=372)
>> > at main/pgpool_main.c:758
>> > #5 0x0000000000409941 in failover () at main/pgpool_main.c:2102
>> > #6 0x000000000040cb40 in PgpoolMain (discard_status=<value optimized out>,
>> > clear_memcache_oidmaps=<value optimized out>) at main/pgpool_main.c:476
>> > #7 0x0000000000405c44 in main (argc=<value optimized out>,
>> > argv=<value optimized out>) at main/main.c:317
>> > (gdb) l
>> > 72 int authkind;
>> > 73 int i;
>> > 74 StartupPacket *sp;
>> > 75
>> > 76
>> > 77 protoMajor = MASTER_CONNECTION(cp)->sp->major;
>> > 78
>> > 79 kind = pool_read_kind(cp);
>> > 80 if (kind < 0)
>> > 81 ereport(ERROR,
>> > =======================================
>> >
>> > Here is a snippet of the pgpool log. PID 5067 has a segfault.
>> > ==================
>> > (snip)
>> > 2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG: starting degeneration. shutdown host xxxxxxxx(xxxx)
>> > 2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG: Restart all children
>> > 2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: LOG: new connection received
>> > 2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: DETAIL: connecting host=xxxxxx port=xxxx
>> > (snip)
>> > 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5066 exits with status 0
>> > 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5066 exited with success and will not be restarted
>> > 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: WARNING: child process with pid: 5067 was terminated by segmentation fault
>> > 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5067 exited with success and will not be restarted
>> > 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5068 exits with status 0
>> > 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5068 exited with success and will not be restarted
>> > (snip)
>> > ===================
>> >
>> >
>> >
>> > Regards,
>> > --
>> > Yugo Nagata <nagata at sraoss.co.jp>
>
>
> --
> Yugo Nagata <nagata at sraoss.co.jp>