[pgpool-hackers: 3256] Re: Segfault in a race condition
Tatsuo Ishii
ishii at sraoss.co.jp
Wed Feb 27 08:25:51 JST 2019
> Hi,
>
> I found another race condition in 3.6.15 causing a segfault, which was
> reported by our customer.
>
> On Tue, 08 Jan 2019 17:04:00 +0900 (JST)
> Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>
>> I found a segfault could happen in a race condition:
>>
>> 1) frontend tries to connect to Pgpool-II
>>
>> 2) there's no existing connection cache
>>
>> 3) try to create new backend connections by calling connect_backend()
>>
>> 4) inside connect_backend(), pool_create_cp() gets called
>>
>> 5) pool_create_cp() calls new_connection()
>>
>> 6) failover occurs and the global backend status is set to down, but
>> the pgpool main does not send kill signal to the child process yet
>>
>> 7) inside new_connection() after checking VALID_BACKEND, it checks the
>> global backend status and finds it is set to down status, so that
>> it returns without creating new connection slot
>>
>> 8) connect_backend() continues and accesses the downed connection slot
>> because local status says it's alive, which results in a segfault.
>
> The situation is almost the same as above, except that the segfault
> occurs in pool_do_auth(). (See backtrace and log below)
>
> I guess pool_do_auth() was called before Req_info->master_node_id was updated
> in failover(), so MASTER_CONNECTION(cp) was referring to the downed connection
> and MASTER_CONNECTION(cp)->sp caused the segfault.

The situation is different: the segfault explained in
[pgpool-hackers: 3214] was caused by the local node status being too
old (the global status was up to date), while in this case the global
status has not been updated yet. So we cannot employ the same fix as
before.

I think a possible fix would be to check whether Req_info->switching
is true before referring to the MASTER_CONNECTION macro. If it is
true, refuse to accept the new connection.

What do you think?

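
The check proposed above could look roughly like the following minimal
sketch. It is not the actual Pgpool-II code: only Req_info->switching,
Req_info->master_node_id and the idea of the MASTER_CONNECTION macro come
from this thread. The struct layouts, the *_Sketch names, and
get_proto_major_guarded() are hypothetical stand-ins, and the extra NULL
check is an assumption added for safety rather than part of the proposal.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BACKENDS 2

/* Hypothetical, simplified stand-ins for the real Pgpool-II structures. */
typedef struct { int major; } StartupPacketSketch;
typedef struct { StartupPacketSketch *sp; } ConnSlotSketch;
typedef struct { ConnSlotSketch *slots[MAX_BACKENDS]; } ConnPoolSketch;

typedef struct
{
    volatile bool switching;      /* true while failover is in progress */
    volatile int  master_node_id; /* may be stale until failover finishes */
} ReqInfoSketch;

static ReqInfoSketch req_info_storage = { false, 0 };
static ReqInfoSketch *Req_info = &req_info_storage;

/* Sketch of the macro: pick the slot of the current master node. */
#define MASTER_CONNECTION_SKETCH(cp) ((cp)->slots[Req_info->master_node_id])

/*
 * Returns 0 and stores the protocol major version in *major on success.
 * Returns -1 when the connection should be refused, either because a
 * failover is in progress or because the master slot is not usable.
 */
static int
get_proto_major_guarded(ConnPoolSketch *cp, int *major)
{
    ConnSlotSketch *slot;

    /* Proposed check: do not touch the master slot during failover. */
    if (Req_info->switching)
        return -1;

    /* Assumed extra guard: the slot may never have been created. */
    slot = MASTER_CONNECTION_SKETCH(cp);
    if (slot == NULL || slot->sp == NULL)
        return -1;

    *major = slot->sp->major;
    return 0;
}
```

In this sketch the caller would report an error to the frontend and close
the connection when -1 is returned, instead of dereferencing the slot.
Since nothing here takes a lock, the flag could still change between the
check and the dereference, so this narrows the window rather than
provably closing it.
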
> Here is the backtrace from core:
> =================================
> Core was generated by `pgpool: accept connection '.
> Program terminated with signal 11, Segmentation fault.
> #0 0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18)
> at auth/pool_auth.c:77
> 77 protoMajor = MASTER_CONNECTION(cp)->sp->major;
> Missing separate debuginfos, use: debuginfo-install libmemcached-0.31-1.1.el6.x86_64
> (gdb) bt
> #0 0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18)
> at auth/pool_auth.c:77
> #1 0x000000000042377f in connect_backend (sp=0x167ae78, frontend=0x1678f28)
> at protocol/child.c:954
> #2 0x0000000000423fdd in get_backend_connection (frontend=0x1678f28)
> at protocol/child.c:2396
> #3 0x0000000000424b94 in do_child (fds=0x16584f0) at protocol/child.c:337
> #4 0x000000000040682d in fork_a_child (fds=0x16584f0, id=372)
> at main/pgpool_main.c:758
> #5 0x0000000000409941 in failover () at main/pgpool_main.c:2102
> #6 0x000000000040cb40 in PgpoolMain (discard_status=<value optimized out>,
> clear_memcache_oidmaps=<value optimized out>) at main/pgpool_main.c:476
> #7 0x0000000000405c44 in main (argc=<value optimized out>,
> argv=<value optimized out>) at main/main.c:317
> (gdb) l
> 72 int authkind;
> 73 int i;
> 74 StartupPacket *sp;
> 75
> 76
> 77 protoMajor = MASTER_CONNECTION(cp)->sp->major;
> 78
> 79 kind = pool_read_kind(cp);
> 80 if (kind < 0)
> 81 ereport(ERROR,
> =================================
>
> Here is a snippet of the pgpool log. PID 5067 has a segfault.
> ==================
> (snip)
> 2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG: starting degeneration. shutdown host xxxxxxxx(xxxx)
> 2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG: Restart all children
> 2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: LOG: new connection received
> 2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: DETAIL: connecting host=xxxxxx port=xxxx
> (snip)
> 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5066 exits with status 0
> 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5066 exited with success and will not be restarted
> 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: WARNING: child process with pid: 5067 was terminated by segmentation fault
> 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5067 exited with success and will not be restarted
> 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5068 exits with status 0
> 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5068 exited with success and will not be restarted
> (snip)
> ===================
>
>
>
> Regards,
> --
> Yugo Nagata <nagata at sraoss.co.jp>