[pgpool-hackers: 3171] Re: Query hang issue with extended protocol
Tatsuo Ishii
ishii at sraoss.co.jp
Fri Dec 7 07:49:29 JST 2018
Hi Usama,
I need:
- exact Pgpool-II version
- pgpool.conf
- gdb print command result of *pool_session_context when Pgpool-II hangs
Also a full debug log will be helpful.
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
> Hi Ishii San
>
> Thanks for looking into this. As discussed over the call, please find
> attached the pgpool log and stack trace for the hang.
> I am currently not able to trim down the test case, so sorry for throwing a
> huge log file at you.
> There is no special configuration on the PostgreSQL server side, and this
> issue only happens when the load balance node and the primary node are the
> same. As you can see from the stack trace, the hang happens in the
> read_kind_from_backend() function, because the function is trying to read a
> response from backend node 1 (the standby) while it has only sent the query
> to backend node 0 (the primary).
>
> --stack-trace--
> #0  0x00007f1775216c53 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
> #1  0x000000000042def6 in pool_check_fd (cp=cp@entry=0x7f1776fb2328) at protocol/pool_process_query.c:696
> #2  0x000000000042e1c6 in pool_check_fd (cp=cp@entry=0x7f1776fb2328) at protocol/pool_process_query.c:728
> #3  0x000000000045f92c in pool_read (cp=0x7f1776fb2328, buf=buf@entry=0x7ffebb5f2345, len=len@entry=1) at utils/pool_stream.c:166
> #4  0x0000000000432eab in read_kind_from_backend (frontend=frontend@entry=0x2292e88, backend=backend@entry=0x7f1776fad6d8, decided_kind=decided_kind@entry=0x7ffebb5f274a "D") at protocol/pool_process_query.c:3379
> #5  0x000000000043f232 in ProcessBackendResponse (frontend=frontend@entry=0x2292e88, backend=backend@entry=0x7f1776fad6d8, state=state@entry=0x7ffebb5f27b4, num_fields=num_fields@entry=0x7ffebb5f27b2) at protocol/pool_proto_modules.c:2653
> #6  0x0000000000431d17 in pool_process_query (frontend=0x2292e88, backend=0x7f1776fad6d8, reset_request=reset_request@entry=0) at protocol/pool_process_query.c:320
> #7  0x000000000042cbb9 in do_child (fds=fds@entry=0x2291370) at protocol/child.c:387
> #8  0x00000000004090a5 in fork_a_child (fds=0x2291370, id=2) at main/pgpool_main.c:659
> #9  0x000000000040fb73 in PgpoolMain (discard_status=discard_status@entry=0 '\000', clear_memcache_oidmaps=clear_memcache_oidmaps@entry=0 '\000') at main/pgpool_main.c:378
> #10 0x0000000000407777 in main (argc=<optimized out>, argv=<optimized out>) at main/main.c:361
>
> Thanks
> Best Regards
> Muhammad Usama
>
>
>
> On Thu, Dec 6, 2018 at 3:52 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>
>> > Hi Ishii San
>> >
>> > I was investigating an issue faced by one of the Pgpool-II users, who
>> > reported that after executing a certain transaction (a large amount of
>> > DELETE / INSERT / UPDATE statements followed by COMMIT), the next
>> > execution of a BEGIN statement hangs (using the extended query protocol).
>> >
>> > I think the problem is in the pool_is_node_to_be_sent_in_current_query()
>> > function. In streaming replication mode, when pool_is_query_in_progress
>> > is not set, the function always returns true. Now consider that we have
>> > two backend servers, one primary and one standby, and for a particular
>> > session the primary node gets selected as the load balance node. That
>> > session should effectively send queries to, and read responses from, the
>> > primary server alone. But with the current implementation of
>> > pool_is_node_to_be_sent_in_current_query(), the VALID_BACKEND macro will
>> > return true for the standby node as well when the query-in-progress flag
>> > is not set (which is wrong when the standby node is not selected for
>> > load balancing). If this happens in, say, the read_kind_from_backend()
>> > function, we get stuck, since read_kind_from_backend() will be trying to
>> > read a response from the standby node while we have never sent any query
>> > to that node.
>> >
>> >
>> > Can you please take a look at the attached patch? I am not sure whether
>> > it will have any side effects.
>>
>> Unfortunately the patch would not help, and it would probably break many
>> cases that work today.
>>
>> The scenario you mentioned is already handled by the
>> pool_pending_message_* functions defined in context/query_context.c.
>> When a message, for example Parse, is sent from the frontend and
>> forwarded to the backends, the info is recorded in a FIFO queue by
>> pool_pending_message_add(), along with which backends the message was
>> forwarded to. When the backend replies to the Parse message with a
>> "parse complete" message, read_kind_from_backend() can tell which
>> backend it should read a message from by looking at the head of the
>> queue. Once the reply is retrieved (in this example, the parse complete
>> message), the corresponding entry in the queue (in this example, the
>> Parse message) is pulled out and removed.
>>
>> This mechanism usually works well unless a backend sends an unexpected
>> message. Recently one such case was reported:
>> https://www.pgpool.net/mantisbt/view.php?id=448
>>
>> In that report, PostgreSQL sends an error message, triggered by the
>> idle_in_transaction_session_timeout setting, which Pgpool-II does not
>> expect.
>>
>> Does the user have any uncommon settings in postgresql.conf?
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>