[pgpool-general: 2783] Child process hangs in active state
Junegunn Choi
junegunn.c at gmail.com
Mon Apr 21 15:28:52 JST 2014
Hi, we've run into a very strange situation where a child process hangs in
active state.
*** Versions
- os: RHEL 5.7
- pgpool: 3.3.2
- backend: EnterpriseDB 9.3.1.3 on x86_64
- EDB JDBC driver
*** show pool_pools;
 pool_pid |     start_time      | pool_id | backend_id | database |   username   |     create_time     | majorversion | minorversion | pool_counter | pool_backendpid | pool_connected
----------+---------------------+---------+------------+----------+--------------+---------------------+--------------+--------------+--------------+-----------------+----------------
    26462 | 2014-04-18 17:10:58 |       0 |          0 | cloud    | enterprisedb | 2014-04-18 17:10:58 |            3 |            0 |        11518 |           20546 |              1
    26462 | 2014-04-18 17:10:58 |       0 |          1 | cloud    | enterprisedb | 2014-04-18 17:10:58 |            3 |            0 |        11518 |           29160 |              1
    26462 | 2014-04-18 17:10:58 |       1 |          0 |          |              |                     |            0 |            0 |            0 |               0 |              0
    26462 | 2014-04-18 17:10:58 |       1 |          1 |          |              |                     |            0 |            0 |            0 |               0 |              0
    26462 | 2014-04-18 17:10:58 |       2 |          0 |          |              |                     |            0 |            0 |            0 |               0 |              0
    26462 | 2014-04-18 17:10:58 |       2 |          1 |          |              |                     |            0 |            0 |            0 |               0 |              0
    26462 | 2014-04-18 17:10:58 |       3 |          0 |          |              |                     |            0 |            0 |            0 |               0 |              0
    26462 | 2014-04-18 17:10:58 |       3 |          1 |          |              |                     |            0 |            0 |            0 |               0 |              0
*** pg_stat_activity
edb=# select * from pg_stat_activity where pid = 20546;
 datid | datname |  pid  | usesysid |   usename    | application_name | client_addr  | client_hostname | client_port |          backend_start           |           xact_start            |           query_start           |           state_change           | waiting | state  |           query
-------+---------+-------+----------+--------------+------------------+--------------+-----------------+-------------+----------------------------------+---------------------------------+---------------------------------+----------------------------------+---------+--------+----------------------------
 16391 | cloud   | 20546 |       10 | enterprisedb |                  | 10.12.240.94 |                 |       30104 | 18-APR-14 17:10:58.329551 +09:00 | 18-APR-14 17:14:36.83873 +09:00 | 18-APR-14 17:14:36.83873 +09:00 | 18-APR-14 17:14:36.838732 +09:00 | f       | active | SET extra_float_digits = 3
*** ps output (pgpool server)
pgpool   32395  0.0  0.0  24160 4848 ?  S  Apr15  0:11  \_ /db/pgpool/bin/pgpool -n -D -f /db/pgpool/conf/pgm/pgpool.conf -F /db/pgpool/conf/pgm/pcp.conf -a /db/pgpool/conf/pgm/pool_hba.conf
pgpool   18534  0.0  0.0  24160  704 ?  S  Apr18  0:00   | \_ pgpool: PCP: wait for connection request
pgpool   18535  0.0  0.0  24160 1076 ?  S  Apr18  0:03   | \_ pgpool: worker process
pgpool   26462  0.0  0.0 830632 9032 ?  S  Apr18  0:10   | \_ pgpool: enterprisedb cloud ---------(redacted)---------(15878) idle
pgpool   11259  0.0  0.0  24160  876 ?  S  13:31  0:00   | \_ pgpool: wait for connection request
pgpool   11261  0.0  0.0  24160  864 ?  S  13:31  0:00   | \_ pgpool: wait for connection request
pgpool   11262  0.0  0.0  24160  864 ?  S  13:31  0:00   | \_ pgpool: wait for connection request
pgpool   11263  0.0  0.0  24160  864 ?  S  13:31  0:00   | \_ pgpool: wait for connection request
...
*** ps output (backend)
$ ps auxf | grep 20546
hanadmin 12046  0.0  0.0   61188  744 pts/8 S+ 15:07  0:00          \_ grep 20546
541      20546  0.0  0.0 8761172 7448 ?     Ss Apr18  0:07  \_ postgres: enterprisedb cloud 10.12.240.94[30104] PARSE
*** gdb trace
(gdb) bt
#0  0x00000033308cd6d3 in __select_nocancel () from /lib64/libc.so.6
#1  0x0000000000418c71 in pool_check_fd (cp=<value optimized out>) at pool_process_query.c:951
#2  0x000000000041d0e4 in pool_read2 (cp=0xbadfc70, len=822083579) at pool_stream.c:264
#3  0x0000000000418803 in SimpleForwardToFrontend (kind=49 '1', frontend=0xbaf1180, backend=0xbad8330) at pool_process_query.c:1136
#4  0x00000000004499bb in ProcessBackendResponse (frontend=0xbaf1180, backend=0xbad8330, state=0x7fff94626060, num_fields=0x7fff94626066) at pool_proto_modules.c:2670
#5  0x000000000041b69d in pool_process_query (frontend=0xbaf1180, backend=0xbad8330, reset_request=0) at pool_process_query.c:289
#6  0x000000000040ad8a in do_child (unix_fd=4, inet_fd=5) at child.c:355
#7  0x000000000040456f in fork_a_child (unix_fd=4, inet_fd=5, id=0) at main.c:1258
#8  0x0000000000404897 in reaper () at main.c:2482
#9  0x0000000000404c25 in pool_sleep (second=<value optimized out>) at main.c:2679
#10 0x0000000000407a0a in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:856
(gdb) up
#1  0x0000000000418c71 in pool_check_fd (cp=<value optimized out>) at pool_process_query.c:951
951             fds = select(fd+1, &readmask, NULL, &exceptmask, timeoutp);
(gdb)
#2  0x000000000041d0e4 in pool_read2 (cp=0xbadfc70, len=822083579) at pool_stream.c:264
264             if (pool_check_fd(cp))
(gdb)
#3  0x0000000000418803 in SimpleForwardToFrontend (kind=49 '1', frontend=0xbaf1180, backend=0xbad8330) at pool_process_query.c:1136
1136                    p = pool_read2(CONNECTION(backend, i), len);
(gdb) p len
$1 = 822083580
(gdb) p *backend->slots[i].con
$3 = {fd = 10, wbuf = 0xbae01d0 "P", wbufsz = 8192, wbufpo = 0, ssl_active = -1,
  hp = 0xbae21e0 "1", po = 0, bufsz = 1024, len = 0, sbuf = 0x0, sbufsz = 0,
  buf2 = 0x2aaf32c92010 "\004OMMIT", bufsz2 = 822083584, buf3 = 0x0, bufsz3 = 0,
  isbackend = 1, db_node_id = 1, tstate = 73 'I',
  is_internal_transaction_started = 0 '\000', auth_kind = 0, pwd_size = 0,
  password = '\000' <repeats 1023 times>, salt = "\000\000\000",
  params = {num = 0, names = 0xbae25f0, values = 0xbae2a00}, no_forward = 0,
  kind = 0 '\000', protoVersion = 0,
  raddr = {addr = {ss_family = 0, __ss_align = 0,
    __ss_padding = '\000' <repeats 111 times>}, salen = 0},
  auth_method = uaReject, auth_arg = 0x0, database = 0x0, username = 0x0}
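
To make the hang easier to follow, here is a rough sketch of what that read path amounts to. This is only my reading of the backtrace, not the actual pgpool source; read_n() and read_backend_message() are invented names for illustration:

/* Rough sketch of the forwarding read path seen in the backtrace above.
 * NOT the real pgpool code -- read_n() and read_backend_message() are
 * made-up names; the point is only how the 4-byte length drives the
 * blocking read. */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>

/* Block until exactly n bytes have been read (like pool_read/pool_read2). */
static int read_n(int fd, void *buf, size_t n)
{
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, (char *)buf + got, n - got);
        if (r <= 0)
            return -1;
        got += (size_t)r;
    }
    return 0;
}

/* Read one v3-protocol message from the backend: a 1-byte kind,
 * a 4-byte big-endian length (which includes itself), then the payload. */
static char *read_backend_message(int backend_fd, char *kind, int *len)
{
    uint32_t netlen;
    char    *payload;

    if (read_n(backend_fd, kind, 1) < 0)       /* e.g. '1' = ParseComplete */
        return NULL;
    if (read_n(backend_fd, &netlen, 4) < 0)    /* length, network byte order */
        return NULL;
    *len = (int)ntohl(netlen) - 4;             /* payload size */

    /* If the stream is ever misaligned, these four "length" bytes are
     * really something else, *len becomes garbage (~800MB here), and the
     * blocking read below waits for data the backend will never send --
     * which looks exactly like pool_read2()/pool_check_fd() sitting in
     * select() in the backtrace. */
    payload = malloc(*len > 0 ? (size_t)*len : 1);
    if (payload == NULL)
        return NULL;
    if (*len > 0 && read_n(backend_fd, payload, (size_t)*len) < 0) {
        free(payload);
        return NULL;
    }
    return payload;
}

With a sane length the read returns almost immediately; with a garbage length the child blocks indefinitely, which is what we see.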
***
`len` and `bufsz2` being over 800MB looks extremely odd. Could it be a bug in pgpool?
The ps output also shows that this child process alone has allocated over 800MB of address space.
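
For what it's worth, here is how I read that number (just a guess on my part, not anything confirmed in the pgpool source): 822083584 is 0x31000000 in hex, and the top byte 0x31 is ASCII '1', the same ParseComplete kind byte that appears as kind=49 '1' in the backtrace. On the wire a ParseComplete message is the byte '1' followed by the length 0x00000004, so if pgpool's view of the stream were off by one byte, it would read the bytes 31 00 00 00 as a 4-byte length field, i.e. exactly 822083584. That would mean the pgpool/backend stream got misaligned at some point and the child is now waiting for ~800MB that will never arrive. A trivial check of the arithmetic:

/* Quick decode of the suspicious length from gdb (illustration only). */
#include <stdio.h>

int main(void)
{
    unsigned int raw = 822083584u;   /* bufsz2 from the gdb dump; len was raw - 4 */

    printf("raw = 0x%08x\n", raw);                  /* prints 0x31000000 */
    printf("high byte = 0x%02x = '%c'\n",
           raw >> 24, (int)(raw >> 24));            /* 0x31 = '1' (ParseComplete) */
    return 0;
}

buf2 starting with "\004OMMIT" (a stray length byte plus the tail of what looks like a COMMIT command tag) seems consistent with the same misalignment, but again this is only a guess.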
This has happened to us several times, but we have no idea what triggers it, and so far we have simply had to kill the process.
Any ideas?
--
cheers,
junegunn choi.