[pgpool-hackers: 2545] (no subject)
Tatsuo Ishii
ishii at sraoss.co.jp
Thu Sep 21 15:18:27 JST 2017
To test a problem reported in [pgpool-general: 5728], I was running
./startall failover and ./shutdown loop in a cluster created by
pgpool_setup. I found sometimes ./shutdown doesn't finish, and the
cause was pgpool parent waiting for pcp_main finishes. I attached gdb
to the pcp_main process to see what's going on. Here's the backtrace.
(gdb) bt
#0 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007f78d47eac76 in __tz_convert (timer=0x7f78d4af3ac0 <tzset_lock>, timer at entry=0x7ffd01c8fbe8,
use_localtime=use_localtime at entry=1, tp=tp at entry=0x7f78d4af64a0 <_tmbuf>) at tzset.c:616
#2 0x00007f78d47e8941 in __GI_localtime (t=t at entry=0x7ffd01c8fbe8) at localtime.c:39
#3 0x000000000045c386 in log_line_prefix (buf=buf at entry=0x7ffd01c8fcc0, line_prefix=<optimized out>,
edata=0x75d9a0 <errordata+384>) at ../../src/utils/error/elog.c:2059
#4 0x000000000045efb5 in send_message_to_server_log (edata=0x75d9a0 <errordata+384>)
at ../../src/utils/error/elog.c:2084
#5 EmitErrorReport () at ../../src/utils/error/elog.c:1129
#6 0x000000000045d13e in errfinish (dummy=<optimized out>) at ../../src/utils/error/elog.c:434
#7 0x00000000004094d5 in exit_handler (sig=15) at main/pgpool_main.c:1415
#8 <signal handler called>
#9 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93
#10 0x00007f78d47eac76 in __tz_convert (timer=0x7f78d4af3ac0 <tzset_lock>, timer at entry=0x7ffd01c90428,
use_localtime=use_localtime at entry=1, tp=tp at entry=0x7f78d4af64a0 <_tmbuf>) at tzset.c:616
#11 0x00007f78d47e8941 in __GI_localtime (t=t at entry=0x7ffd01c90428) at localtime.c:39
#12 0x000000000045c386 in log_line_prefix (buf=buf at entry=0x7ffd01c90500, line_prefix=<optimized out>,
edata=0x75d8e0 <errordata+192>) at ../../src/utils/error/elog.c:2059
#13 0x000000000045efb5 in send_message_to_server_log (edata=0x75d8e0 <errordata+192>)
at ../../src/utils/error/elog.c:2084
#14 EmitErrorReport () at ../../src/utils/error/elog.c:1129
#15 0x000000000045d13e in errfinish (dummy=<optimized out>) at ../../src/utils/error/elog.c:434
#16 0x00000000004094d5 in exit_handler (sig=2) at main/pgpool_main.c:1415
#17 <signal handler called>
#18 __tzfile_compute (timer=1505947017, use_localtime=use_localtime at entry=1,
leap_correct=leap_correct at entry=0x7ffd01c90bb8, leap_hit=leap_hit at entry=0x7ffd01c90bb4,
tp=tp at entry=0x7f78d4af64a0 <_tmbuf>) at tzfile.c:648
#19 0x00007f78d47ead7f in __tz_convert (timer=timer at entry=0x7ffd01c90c08,
use_localtime=use_localtime at entry=1, tp=tp at entry=0x7f78d4af64a0 <_tmbuf>) at tzset.c:624
#20 0x00007f78d47e8941 in __GI_localtime (t=t at entry=0x7ffd01c90c08) at localtime.c:39
#21 0x000000000045c386 in log_line_prefix (buf=buf at entry=0x7ffd01c90ce0, line_prefix=<optimized out>,
edata=0x75d820 <errordata>) at ../../src/utils/error/elog.c:2059
#22 0x000000000045efb5 in send_message_to_server_log (edata=0x75d820 <errordata>)
at ../../src/utils/error/elog.c:2084
#23 EmitErrorReport () at ../../src/utils/error/elog.c:1129
#24 0x000000000045d13e in errfinish (dummy=<optimized out>) at ../../src/utils/error/elog.c:434
#25 0x0000000000416415 in pcp_main (unix_fd=unix_fd at entry=8, inet_fd=inet_fd at entry=9)
at pcp_con/pcp_child.c:104
#26 0x000000000040a7b4 in pcp_fork_a_child (unix_fd=8, inet_fd=9, pcp_conf_file=<optimized out>)
at main/pgpool_main.c:709
#27 0x000000000040d073 in failover () at main/pgpool_main.c:2221
#28 0x000000000040d84e in sigusr1_interrupt_processor () at main/pgpool_main.c:1588
#29 0x000000000040da2d in pool_sleep (second=<optimized out>) at main/pgpool_main.c:2752
#30 0x000000000040fc37 in PgpoolMain (discard_status=discard_status at entry=1 '\001',
clear_memcache_oidmaps=clear_memcache_oidmaps at entry=0 '\000') at main/pgpool_main.c:536
#31 0x00000000004081d5 in main (argc=<optimized out>, argv=<optimized out>) at main/main.c:313
It seems failover() in pgpool main forks pcp_main, then it is
immediately interrupted at line 104 (this is Pgpool-II 3.6.6):
ereport(DEBUG1,
(errmsg("I am PCP child with pid:%d",getpid())));
then jump to the signal handler: exit_handler in *pgpool main* because
it was interrupted by signal 2 (SIGINT by fast shutdown). In the
signal handler, ereport was called and ereport called tz_convert which
acquired a lock. While the process holding the lock, it was interrupted
again by signal 15 (SIGTERM), which I don't know why. Anyway then the
signal handler called ereport thus tz_convert again, which waited
forever for acquiring a lock because the lock was already acquired. I
saw multiple problems here.
1) pcp main should not call ereport until their own exception handler
is established. I think we can just move the ereport call after the
exit handler established. i.e. After this:
/* We can now handle ereport(ERROR) */
PG_exception_stack = &local_sigjmp_buf;
2) We should eliminate calling ereport/elog calls in signal handlers
to prevent this kind of tz_convert race condition.
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
More information about the pgpool-hackers
mailing list