[pgpool-hackers: 3361] Overhaul node failure debugging aid

Tatsuo Ishii ishii at sraoss.co.jp
Fri Aug 2 15:30:32 JST 2019


check_backend_down_request() in health_check.c is intended to simulate
a communication failure between the health check process and a
PostgreSQL backend node. It is triggered by creating a file containing
lines such as:

1	down

where the first field is the node id (starting from 0), followed by a
tab and "down". When the health check process finds the file, it makes
the health check fail on node 1.
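
To make the file format concrete, here is a minimal sketch of how such
lines could be parsed. This is not the code in health_check.c; the
function names parse_down_request_line() and is_down_requested() are
made up for illustration, and the file contents are passed in as a
string to keep the example self-contained.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not pgpool's code): parse one line of the form
 * "<node id><tab><status>". Returns 1 on success, 0 otherwise.
 * Note: "\t" in a scanf format matches any run of whitespace.
 */
int parse_down_request_line(const char *line, int *node_id,
                            char *status, size_t status_len)
{
    char buf[64];

    assert(line != NULL && node_id != NULL && status != NULL);
    if (status_len == 0)
        return 0;
    if (sscanf(line, "%d\t%63s", node_id, buf) != 2)
        return 0;
    snprintf(status, status_len, "%s", buf);
    return 1;
}

/*
 * Returns 1 if the given file contents request that node_id be
 * treated as down, i.e. there is a line "<node_id>\tdown".
 */
int is_down_requested(const char *contents, int node_id)
{
    const char *p = contents;

    assert(contents != NULL);
    while (*p != '\0')
    {
        char   line[128];
        char   status[64];
        int    id;
        size_t len = strcspn(p, "\n");     /* length of current line */
        size_t n = len < sizeof(line) - 1 ? len : sizeof(line) - 1;

        memcpy(line, p, n);
        line[n] = '\0';

        if (parse_down_request_line(line, &id, status, sizeof(status)) &&
            id == node_id && strcmp(status, "down") == 0)
            return 1;

        p += len;
        if (*p == '\n')
            p++;                           /* skip the newline */
    }
    return 0;
}
```

With the contents "1\tdown\n", is_down_requested() reports a down
request for node 1 but not for node 0, and a line whose status is
"already_down" is not treated as a request.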

After the health check brings the node into down status,
check_backend_down_request() changes "down" to "already_down" to
prevent the node failure from being raised repeatedly.

However, the question is whether this is necessary at all. I think
check_backend_down_request() should keep on reporting the down status,
and it should be called inside establish_persistent_connection(),
because the failing situation can be simulated better this way. For
example, currently the health check retry is not simulated, but the
new way can do it.

Moreover, in the current watchdog implementation, bringing a node into
quarantine state requires node communication errors to be detected
*two* times. Since check_backend_down_request() allows the node down
event to be raised only *once* (after that, the down state is changed
to already_down state), it's impossible to test the watchdog
quarantine using check_backend_down_request(). I changed
check_backend_down_request() so that it continues to raise the "down"
event as long as the down request file exists.
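
The difference can be shown with a toy model. This is only a sketch
and not pgpool or watchdog code: struct down_request, check_old(),
check_new() and count_detections() are names invented for this
example, and the threshold of two detections is taken from the
description above.

```c
#include <assert.h>
#include <string.h>

/* From the text: two error detections are needed for quarantine. */
#define QUARANTINE_THRESHOLD 2

/* Toy model of the down request entry for a single node. */
struct down_request
{
    char status[16];
};

/*
 * Old behavior: report the failure once, then rewrite "down" to
 * "already_down" so that later checks report nothing.
 */
int check_old(struct down_request *req)
{
    assert(req != NULL);
    if (strcmp(req->status, "down") == 0)
    {
        strcpy(req->status, "already_down");
        return 1;
    }
    return 0;
}

/* New behavior: keep reporting as long as the entry says "down". */
int check_new(const struct down_request *req)
{
    assert(req != NULL);
    return strcmp(req->status, "down") == 0;
}

/* Count how many failures are detected over a number of check rounds. */
int count_detections(int rounds, int use_new)
{
    struct down_request req = { "down" };
    int detected = 0;

    for (int i = 0; i < rounds; i++)
        detected += use_new ? check_new(&req) : check_old(&req);
    return detected;
}
```

In this model the old behavior yields a single detection no matter how
many rounds are run, so QUARANTINE_THRESHOLD is never reached, while
the new behavior yields one detection per round.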

The attached patch tries to enhance check_backend_down_request() as
described above.

1) The caller of check_backend_down_request() is now
   establish_persistent_connection(), rather than
   do_health_check_child().

2) check_backend_down_request() no longer changes "down" to
   "already_down". This means that the second argument of
   check_backend_down_request() is no longer useful. I should probably
   remove the argument later on.

If there are no objections, I will commit/push this by the end of this
weekend.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: health_check.diff
Type: text/x-patch
Size: 2023 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-hackers/attachments/20190802/5622b867/attachment.bin>

