[pgpool-hackers: 4510] Re: Watchdog and IPv6

Tatsuo Ishii ishii at postgresql.org
Thu Aug 22 17:13:25 JST 2024


Hi Kwangwon,

Thank you for the updated patches.

> Thanks for all your considerate reviewing and attention. (really appreciate it)
> 
> I had started fixing it just after receiving the first mail. 
> 
> I also want to reply with actual patch files with some progress so that was the reason of big delay in responding to the mail. sorry about that.
> 
> anyway, about some progress.
> 
> -- following advices.
> indentation, %m, compile warning, and so on..
> Yes, I checked again so as try to not to make a mistake again...

> -- about regression tests
> 028.watchdog_enable_consensus_with_half_votes
> yes. timeout.. your tip is exactly the point. so I could fix that.

Great. I have succeeded all regression tests including 028 this time.

> 036.trusted_servers...
> yes. It's covered now.

This suceeded too.

> 004, 011, 012, 013, 014, 015 is covered both localhost and ::1.
> 
> -- some stories (you can ignore this)
> Recently, I reinstalled my devices and just installed ssh library without actual executable binary.
> regress 28 executes pgpool_recovery() and inside the script, starting with the ssh command.
> That was one of the reasons for spending some time.
> 
> And the other reason was, I assigned the wrong port number. what a silly me.
> It was just one line. but I spent too much time trying to find the reason for it. C APIs and Memorycontext mechanism. that's far away from the actual reason.
> 
> -- about attachment (patch files)
> whitespaces checked and created with -v option. This is the 4th submission.
> 
> v4-0001-feature-ipv6-for-watchdog.patch
> Actual patch file. some detailed messages for additional supplementary inside. It's just for an explanation so you can erase that message.
> 
> v4-0002-test-localhost-to-1.patch
> It's just for test purposes. replace address from 'localhost' to '::1' on the regression tests. You can ignore this patch file.

I have tested the patch. It worked well here.

> well, anyway, It's not finished until actually finished.
> 
> So how about this patch? you can say anything at ease..

I forgot to say in the last email. It's very subtle point...

+		ereport(ERROR,
+				(errmsg("failed to get socket data from heartbeat receive socket list")
+				 ,errdetail("select() failed with reason %m")));
+	}
+	else
+	{
+		ereport(ERROR,
+				(errmsg("failed to get socket data from heartbeat receive socket list")
+				 ,errdetail("select() got timeout, exceed %d sec(s)", timeout_sec)));

We usually place "," at the end of the line, not the beginning of the
line. I mean following style is better.

+				(errmsg("failed to get socket data from heartbeat receive socket list"),
+				 errdetail("select() got timeout, exceed %d sec(s)", timeout_sec)));


> ---
> 
> Regards
> Kwangwon Seo
> 
> On Thursday, August 8th, 2024 at 10:35 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> 
>> > > I think you should give a version number to your patch using "-v"
>> > > option of git format-patch command.
>> > >
>> > > I have quick looked through the patch.
>> > >
>> > > t-ishii$ git apply ~/0001-feature-ipv6-for-watchdog.patch
>> > > /home/t-ishii/0001-feature-ipv6-for-watchdog.patch:343: trailing whitespace.
>> > >
>> > > /home/t-ishii/0001-feature-ipv6-for-watchdog.patch:355: trailing whitespace.
>> > >
>> > > warning: 2 lines add whitespace errors.
>> > >
>> > > Please remove unnecessary white spaces.
>> > >
>> > > Also I got compiler warnings:
>> > >
>> > > In file included from watchdog.c:51:
>> > > watchdog.c: In function 。watchdog_main「:
>> > > ../../src/include/utils/elog.h:199:4: warning: 。port「 may be used uninitialized in this function [-Wmaybe-uninitialized]
>> > > 199 | errfinish rest; \
>> > > | ^~~~~~~~~
>> > > watchdog.c:3439:8: note: 。port「 was declared here
>> > > 3439 | int port;
>> > > | ^~~~
>> > >
>> > > This is caused by following code snipet:
>> > >
>> > > + switch(ss.ss_family)
>> > > + {
>> > > + case AF_INET:
>> > > + inet_ntop(AF_INET, &((struct sockaddr_in*) &ss)->sin_addr, conn->addr, addrlen);
>> > > + port = ntohs(((struct sockaddr_in*)(&ss))->sin_port);
>> > > + break;
>> > > +
>> > > + case AF_INET6:
>> > > + inet_ntop(AF_INET6, &((struct sockaddr_in6*) &ss)->sin6_addr, conn->addr, addrlen);
>> > > + port = ntohs(((struct sockaddr_in6*)(&ss))->sin6_port);
>> > > + break;
>> > > + }
>> > > +
>> > > + ereport(LOG,
>> > > + (errmsg("new watchdog node connection is received from \"%s:%d\"",conn->addr, port)));
>> > > +
>> > >
>> > > Here the switch statement does not have "default" and if ss.ss_family
>> > > is not either AF_INET or AF_INET6, the value of "port" will be
>> > > undefined. That's why the compiler complains. I think you should have
>> > > "default" something like:
>> > >
>> > > + case AF_INET6:
>> > > + inet_ntop(AF_INET6, &((struct sockaddr_in6*) &ss)->sin6_addr, conn->addr, addrlen);
>> > > + port = ntohs(((struct sockaddr_in6*)(&ss))->sin6_port);
>> > > + break;
>> > > +
>> > > default:
>> > > ereport(ERROR,
>> > > (errmsg("unknown ss_family")));
>> > > break;
>> > >
>> > > BTW the indentations after CASE should be added. See other CASE
>> > > statements in pgpool.
>> > >
>> > > + else if (sel_ret == -1)
>> > > + {
>> > > + ereport(ERROR,
>> > > + (errmsg("failed to get socket data from socket list")
>> > > + ,errdetail("select() failed with errno %d", errno)));
>> > >
>> > > You can use %m" here without using errno. See codes using %m in pgpool.
>> > >
>> > > + }
>> > > + else
>> > > + {
>> > > + /* unreachable. */
>> > > + Assert(FALSE);
>> > > + }
>> > >
>> > > Assert cannot be used in pgpool for now.
>> > >
>> > > Regarding the regression test, I got a timeout error with
>> > > 028.watchdog_enable_consensus_with_half_votes test. Have you
>> > > succeeded?
>> >
>> > In addition, 036.trusted_servers also timed out here.
>>
>>
>> I am looking into this a little bit. The time out was caused by below
>> in the test script:
>>
>> function wait_for_watchdog_startup
>> {
>> while :
>> do
>> grep "lifecheck started" $log >/dev/null
>>
>> if [ $? = 0 ];then
>> break;
>> fi
>> sleep 1
>> done
>> }
>>
>> It looks for "lifecheck started" in the pgpool log. The message is
>> supposed to be issued by lifecheck_main().
>>
>> /* wait until ready to go */
>> while (WD_OK != is_wd_lifecheck_ready())
>> {
>> sleep(pool_config->wd_interval * 10);
>>
>> }
>>
>> ereport(LOG,
>> (errmsg("watchdog: lifecheck started")));
>>
>> Problem is, is_wd_lifecheck_ready() does not return WD_OK. In is_wd_lifecheck_ready(),
>>
>> ereport(DEBUG1,
>> (errmsg("watchdog checking life check is ready"),
>> errdetail("pgpool:%d at \"%s:%d\" has not send the heartbeat signal yet",
>> i, node->hostName, node->pgpoolPort)));
>>
>>
>> So I set log_min_messages to debug1 and got following in the log file.
>>
>> 2024-08-08 10:24:05.651: life_check pid 915179: DEBUG: watchdog checking life check is ready
>> 2024-08-08 10:24:05.651: life_check pid 915179: DETAIL: pgpool:1 at "localhost:50004" has not send the heartbeat signal yet
>> 2024-08-08 10:24:05.651: life_check pid 915179: DEBUG: watchdog checking life check is ready
>> 2024-08-08 10:24:05.651: life_check pid 915179: DETAIL: pgpool:2 at "localhost:50008" has not send the heartbeat signal yet
>>
>> It seems other watchdog's lifecheck process do not send the hearbeat
>> signal, or hearbeat receiver could not received the signal?
>>
>> Best reagards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp


More information about the pgpool-hackers mailing list