[pgpool-general: 8950] Re: pgpool 4.4.4: reading status file: 1 th backend is set to down status
Tatsuo Ishii
ishii at sraoss.co.jp
Tue Oct 17 18:26:33 JST 2023
> Hi Tatsuo,
>
> thank you so much for the detailed explanation. It all makes sense to me, but one detail in how the pieces of software interact does not add up from my point of view: why is pgpool trying to prevent a split-brain scenario? I'll explain my thoughts, and maybe you can point out where pgpool is misconfigured or where my misunderstanding lies.
>
> As I mentioned, I'm using containers packed by Bitnami, which use pgpool as a proxy so that clients always find the primary instance<https://hub.docker.com/r/bitnami/pgpool>, and repmgr plus PostgreSQL to provide an HA solution for Postgres<https://hub.docker.com/r/bitnami/postgresql-repmgr>.
> What is Pgpool-II?
> Pgpool-II is the PostgreSQL proxy. It stands between PostgreSQL servers and their clients providing connection pooling, load balancing, automated failover, and replication.
>
> What is PostgreSQL HA?
> This PostgreSQL cluster solution includes the PostgreSQL replication manager, an open-source tool for managing replication and failover on PostgreSQL clusters.
>
> As far as I understand, Bitnami designed it so that Postgres in combination with repmgr<https://www.repmgr.org/> does all the high-availability work, while pgpool should "just" forward connections from the clients to the primary instance of Postgres.
Pgpool-II has evolved into cluster software. The proxy is just one of
Pgpool-II's features.
> In the error case I'm describing here, only pgpool has a problem with the situation. The cluster itself is working, and the connection to every pg and pgpool is working. It is just that one pgpool remains unhealthy: it can't determine a primary because it has marked the server that is in fact the primary as down.
Anybody can combine Pgpool-II with other software, but then it's
his/her responsibility to understand the functionality/interface of
Pgpool-II and make the whole system work, not Pgpool-II's
responsibility. To be honest, I don't understand why you need repmgr
in the first place, because Pgpool-II already has the HA feature too.
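
For reference, pgpool's own HA is driven by its health check. A minimal pgpool.conf sketch (the values are illustrative, not a recommendation):

    health_check_period = 10         # seconds between health checks (0 disables them)
    health_check_max_retries = 3     # retries before a node is considered down
    health_check_user = 'postgres'   # user pgpool connects as for the check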
> From my point of view (and I'm just the user in the middle of the software giants Postgres, repmgr, pgpool and Bitnami), I think there is a gap in the configuration of pgpool to make it act as just a proxy: nothing to do during failover, nothing extra. Therefore I decided to start my investigation here at the pgpool mailing list. As far as I understand it, if pgpool ignored/disabled the pgpool_status file, my scenario would be fine. Maybe there is another configuration which does exactly what Bitnami expected pgpool to do. I think you are the very person who would know this. Therefore the question: What do I need to configure so that pgpool acts as just a proxy (which was the intention of Bitnami)? OR: How can I disable the pgpool_status file (I was already thinking of deleting it in the entry point of the container)?
You can ignore the pgpool_status file by starting Pgpool-II with the -D
option. See the manual for more details. Note, however, that in this
case the split brain problem must be prevented by other software.
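
For example, with the Bitnami layout it could look like this (a sketch; the config file path is an assumption, only the -D flag itself is from the manual):

    # -D discards the pgpool_status file, so a stale "down" state
    # is not restored on startup
    pgpool -D -f /opt/bitnami/pgpool/conf/pgpool.conf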
> Thank you so much, I appreciate your expertise,
You are welcome!
> best regards, Daniel
>
> -----Original Message-----
> From: Tatsuo Ishii <ishii at sraoss.co.jp>
> Sent: Tuesday, 17 October 2023 02:49
> To: Camarena Daniel <Daniel.Camarena at azo.com>
> Cc: pgpool-general at pgpool.net
> Subject: Re: AW: [pgpool-general: 8942] pgpool 4.4.4: reading status file: 1 th backend is set to down status
>
>> Hi Tatsuo,
>>
>> thanks for your reply and the explanation.
>>
>> To comment on your answers:
>>
>>> > 1. Is there a file which buffers pg states?
>>>
>>> If you mean the "pg_status" column in the show pool_nodes command, no. It is obtained from PostgreSQL on the fly when the show pool_nodes command gets executed.
>>
>> Yes. But it seems that a resulting state is derived from pg_status, shown in the "status" column of show pool_nodes (see results below), and this indicates that the service is down - and pgpool acts accordingly. See the pgpool log below: it indicates that it is marking the 0 th node as down because of the "status file".
>
> I think you are talking about the "status" column, rather than the "pg_status" column in the show pool_nodes command. The "status" column is set to down by pgpool. The "status" is set according to the content of the pgpool_status file upon startup of pgpool. If it's set to "down", it is not set to "up" until pcp_attach_node or pcp_recovery_node is executed. The reason is explained below.
>
>>> > 2. How did the system get into this state?
>>>
>>> I am not familiar with bitnami pgpool nor repmgr. So what I can do is answer from the pgpool point of view. It was caused either by a failover triggered by the health check (pgpool detects an error / shutdown of PostgreSQL), or by pcp_detach_node being executed. I cannot tell which without looking into the pgpool log and pgpool.conf.
>>
>> Pg0 had tons of these messages:
>>
>> 2023-10-11 11:19:03.522 GMT [956538] FATAL: remaining connection slots are reserved for non-replication superuser connections
>> 2023-10-11 11:19:03.525 GMT [956537] FATAL: remaining connection slots are reserved for non-replication superuser connections
>> 2023-10-11 11:19:03.542 GMT [956539] FATAL: remaining connection slots are reserved for non-replication superuser connections
>> 2023-10-11 11:19:03.545 GMT [956540] FATAL: remaining connection slots are reserved for non-replication superuser connections
>>
>> Pg1 is showing the same messages right now, as I examine the system. They appear from time to time, and it seems that because of them a failover occurs - like you described before.
>>
>> Should I just increase max_connections (default 100) to 200 to prevent the problem?
>
> Yes, I think so. The above FATAL error could cause the health_check to trigger failover depending on the setting of pgpool.conf.
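
For illustration, raising the limit on each PostgreSQL backend could look like this (a sketch; 200 is just the value discussed above, and the data directory path is hypothetical):

    # as a superuser, on each backend:
    psql -U postgres -c "ALTER SYSTEM SET max_connections = 200;"
    # max_connections only takes effect after a restart:
    pg_ctl restart -D /path/to/data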
>
>> In the meantime I have found a file in the logs folder of pgpool. It has the following content:
>>
>> root at c8bdc87693d4:/opt/bitnami/pgpool/logs# cat pgpool_status
>> down
>> up
>> up
>>
>> As pgpool logs the following line during startup
>>
>> 2023-10-16 05:28:21.670: main pid 1: LOG: reading status file: 0 th backend is set to down status
>>
>> I thought this file is read and the status of pg0 is overridden by it.
>
> Yes, your guess is correct. This is necessary to prevent the "split brain" problem. Suppose you have PostgreSQL node 0 (primary) and node 1 (standby).
>
> 1) node 0 goes down by admin.
> 2) node 1 becomes the new primary node.
> 3) the whole system (pgpool, node 0 and node 1) restarts.
> 4) node 0 starts as primary, node 1 starts as primary.
> 5) now you have two primary nodes; this is the split brain.
>
> The pgpool_status file prevents this situation. In #3, pgpool sets the node 0 status to down by reading the pgpool_status file, and thereby prevents situation #5.
>
>> show pool_nodes; returns the following:
>>
>>  node_id | hostname  | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
>> ---------+-----------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
>>  0       | 10.0.10.7 | 5432 | down   | up        | 0.333333  | standby | primary | 0          | false             | 0                 |                   |                        | 2023-10-16 05:29:17
>>  1       | 10.0.10.8 | 5432 | up     | up        | 0.333333  | standby | standby | 5          | false             | 0                 |                   |                        | 2023-10-16 05:29:17
>>  2       | 10.0.10.9 | 5432 | up     | up        | 0.333333  | standby | standby | 11         | true              | 0                 |                   |                        | 2023-10-16 05:29:17
>> (3 rows)
>>
>
>> This indicates that pg_role of pg0 is primary, but the resulting role is standby because the resulting status is down, even though pg_status is up.
>
> That's expected behavior. Since pgpool recognizes node 0 as "down" by reading the pgpool_status file, it does not check node 0 to see whether it is primary or not. Pgpool only checks node 1 and node 2, and concludes that there's no primary.
>
> I think you can use pcp_attach_node on node 0.
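
For example (a sketch; the pcp port 9898 and the user name are assumptions based on a typical setup):

    # re-attach backend node 0; pgpool will then check it again
    # and can discover that it is the primary
    pcp_attach_node -h localhost -p 9898 -U pgpool -n 0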
>
>> As the orchestration always starts pgpool anew, I post the startup sequence of the container:
>
> [snip]
>
>> 2023-10-16 05:28:21.757: main pid 1: LOG: find_primary_node_repeatedly: waiting for finding a primary node
>> 2023-10-16 05:28:21.793: main pid 1: LOG: find_primary_node: standby node is 1
>> 2023-10-16 05:28:21.793: main pid 1: LOG: find_primary_node: standby node is 2
>
> Here, pgpool looked for the primary node only among nodes 1 and 2, for the reason above.
>
>> Last but not least, the pgpool.conf you requested. I left the comments in the file:
>
> Thanks. I noticed that "failover_command = 'echo...'". This means that when the primary goes down, pgpool does not elect a new primary. I am not sure if this is the correct idea. Maybe the orchestration tool does something which is out of the scope of pgpool?
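
For comparison, a pgpool-driven failover would be wired up roughly like this (a sketch only; the script path is hypothetical, and in the Bitnami design repmgr is supposed to handle the promotion instead):

    # pgpool.conf: %d = failed node id, %h = failed host,
    # %m = new main node id, %H = new main host
    failover_command = '/path/to/failover.sh %d %h %m %H'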
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese: http://www.sraoss.co.jp/