
Cluster lost files from the former primary's data dir and its replication slot #433

Open
marceloneppel opened this issue Mar 26, 2024 · 2 comments
Assignees: marceloneppel
Labels: bug (Something isn't working)

Comments

@marceloneppel (Member)
Steps to reproduce

  1. Deploy 3 PostgreSQL units.
  2. Restart all the units at once, simulating an outage (see the sketch below).
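
A minimal reproduction sketch, assuming the Kubernetes flavour of the charm (a later comment refers to a pod) and that the outage is simulated by restarting all units at once; the charm name, flags, and restart mechanism below are assumptions rather than details from the report:

juju deploy postgresql-k8s -n 3 --trust   # step 1: three PostgreSQL units (name/flags assumed)
juju status                               # wait for all units to settle to active/idle
# Step 2 (assumption): simulate the outage by restarting the Kubernetes node(s),
# or by deleting all three pods at roughly the same time, then let them recover.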

Expected behavior

All the units should have recovered from the outage.

Actual behavior

The cluster lost files from the former primary's data directory, and the former primary's replication slot was missing on the new primary.

Versions

Charm revision: 198

Log output

Patroni logs after the outage (in the former primary unit):

2024-03-22 14:26:44 UTC [196]: ERROR: Error when calling pg_controldata 
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 999, in controldata
    data = subprocess.check_output([self.pgcommand('pg_controldata'), self._data_dir], env=env)
  File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/lib/postgresql/14/bin/pg_controldata', '/var/lib/postgresql/data/pgdata']' returned non-zero exit status 1.
2024-03-22 14:57:36 UTC [196]: ERROR: Error when calling pg_controldata
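
For reference, the failing call can be reproduced by hand on the affected unit; a hedged sketch, reusing the binary and data-dir paths from the traceback above (running it as the database user inside the workload container is an assumption):

/usr/lib/postgresql/14/bin/pg_controldata /var/lib/postgresql/data/pgdata
# A non-zero exit status here (e.g. "could not open file .../global/pg_control")
# would confirm that files are missing from the data directory, matching the
# CalledProcessError that Patroni logged above.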

PostgreSQL logs after calling curl {former-primary-ip}:8008/reinitialize (in the former primary unit):

2024-03-22 15:36:04 UTC [645]: user=,db=,app=,client=,line=1 FATAL:  could not start WAL streaming: ERROR:  replication slot "postgresql_db_0" does not exist
2024-03-22 15:36:10 UTC [630]: user=,db=,app=,client=,line=7 LOG:  received immediate shutdown request
2024-03-22 15:36:10 UTC [630]: user=,db=,app=,client=,line=8 LOG:  database system is shut down
2024-03-22 15:36:24 UTC [754]: user=,db=,app=,client=,line=3 LOG:  starting PostgreSQL 14.10 (Ubuntu 14.10-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2024-03-22 15:36:24 UTC [754]: user=,db=,app=,client=,line=4 LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-03-22 15:36:25 UTC [754]: user=,db=,app=,client=,line=5 LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-03-22 15:36:25 UTC [757]: user=,db=,app=,client=,line=1 LOG:  database system was interrupted; last known up at 2024-03-22 15:36:17 UTC
2024-03-22 15:36:25 UTC [757]: user=,db=,app=,client=,line=2 LOG:  entering standby mode
2024-03-22 15:36:25 UTC [757]: user=,db=,app=,client=,line=3 LOG:  redo starts at 26/1000028
2024-03-22 15:36:25 UTC [757]: user=,db=,app=,client=,line=4 LOG:  consistent recovery state reached at 26/1003860
2024-03-22 15:36:25 UTC [754]: user=,db=,app=,client=,line=6 LOG:  database system is ready to accept read-only connections
2024-03-22 15:36:28 UTC [754]: user=,db=,app=,client=,line=7 LOG:  received immediate shutdown request
2024-03-22 15:36:28 UTC [754]: user=,db=,app=,client=,line=8 LOG:  database system is shut down
2024-03-22 15:36:37 UTC [855]: user=,db=,app=,client=,line=3 LOG:  starting PostgreSQL 14.10 (Ubuntu 14.10-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2024-03-22 15:36:37 UTC [855]: user=,db=,app=,client=,line=4 LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-03-22 15:36:37 UTC [855]: user=,db=,app=,client=,line=5 LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-03-22 15:36:37 UTC [858]: user=,db=,app=,client=,line=1 LOG:  database system was interrupted; last known up at 2024-03-22 15:36:29 UTC
2024-03-22 15:36:37 UTC [858]: user=,db=,app=,client=,line=2 LOG:  entering standby mode
2024-03-22 15:36:37 UTC [858]: user=,db=,app=,client=,line=3 LOG:  redo starts at 26/3000028
2024-03-22 15:36:37 UTC [858]: user=,db=,app=,client=,line=4 LOG:  consistent recovery state reached at 26/3003760
2024-03-22 15:36:37 UTC [855]: user=,db=,app=,client=,line=6 LOG:  database system is ready to accept read-only connections
2024-03-22 15:36:37 UTC [862]: user=,db=,app=,client=,line=1 FATAL:  could not start WAL streaming: ERROR:  replication slot "postgresql_db_0" does not exist
2024-03-22 15:36:37 UTC [863]: user=,db=,app=,client=,line=1 FATAL:  could not start WAL streaming: ERROR:  replication slot "postgresql_db_0" does not exist
2024-03-22 15:36:40 UTC [855]: user=,db=,app=,client=,line=7 LOG:  received immediate shutdown request
2024-03-22 15:36:40 UTC [855]: user=,db=,app=,client=,line=8 LOG:  database system is shut down
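
For clarity, the reinitialize call mentioned above, spelled out as commands; a hedged sketch (Patroni documents /reinitialize as a POST endpoint, and /cluster can be queried afterwards to watch the member state; port 8008 comes from the logs):

curl -X POST http://{former-primary-ip}:8008/reinitialize   # wipe and re-clone the former primary
curl http://{former-primary-ip}:8008/cluster                # inspect member state afterwards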

The replication slot is definitely missing (in the new primary unit):

postgres=# select slot_name from pg_replication_slots;
      slot_name       
----------------------
 postgresql_db_1
 pg_basebackup_718248
(2 rows)

Additional context

Matrix discussion.

The issue was solved by creating the replication slot manually (in the new primary unit):

select pg_create_physical_replication_slot('postgresql_db_0');
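
A hedged sketch of verifying the workaround afterwards, still on the new primary unit (the psql invocation is an assumption about how the unit is accessed; the slot name comes from the logs above):

psql -U postgres -c "select slot_name, active from pg_replication_slots;"
# postgresql_db_0 should now be listed, and active should flip to 't' once the
# former primary reconnects and resumes WAL streaming from it.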
marceloneppel added the bug (Something isn't working) label on Mar 26, 2024
marceloneppel self-assigned this on Jul 15, 2024
@marceloneppel (Member, Author)

I managed to reproduce this error locally on a Multipass VM. What deleted the replication slot was the absence of Patroni-specific labels in the pod, which weren't set because the upgrade-charm hook hadn't fired in the unit after its restart.
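
A hypothetical way to check for that condition, assuming a Kubernetes deployment where Patroni-specific labels are expected on the unit pods (namespace and label names are placeholders, not values from the report):

kubectl -n <model-namespace> get pods --show-labels
# On an affected unit the Patroni-specific labels mentioned above would be absent
# after the restart, which is what led to the replication slot being dropped.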
