
Cluster lost files from the former primary's data dir and its replication slot #433

Open
marceloneppel opened this issue Mar 26, 2024 · 2 comments
Assignees: marceloneppel
Labels: bug (Something isn't working)

Comments

@marceloneppel (Member)
Steps to reproduce

  1. Deploy 3 PostgreSQL units.
  2. Restart all the units at once, simulating an outage (see the sketch below).
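
A minimal reproduction sketch, assuming the Kubernetes flavour of the charm (a later comment refers to a pod) and that the outage is simulated by restarting all units at once; the charm name, flags, and restart mechanism below are assumptions rather than details from the report:

juju deploy postgresql-k8s -n 3 --trust   # step 1: three PostgreSQL units (name/flags assumed)
juju status                               # wait for all units to settle to active/idle
# Step 2 (assumption): simulate the outage by restarting the Kubernetes node(s),
# or by deleting all three pods at roughly the same time, then let them recover.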

Expected behavior

All the units should have recovered from the outage.

Actual behavior

The cluster lost files from the former primary's data directory, and the former primary's replication slot was missing on the new primary.

Versions

Charm revision: 198

Log output

Patroni logs after the outage (in the former primary unit):

2024-03-22 14:26:44 UTC [196]: ERROR: Error when calling pg_controldata 
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 999, in controldata
    data = subprocess.check_output([self.pgcommand('pg_controldata'), self._data_dir], env=env)
  File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/lib/postgresql/14/bin/pg_controldata', '/var/lib/postgresql/data/pgdata']' returned non-zero exit status 1.
2024-03-22 14:57:36 UTC [196]: ERROR: Error when calling pg_controldata
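
For reference, the failing call can be reproduced by hand on the affected unit; a hedged sketch, reusing the binary and data-dir paths from the traceback above (running it as the database user inside the workload container is an assumption):

/usr/lib/postgresql/14/bin/pg_controldata /var/lib/postgresql/data/pgdata
# A non-zero exit status here (e.g. "could not open file .../global/pg_control")
# would confirm that files are missing from the data directory, matching the
# CalledProcessError that Patroni logged above.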

PostgreSQL logs after calling curl {former-primary-ip}:8008/reinitialize (in the former primary unit):

2024-03-22 15:36:04 UTC [645]: user=,db=,app=,client=,line=1 FATAL:  could not start WAL streaming: ERROR:  replication slot "postgresql_db_0" does not exist
2024-03-22 15:36:10 UTC [630]: user=,db=,app=,client=,line=7 LOG:  received immediate shutdown request
2024-03-22 15:36:10 UTC [630]: user=,db=,app=,client=,line=8 LOG:  database system is shut down
2024-03-22 15:36:24 UTC [754]: user=,db=,app=,client=,line=3 LOG:  starting PostgreSQL 14.10 (Ubuntu 14.10-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2024-03-22 15:36:24 UTC [754]: user=,db=,app=,client=,line=4 LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-03-22 15:36:25 UTC [754]: user=,db=,app=,client=,line=5 LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-03-22 15:36:25 UTC [757]: user=,db=,app=,client=,line=1 LOG:  database system was interrupted; last known up at 2024-03-22 15:36:17 UTC
2024-03-22 15:36:25 UTC [757]: user=,db=,app=,client=,line=2 LOG:  entering standby mode
2024-03-22 15:36:25 UTC [757]: user=,db=,app=,client=,line=3 LOG:  redo starts at 26/1000028
2024-03-22 15:36:25 UTC [757]: user=,db=,app=,client=,line=4 LOG:  consistent recovery state reached at 26/1003860
2024-03-22 15:36:25 UTC [754]: user=,db=,app=,client=,line=6 LOG:  database system is ready to accept read-only connections
2024-03-22 15:36:28 UTC [754]: user=,db=,app=,client=,line=7 LOG:  received immediate shutdown request
2024-03-22 15:36:28 UTC [754]: user=,db=,app=,client=,line=8 LOG:  database system is shut down
2024-03-22 15:36:37 UTC [855]: user=,db=,app=,client=,line=3 LOG:  starting PostgreSQL 14.10 (Ubuntu 14.10-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2024-03-22 15:36:37 UTC [855]: user=,db=,app=,client=,line=4 LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-03-22 15:36:37 UTC [855]: user=,db=,app=,client=,line=5 LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-03-22 15:36:37 UTC [858]: user=,db=,app=,client=,line=1 LOG:  database system was interrupted; last known up at 2024-03-22 15:36:29 UTC
2024-03-22 15:36:37 UTC [858]: user=,db=,app=,client=,line=2 LOG:  entering standby mode
2024-03-22 15:36:37 UTC [858]: user=,db=,app=,client=,line=3 LOG:  redo starts at 26/3000028
2024-03-22 15:36:37 UTC [858]: user=,db=,app=,client=,line=4 LOG:  consistent recovery state reached at 26/3003760
2024-03-22 15:36:37 UTC [855]: user=,db=,app=,client=,line=6 LOG:  database system is ready to accept read-only connections
2024-03-22 15:36:37 UTC [862]: user=,db=,app=,client=,line=1 FATAL:  could not start WAL streaming: ERROR:  replication slot "postgresql_db_0" does not exist
2024-03-22 15:36:37 UTC [863]: user=,db=,app=,client=,line=1 FATAL:  could not start WAL streaming: ERROR:  replication slot "postgresql_db_0" does not exist
2024-03-22 15:36:40 UTC [855]: user=,db=,app=,client=,line=7 LOG:  received immediate shutdown request
2024-03-22 15:36:40 UTC [855]: user=,db=,app=,client=,line=8 LOG:  database system is shut down
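
For clarity, the reinitialize call mentioned above, spelled out as commands; a hedged sketch (Patroni documents /reinitialize as a POST endpoint, and /cluster can be queried afterwards to watch the member state; port 8008 comes from the logs):

curl -X POST http://{former-primary-ip}:8008/reinitialize   # wipe and re-clone the former primary
curl http://{former-primary-ip}:8008/cluster                # inspect member state afterwards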

The replication slot is definitely missing (in the new primary unit):

postgres=# select slot_name from pg_replication_slots;
      slot_name       
----------------------
 postgresql_db_1
 pg_basebackup_718248
(2 rows)

Additional context

Matrix discussion.

The issue was solved by creating the replication slot manually (in the new primary unit):

select pg_create_physical_replication_slot('postgresql_db_0');
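
A hedged sketch of verifying the workaround afterwards, still on the new primary unit (the psql invocation is an assumption about how the unit is accessed; the slot name comes from the logs above):

psql -U postgres -c "select slot_name, active from pg_replication_slots;"
# postgresql_db_0 should now be listed, and active should flip to 't' once the
# former primary reconnects and resumes WAL streaming from it.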
marceloneppel added the bug (Something isn't working) label on Mar 26, 2024
marceloneppel self-assigned this on Jul 15, 2024
@marceloneppel (Member, Author)

I managed to reproduce this error locally on a Multipass VM. What deleted the replication slot was the absence of Patroni-specific labels in the pod, which weren't set because the upgrade-charm hook hadn't fired in the unit after its restart.
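
A hypothetical way to check for that condition, assuming a Kubernetes deployment where Patroni-specific labels are expected on the unit pods (namespace and label names are placeholders, not values from the report):

kubectl -n <model-namespace> get pods --show-labels
# On an affected unit the Patroni-specific labels mentioned above would be absent
# after the restart, which is what led to the replication slot being dropped.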
