
[DPE-4839] Point In Time Recovery #554

Merged
merged 20 commits on Jul 26, 2024

Conversation

@Zvirovyi (Contributor) commented on Jul 5, 2024

This is a port of the same functionality from the VM charm.
Resolves #553.

Point In Time Recovery

Add a 'restore-to-time' parameter to the 'restore' action to enable point-in-time recovery. Also, add a 'test_pitr_backup' integration test.

Overview

While the PostgreSQL charm is active, it writes a write-ahead log (WAL): a log of all transactions with their corresponding timestamps. Periodically (mostly based on a minimum size constraint), a new WAL file is generated and then archived to the stanza through the s3-integrator relation. This WAL is useful if you want to restore not only a backup but also the transactions that occurred after it; this is called Point-In-Time Recovery (PITR).
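
As a quick, hedged illustration (not part of this PR), you can watch the WAL being written and archived with standard PostgreSQL views; the sketch below assumes psycopg2 and placeholder connection details for your deployment:

import psycopg2

# Placeholder connection details; substitute the unit address and operator credentials.
conn = psycopg2.connect(host="<postgresql-unit-ip>", user="operator",
                        password="<operator-password>", dbname="postgres")
with conn, conn.cursor() as cur:
    # Current write position inside the WAL.
    cur.execute("SELECT pg_current_wal_lsn()")
    print("current WAL position:", cur.fetchone()[0])
    # Last WAL segment successfully shipped to the archive (the S3 stanza here).
    cur.execute("SELECT last_archived_wal, last_archived_time FROM pg_stat_archiver")
    print("last archived segment:", cur.fetchone())
conn.close()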

Use-case example

Let's assume that you have a model with PostgreSQL and s3-integrator units, both configured and integrated with each other.

First, you need a standard backup; it will be the base onto which the WAL is later applied:

juju run postgresql/leader create-backup

Then, run some transactions and switch the WAL file manually (for the sake of the showcase):

create table asd(message text);
select current_timestamp; -- 2024-04-22 06:54:42.280896+00
insert into asd values ('test1');
select current_timestamp; -- 2024-04-22 06:55:08.442248+00
insert into asd values ('test2');
select current_timestamp; -- 2024-04-22 06:55:20.442248+00 (not reachable as a PITR target, since the last transaction occurred in the row above)
select pg_switch_wal();
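
If you prefer to script this showcase, here is a minimal sketch using psycopg2 (connection details are placeholders, and the table name matches the example above); it prints a candidate restore-to-time target right before each insert, so every printed timestamp is followed by at least one committed transaction and is therefore reachable:

import psycopg2

# Placeholder connection details for the deployed PostgreSQL unit.
conn = psycopg2.connect(host="<postgresql-unit-ip>", user="operator",
                        password="<operator-password>", dbname="postgres")
conn.autocommit = True  # each statement commits on its own, like the psql example above
with conn.cursor() as cur:
    cur.execute("CREATE TABLE asd (message text)")
    for value in ("test1", "test2"):
        cur.execute("SELECT current_timestamp")
        print("target before", value, "->", cur.fetchone()[0])  # usable as restore-to-time
        cur.execute("INSERT INTO asd VALUES (%s)", (value,))
    cur.execute("SELECT pg_switch_wal()")  # force the current WAL file to be archived
conn.close()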

Shortly after that, you can start your PITR:

juju run postgresql/leader restore restore-to-time="2024-04-22 06:54:42.280896+00"

In this example you can also use the second timestamp as the target, but not the third one, since the third timestamp was selected after all the transactions. The database only knows the times of its transactions; time outside of them is unknown to it. So, if you try to run PITR with the third timestamp, it will fail with the error "cannot reach PITR target".

For cases where you need to apply all of the transactions in the WAL, you can use juju run postgresql/leader restore restore-to-time="latest".

After any restore (even an ordinary full-backup restore), the database will end up in the "Move restored cluster to another S3 bucket" status.

Important notes

Move restored cluster to another S3 bucket

The database will now enter the "Move restored cluster to another S3 bucket" status after every recovery. This is done on purpose to warn the user: restoring a backup overwrites the current WAL timeline, and this forks the tree of events starting from the backup's starting point.

For example:

  1. The user has a full backup from day 1 and WAL for day 1 - day 5.
  2. Then the user decides to recover to day 2 (re-switching to the same S3 bucket) and fills the WAL through day 5 - day 8.
  3. From now on, the user has lost the opportunity to recover to the day 2 - day 5 period, as the new timeline has become the latest timeline.

Technically, the day 2 - day 5 period will still be in the WAL archive; only the opportunity to use it will be lost.

So, requiring the user to change the stanza ensures that no such data is lost.

For a failed PITR, juju debug-log will report the latest completed transaction time

unit-postgresql-0: 07:07:48 ERROR unit.postgresql/0.juju-log Restore failed: database service failed to reach point-in-time-recovery target. You can launch another restore with different parameters
unit-postgresql-0: 07:07:48 ERROR unit.postgresql/0.juju-log Last completed transaction was at 2024-04-22 04:01:36.480673+00

Differences from the VM version:

  1. Instead of systemd, Pebble is used here as the service management tool. There is a small hack where the charm runs the pebble logs postgresql command instead of calling it through the API (PostgresqlOperatorCharm.is_pitr_failed); it is not critical and can be improved later by adding the required method to the Pebble client (see the sketch after this list).
  2. There are minor differences between events in the VM and K8s charms, but the main functionality works.
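
For illustration only, here is a minimal sketch of the idea from point 1, assuming the Pebble binary is available at /charm/bin/pebble inside the workload container and that a failed PITR surfaces in the service log as PostgreSQL's "recovery ended before configured recovery target was reached" message; the actual code in this PR may differ:

from ops.model import Container

# Assumed log pattern and pebble path; both are illustrative, not taken from the PR.
PITR_TARGET_NOT_REACHED = "recovery ended before configured recovery target was reached"

def looks_like_failed_pitr(container: Container) -> bool:
    # Run `pebble logs postgresql` inside the workload container and scan its output,
    # since the Pebble client API used by the charm has no direct "logs" method.
    process = container.exec(["/charm/bin/pebble", "logs", "postgresql"])
    stdout, _ = process.wait_output()
    return PITR_TARGET_NOT_REACHED in stdout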


codecov bot commented Jul 17, 2024

Codecov Report

Attention: Patch coverage is 30.61224% with 102 lines in your changes missing coverage. Please review.

Project coverage is 68.74%. Comparing base (d027e56) to head (636d97e).

Files            Patch %    Lines
src/charm.py     26.08%     45 Missing and 6 partials ⚠️
src/backups.py   38.98%     26 Missing and 10 partials ⚠️
src/patroni.py   11.76%     15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #554      +/-   ##
==========================================
- Coverage   70.94%   68.74%   -2.21%     
==========================================
  Files          10       10              
  Lines        2853     2972     +119     
  Branches      536      563      +27     
==========================================
+ Hits         2024     2043      +19     
- Misses        726      811      +85     
- Partials      103      118      +15     


@marceloneppel (Member) left a comment

LGTM! Thanks @Zvirovyi!

One last change is needed to make one of the backup tests pass.

In tests/integration/test_backups.py's test_restore_on_new_cluster(), you need to replace the following code (starting at line 290):

    # Wait for the restore to complete.
    async with ops_test.fast_forward():
        await wait_for_idle_on_blocked(
            ops_test,
            database_app_name,
            0,
            S3_INTEGRATOR_APP_NAME,
            ANOTHER_CLUSTER_REPOSITORY_ERROR_MESSAGE,
        )

With:

    # Wait for the restore to complete.
    async with ops_test.fast_forward():
        await wait_for_idle_on_blocked(
            ops_test,
            database_app_name,
            0,
            S3_INTEGRATOR_APP_NAME,
            MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET, # <--- The constant needs to be changed to this one.
        )

The other backup tests are already passing.

Development

Successfully merging this pull request may close these issues.

Add Point-In-Time-Recovery
3 participants