
[DPE-4839] Point In Time Recovery #554

Merged
merged 20 commits on Jul 26, 2024

Conversation

@Zvirovyi (Contributor) commented on Jul 5, 2024

This is a port of the same functionality from the VM charm.
Resolves #553.

Point In Time Recovery

Add a 'restore-to-time' parameter to the 'restore' action to enable point-in-time recovery. Also, add a 'test_pitr_backup' integration test.

Overview

While the PostgreSQL charm is active, it writes a write-ahead log (WAL): a log of all transactions with their corresponding timestamps. Periodically (mostly based on a minimum size constraint), a new WAL file is generated and then archived to the stanza through the s3-integrator relation. This WAL is useful if you want to restore not only a backup but also the transactions that occurred after it; this is called Point-In-Time Recovery (PITR).
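
As a quick, hedged illustration (not part of this PR), you can watch the WAL being written and archived with standard PostgreSQL views; the sketch below assumes psycopg2 and placeholder connection details for your deployment:

import psycopg2

# Placeholder connection details; substitute the unit address and operator credentials.
conn = psycopg2.connect(host="<postgresql-unit-ip>", user="operator",
                        password="<operator-password>", dbname="postgres")
with conn, conn.cursor() as cur:
    # Current write position inside the WAL.
    cur.execute("SELECT pg_current_wal_lsn()")
    print("current WAL position:", cur.fetchone()[0])
    # Last WAL segment successfully shipped to the archive (the S3 stanza here).
    cur.execute("SELECT last_archived_wal, last_archived_time FROM pg_stat_archiver")
    print("last archived segment:", cur.fetchone())
conn.close()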

Use-case example

Let's assume that you have a model with PostgreSQL and s3-integrator units, both configured and integrated with each other.

First, you need a standard backup; it will be the base onto which the WAL is later applied:

juju run postgresql/leader create-backup

Then, run some transactions and switch the WAL file manually (for the sake of the showcase):

create table asd(message text);
select current_timestamp; -- 2024-04-22 06:54:42.280896+00
insert into asd values ('test1');
select current_timestamp; -- 2024-04-22 06:55:08.442248+00
insert into asd values ('test2');
select current_timestamp; -- 2024-04-22 06:55:20.442248+00 (not reachable as a PITR target, since the last transaction occurred in the row above)
select pg_switch_wal();
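
If you prefer to script this showcase, here is a minimal sketch using psycopg2 (connection details are placeholders, and the table name matches the example above); it prints a candidate restore-to-time target right before each insert, so every printed timestamp is followed by at least one committed transaction and is therefore reachable:

import psycopg2

# Placeholder connection details for the deployed PostgreSQL unit.
conn = psycopg2.connect(host="<postgresql-unit-ip>", user="operator",
                        password="<operator-password>", dbname="postgres")
conn.autocommit = True  # each statement commits on its own, like the psql example above
with conn.cursor() as cur:
    cur.execute("CREATE TABLE asd (message text)")
    for value in ("test1", "test2"):
        cur.execute("SELECT current_timestamp")
        print("target before", value, "->", cur.fetchone()[0])  # usable as restore-to-time
        cur.execute("INSERT INTO asd VALUES (%s)", (value,))
    cur.execute("SELECT pg_switch_wal()")  # force the current WAL file to be archived
conn.close()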

Shortly after that, you can start your PITR:

juju run postgresql/leader restore restore-to-time="2024-04-22 06:54:42.280896+00"

In this example you can also use the second timestamp as the target, but not the third one, since the third timestamp was selected after all the transactions. The database only knows the times of its transactions; time outside of them is unknown to it. So, if you try to run PITR with the third timestamp, it will fail with the error "cannot reach PITR target".

For cases where you need to apply all of the transactions in the WAL, you can use juju run postgresql/leader restore restore-to-time="latest".

After any restore (even an ordinary full-backup restore), the database will end up in the "Move restored cluster to another S3 bucket" status.

Important notes

Move restored cluster to another S3 bucket

The database will now enter the "Move restored cluster to another S3 bucket" status after every recovery. This is done on purpose to warn the user: restoring a backup overwrites the current WAL timeline, and this forks the tree of events starting from the backup's starting point.

For example:

  1. The user has a full backup from day 1 and WAL for day 1 - day 5.
  2. Then the user decides to recover to day 2 (re-switching to the same S3 bucket) and fills the WAL through day 5 - day 8.
  3. From now on, the user has lost the opportunity to recover to the day 2 - day 5 period, as the new timeline has become the latest timeline.

Technically, the day 2 - day 5 period will still be in the WAL archive; only the opportunity to use it will be lost.

So, requiring the user to change the stanza ensures that no such data is lost.

For a failed PITR, juju debug-log will report the latest completed transaction time

unit-postgresql-0: 07:07:48 ERROR unit.postgresql/0.juju-log Restore failed: database service failed to reach point-in-time-recovery target. You can launch another restore with different parameters
unit-postgresql-0: 07:07:48 ERROR unit.postgresql/0.juju-log Last completed transaction was at 2024-04-22 04:01:36.480673+00

Differences from the VM version:

  1. Instead of systemd, Pebble is used here as the service management tool. There is a small hack where the charm runs the pebble logs postgresql command instead of calling it through the API (PostgresqlOperatorCharm.is_pitr_failed); it is not critical and can be improved later by adding the required method to the Pebble client (see the sketch after this list).
  2. There are minor differences between events in the VM and K8s charms, but the main functionality works.
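
For illustration only, here is a minimal sketch of the idea from point 1, assuming the Pebble binary is available at /charm/bin/pebble inside the workload container and that a failed PITR surfaces in the service log as PostgreSQL's "recovery ended before configured recovery target was reached" message; the actual code in this PR may differ:

from ops.model import Container

# Assumed log pattern and pebble path; both are illustrative, not taken from the PR.
PITR_TARGET_NOT_REACHED = "recovery ended before configured recovery target was reached"

def looks_like_failed_pitr(container: Container) -> bool:
    # Run `pebble logs postgresql` inside the workload container and scan its output,
    # since the Pebble client API used by the charm has no direct "logs" method.
    process = container.exec(["/charm/bin/pebble", "logs", "postgresql"])
    stdout, _ = process.wait_output()
    return PITR_TARGET_NOT_REACHED in stdout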


codecov bot commented Jul 17, 2024

Codecov Report

Attention: Patch coverage is 30.61224% with 102 lines in your changes missing coverage. Please review.

Project coverage is 68.74%. Comparing base (d027e56) to head (636d97e).

Files            Patch %    Lines
src/charm.py     26.08%     45 Missing and 6 partials ⚠️
src/backups.py   38.98%     26 Missing and 10 partials ⚠️
src/patroni.py   11.76%     15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #554      +/-   ##
==========================================
- Coverage   70.94%   68.74%   -2.21%     
==========================================
  Files          10       10              
  Lines        2853     2972     +119     
  Branches      536      563      +27     
==========================================
+ Hits         2024     2043      +19     
- Misses        726      811      +85     
- Partials      103      118      +15     


@marceloneppel (Member) left a comment

LGTM! Thanks @Zvirovyi!

One last change is needed to make one of the backup tests pass.

In tests/integration/test_backups.py's test_restore_on_new_cluster(), you need to replace the following code (starting at line 290):

    # Wait for the restore to complete.
    async with ops_test.fast_forward():
        await wait_for_idle_on_blocked(
            ops_test,
            database_app_name,
            0,
            S3_INTEGRATOR_APP_NAME,
            ANOTHER_CLUSTER_REPOSITORY_ERROR_MESSAGE,
        )

With:

    # Wait for the restore to complete.
    async with ops_test.fast_forward():
        await wait_for_idle_on_blocked(
            ops_test,
            database_app_name,
            0,
            S3_INTEGRATOR_APP_NAME,
            MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET, # <--- The constant needs to be changed to this one.
        )

The other backup tests are already passing.

Development

Successfully merging this pull request may close these issues.

Add Point-In-Time-Recovery
3 participants