haumea/zrepl: reduce snapshot count #447

Draft
vcunat wants to merge 5 commits into master

Conversation

vcunat commented Jun 23, 2024

Reduce snapshot count. We repeatedly run out of space on Haumea.

@@ -33,23 +33,27 @@
  };
  pruning = {
    keep_sender = [
      { type = "not_replicated"; }

vcunat commented

Oops, I expect we need to have the regex = part here as well.

vcunat commented

Or maybe not. Their example set doesn't have it:
https://zrepl.github.io/configuration/prune.html#pruning-policies

vcunat commented Jun 27, 2024

Maybe we should drop this line anyway. If the remote end is down, we probably want to keep pruning the sender to reduce the risk of running out of space. It might also reduce the time to sync up once the receiver becomes reachable again.
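
For reference, a hedged sketch of what both variants could look like in the module's Nix-style settings, mirroring the zrepl docs example linked above (the last_n/grid rules and their values here are illustrative, not this repo's actual config):

    pruning = {
      keep_sender = [
        # keeps anything not yet replicated; the docs example gives it no regex
        { type = "not_replicated"; }
        # rules like last_n / grid do take a regex to match only zrepl's snapshots
        { type = "last_n"; count = 10; regex = "^zrepl_.*"; }
      ];
      keep_receiver = [
        { type = "grid"; grid = "1x1h(keep=all) | 24x1h | 14x1d"; regex = "^zrepl_.*"; }
      ];
    };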

vcunat commented Jun 24, 2024

Let me dump a bit about why.

The problematic situations that we see involve lots of data unique to (some) in-between snapshots, i.e. (manually) dropping some of those snapshots could release lots of space. Consequently:

  • making/transferring snapshots less often should decrease the total transfer amount and reduce this pressure on the remote
  • larger spacing between snapshots kept on Haumea should decrease the total space needed there, perhaps even if we didn't significantly decrease the total time span covered by snapshots

EDIT:

  • an interesting note: those problematic snapshots also seem to have a larger total size (i.e. even if we didn't have any snapshotting, the disk usage would be larger at those points)

vcunat commented Jun 25, 2024

🤔 As for the backup location(s), it feels wasteful to keep every week uniformly for a year. Can you see any reason for it? I'd intuitively again go for some exponentially increasing spacing. I assume we can afford more space there than on Haumea itself, so e.g. this slower grid?

              "2x1h"
              "2x2h"
              "2x4h"
              "4x8h"
              # At this point the grid spans 2 days (-2h) by 10 snapshots.
              # (See note above about 8h -> 24h.)
              "2x1d"
              "2x2d"
              "2x4d"
              "2x8d"
              "2x16d"
              "2x32d"
              "2x64d"
              "2x128d"
              # At this point we keep 26 snapshots spanning 384--512 days (depends on moment),
              # with exponentially increasing spacing (almost).

Perhaps worth noting (per the docs) that the specified intervals do not overlap. All the intervals are stacked in the specified order and multiplicity, forming a fixed grid.
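
A quick sanity check of those span numbers, as a hedged sketch in plain Nix (not part of the config; nothing here comes from the actual file):

    # evaluate with e.g.: nix-instantiate --eval --strict span.nix  (file name arbitrary)
    let
      hoursPart = 2 * 1 + 2 * 2 + 2 * 4 + 4 * 8;             # first 10 buckets: 46 h (2 days minus 2 h)
      daysPart = 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128);   # remaining 16 buckets: 510 days
    in {
      buckets = 10 + 16;                                      # at most 26 snapshots kept
      maxSpanDays = daysPart + hoursPart / 24;                # ~512 days when the oldest bucket is occupied
      minSpanDays = daysPart + hoursPart / 24 - 128;          # ~384 days: the oldest kept snapshot may sit
                                                              #   anywhere within the final 128-day bucket
    }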

"1x2h"
"1x4h"
# "grid" acts weird if an interval isn't a whole-number multiple
# of the previous one, so we jump from 8h to 24h

vcunat commented


Not sure if it's worth trying to explain the weirdness. I base it not on actual experience but on the definition in their docs – and on how such a model then behaves when running continuously.
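
Reading the docs that way, a hedged illustration of the suspected constraint (not verified against zrepl itself):

    # fine: each interval is a whole-number multiple of the previous one
    #   ... "2x4h" "4x8h" "2x1d" ...    # 8h = 2 * 4h, 1d = 24h = 3 * 8h
    # suspect: bucket boundaries would not line up on one fixed grid
    #   ... "2x4h" "4x8h" "2x20h" ...   # 20h is not a multiple of 8h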

Reduce snapshot count. We repeatedly run out of space on Haumea.
1/100 of defaults seemed excessive and was suspected to cause issues; changed to 1/10 of defaults.
This should be fine, as we have a faster connection to the receiver, and the churn doesn't seem so significant anymore anyway.