Self-hosted releases.nixos.org #408

Open
delroth opened this issue Apr 10, 2024 · 6 comments
Labels
new-service Request for a new service to be ran on the NixOS infra

Comments

@delroth
Contributor

delroth commented Apr 10, 2024

Assuming that 100% of the eu-west-1 S3 bill is releases.nixos.org (there are a few other minor buckets, e.g. tarballs.nixos.org), this mostly static ~32TB of data is costing us between $1.5K and $3.5K per month right now [1].

This is IMO a perfect opportunity to start ramping up S3 self-hosting for the NixOS infra. Unlike cache.nixos.org:

  • This is small enough to not require significant investment to migrate.
  • This is all derivative data that does not have too high durability risks (worst case, we should be able to reconstruct it).
  • All writes to the bucket are operated by the channel scripts, not via the Nix codebase, which means we have more control to easily do things like e.g. dual-write.
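As an illustration of the dual-write idea in the last bullet, here is a minimal sketch. It assumes a hypothetical `ObjectStore` interface with a single `put` method; the class names, the in-memory backend, and the example key are all invented for illustration and are not taken from the actual channel scripts.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface; not the channel scripts' real API."""

    def put(self, key: str, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in backend so the sketch runs without any S3 credentials."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data


class DualWriteStore:
    """Writes every object to the primary first, then to the secondary."""

    def __init__(self, primary: ObjectStore, secondary: ObjectStore) -> None:
        self.primary = primary
        self.secondary = secondary

    def put(self, key: str, data: bytes) -> None:
        # If the primary write raises, the secondary is never touched,
        # so the secondary never receives a write the primary rejected.
        self.primary.put(key, data)
        self.secondary.put(key, data)


s3 = InMemoryStore()           # stands in for the existing S3 bucket
self_hosted = InMemoryStore()  # stands in for the self-hosted store
store = DualWriteStore(s3, self_hosted)
store.put("example-channel/example-file", b"example payload")  # hypothetical key
```

Writing to the existing bucket first keeps it authoritative during a migration: a failure on the new backend surfaces as an error without leaving the two stores in an ambiguous order.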

2x SX134 at Hetzner with 10Gbps uplinks would cost us ~$659/month for 2x {2 x 960 GB flash + 10 x 16 TB hard drives (128TB with 2-disk failure tolerance)}. We could then also use the extra capacity in the future to consider self-hosting parts of cache.nixos.org to offset bandwidth costs.
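The capacity and cost figures above can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: it treats 2-disk failure tolerance as costing two disks' worth of capacity per server, computes per-server capacity only (the thread does not say whether the two servers would mirror each other or pool their space), and ignores CDN and operational costs.

```python
# Figures quoted in the comment above.
DISKS_PER_SERVER = 10
DISK_TB = 16
HETZNER_USD_PER_MONTH = 659       # for both servers together
S3_USD_PER_MONTH = (1500, 3500)   # low and high end of the current S3 bill
DATA_TB = 32

# Assumption: tolerating 2 failed disks costs two disks' worth of capacity.
PARITY_DISKS = 2

raw_tb = DISKS_PER_SERVER * DISK_TB                       # raw capacity per server
usable_tb = (DISKS_PER_SERVER - PARITY_DISKS) * DISK_TB   # usable capacity per server
headroom = usable_tb / DATA_TB                            # how many copies of the data fit

# Monthly difference against the low and high S3 estimates.
savings = tuple(cost - HETZNER_USD_PER_MONTH for cost in S3_USD_PER_MONTH)

print(f"raw: {raw_tb} TB, usable: {usable_tb} TB per server")
print(f"headroom over {DATA_TB} TB of data: {headroom:.1f}x")
print(f"monthly difference vs S3: ${savings[0]} to ${savings[1]}")
```

Under these assumptions each server has 128 TB usable, about four times the current ~32TB data set, which matches the 128TB figure quoted above.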

We would still CDN this via Fastly.

Footnotes

  1. The cost is rapidly increasing from factors that look organic in nature, but more analysis might be able to find artificial sources that are increasing the costs on S3. In any case, $1.5K looks like the minimal baseline cost we could get down to.

delroth added the new-service label Apr 10, 2024
@edolstra
Member

The current plan is to reduce the S3 bill for releases.nixos.org by expunging old releases (see #397). Self-hosting sounds very risky to me. Releases are not in fact easy to reconstruct. If the release server were to die entirely, there is no way we can feasibly reconstruct it.

Self-hosting could be an option for releases that have been removed (or glaciered) from releases.nixos.org.

@delroth
Contributor Author

delroth commented Apr 10, 2024

> The current plan is to reduce the S3 bill for releases.nixos.org by expunging old releases

But... this does nothing for data transfer costs, which are 85% of the S3 bill? Am I missing something?

[image]

> Releases are not in fact easy to reconstruct. If the release server were to die entirely, there is no way we can feasibly reconstruct it.

Can you be more clear here? What cannot be reconstructed? Channel scripts just fetch from Hydra + cache (from where AFAICT the data is not being removed) and run two data extractor programs (nix-index + nix-generate-debuginfo) which, while slightly annoying, don't seem like they would be majorly difficult to re-run on old data.

@delroth
Contributor Author

delroth commented Apr 10, 2024

Also, do you realize the contradiction in claiming this as a problem:

> If the release server were to die entirely, there is no way we can feasibly reconstruct it.

But then also suggesting:

> The current plan is to reduce the S3 bill for releases.nixos.org by expunging old releases

What data loss risk do you actually care about if you're suggesting deleting 75% of the data?

@edolstra
Member

Reconstruction cannot depend on cache.nixos.org, since we're going to GC that too. In particular ISOs etc. will be deleted.

Also, our disaster recovery cannot involve running some script that doesn't exist and that would take days to run.

> What data loss risk do you actually care about if you're suggesting deleting 75% of the data?

I care about the releases that we don't expunge (and the ones that we do would be on Glacier, so we can always bring them back).

The bandwidth increase in eu-west-1 is weird, since in March 2023 it was 1832 GB ($54.96) and as recently as October 2023 it was just 1865 GB ($167.93). So the increase to 34204 GB ($2958.59) is hard to explain. Maybe there is a Fastly misconfiguration that is causing the CDN to be less effective?
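As a quick sanity check on these figures, the sketch below derives the effective price per GB and the growth factors from the three data points quoted above. It uses only numbers from this comment; the month of the last figure is not stated in the thread, so it is labelled as such, and nothing here explains why the traffic grew.

```python
# (GB transferred, USD billed) as quoted in the comment above.
# The month of the last figure is not stated in the thread.
figures = {
    "March 2023": (1832, 54.96),
    "October 2023": (1865, 167.93),
    "latest (month not stated)": (34204, 2958.59),
}

# Effective price per GB for each data point.
rates = {label: usd / gb for label, (gb, usd) in figures.items()}

# Growth between October 2023 and the latest figure.
oct_gb, oct_usd = figures["October 2023"]
new_gb, new_usd = figures["latest (month not stated)"]
volume_growth = new_gb / oct_gb
cost_growth = new_usd / oct_usd

for label, rate in rates.items():
    print(f"{label}: ${rate:.3f}/GB")
print(f"volume x{volume_growth:.1f}, cost x{cost_growth:.1f}")
```

On these numbers the effective rate roughly tripled between March and October 2023 at near-constant volume (about $0.03 to about $0.09 per GB), while the later jump is almost entirely a volume increase of roughly 18x at a similar per-GB rate.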

@delroth
Contributor Author

delroth commented Apr 10, 2024

> Reconstruction cannot depend on cache.nixos.org, since we're going to GC that too. In particular ISOs etc. will be deleted.

OK, but then you also don't care about reconstruction, since you're deleting the original data. Note that I still think that's a terrible idea for stuff that's linked to a channel bump (and only stuff that was a channel version at some point would be on releases.nixos.org). cc @edef1c because I was not under the impression that this was the plan.

> Also, our disaster recovery cannot involve running some script that doesn't exist and that would take days to run.

I really don't see why not. Unlike the cache, releases.nixos.org is not in much of a critical path, and the only stuff that really needs to recover quickly and have high availability would be the latest version for each channel.

> Maybe there is a Fastly misconfiguration that is causing the CDN to be less effective?

Not that I can tell, and there have been no configuration changes since Sept 2023. I'm waiting for @zimbatm to provision me the right AWS access to look through the Athena logs.

@zimbatm
Member

zimbatm commented May 21, 2024

@edef1c is this still relevant?
