
[RFC 0122] IPFS CID optionally on narinfo in binary caches #122

Closed
wants to merge 2 commits

Conversation


@lucasew commented Mar 7, 2022

The idea is to optionally provide the CID of the NAR file in the binary cache's narinfo, to reduce bandwidth costs and in some cases increase efficiency by letting users download the binary cache's NAR files over IPFS.

Rendered

This RFC was abandoned by the author: their primary goal was saving upstream bandwidth in a controlled, very limited network with many computers, and simpler solutions using the existing binary cache infrastructure, such as a local cache, were found.

@Ericson2314
Member

We should be able to use the existing CA field for this. That has many other benefits, too. That is what we did in our IPFS Nix work.


IPFS is still not a present reality in the mainstream Nix ecosystem. Although it's not reliable for storing data long term, it can reduce bandwidth costs for both servers and clients; the question is where the NAR file could be obtained on IPFS.

It's not expected that, for example, cache.nixos.org would run an IPFS daemon for seeding, but it could simply calculate the hash using `ipfs add -nq $file` and provide it in the narinfo, so that other nodes can figure out alternative places to download the NAR files from, possibly even closer than a CDN.
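
For illustration, a narinfo carrying such a field might look like this (the `IpfsCid` field name is hypothetical and the values are placeholders; the actual field name would be settled by the RFC):

```
StorePath: /nix/store/<hash>-example-1.0
URL: nar/<filehash>.nar.xz
Compression: xz
FileHash: sha256:<...>
FileSize: <...>
NarHash: sha256:<...>
NarSize: <...>
References: <...>
Sig: cache.example.org-1:<...>
IpfsCid: <CID as printed by ipfs add -nq>
```
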
Contributor

One little concern is that a given file doesn't have exactly one CID. Depending on how you chunk the file you can get effectively unlimited different CIDs. This isn't a problem when the CID distributor starts the seed and the CID stays live on the network, because whatever CID is advertised will be fetched. However, for a case like this it matters a lot, because different settings will result in a would-be seeder generating the wrong CID.

IIUC the current default for `ipfs add` is fixed-size blocks of 262144 bytes each (aka `size-262144`). However, for a nixpkgs cache, where subsequent versions of a derivation may be largely similar, it may make more sense to use a smarter chunker based on a rolling hash.

Anyways, the exact chunking mechanism is bikeshedding, but what do we want to do about this? I see a few main options (a concrete example of the chunker dependence follows the list).

  1. Put the chunker into the narinfo so it can be reproduced. (I don't know if there is a well-defined standard format, but current go-ipfs uses strings like `size-262144` and `rabin-2048-65536-131072`, which are pretty easy to understand and unlikely to be ambiguous.)
  2. Declare a chunker upfront and expect people to use it. (We can revert to 1 in the future by adding the chunker information later).
  3. Convince cache.nixos.org to also run an IPFS node that advertises the CIDs that are advertised in the narinfo files.
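
To make the chunker dependence concrete: go-ipfs exposes the chunker via the `--chunker` flag of `ipfs add`, and hashing the same file with different chunkers yields different CIDs (the file name here is just an example):

```sh
# -n (--only-hash) computes the CID without adding the file to the local node;
# -q (--quiet) prints only the CID.
ipfs add -nq --chunker=size-262144 example.nar
ipfs add -nq --chunker=rabin-2048-65536-131072 example.nar
# The two commands print different CIDs for the same file.
```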

Member

rsync has a pretty interesting algorithm for syncing files (https://stackoverflow.com/questions/1535017/rolling-checksums-in-the-rsync-algorithm); there may be something in that, though it's probably not directly portable to IPFS and its chunking.

I'd vote for 3! Get that working today (or perhaps tomorrow) and think about options 1/2 the day after tomorrow (or at some point in the future).

Thanks for your detailed analysis of this; my understanding of NARs on IPFS has increased!

@kevincox Mar 12, 2022
Contributor

This is basically equivalent to the Rabin chunking. But the biggest problem isn't what algorithm to use but how to know what algorithm was used.

Author

For this we could do it like we already do with hashes, e.g. `sha256:something`.

AFAIK IPFS has symbol-friendly names for the chunking methods.
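
For example, the field value could carry the chunker name as a prefix, analogous to the `sha256:` prefix (the field name and CID here are placeholders):

```
IpfsCid: size-262144:<CID>
```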

Contributor

I really don't care about the chunking algorithm. Please stop discussing this here.

What I care about is that we record the chunking algorithm in a way that someone who wishes to advertise this path can do so.

@edolstra added status: new and status: open for nominations (Open for shepherding team nominations) and removed status: new labels Mar 23, 2022
@edolstra
Member

This RFC is now open for shepherd nominations!

@Ericson2314
Member

I suppose I could shepherd this, but really I want to, and should soon be able to, write a counter-proposal RFC for the work we did in 2020. So perhaps there ought to be one shepherd team for the two "competing" RFCs (though it's really more about prioritizing features than actual disagreement).

@tomberek
Contributor

I'll volunteer as shepherd. (note from RFCSC: need a few more nominations in the next few weeks, otherwise this will be put on standby)

@Ericson2314
Member

Ericson2314 commented Apr 20, 2022

I am now thinking this is probably fine as a complement.

We did a lot of different things in our 2020 IPFS × Nix saga, but the thing I would like to focus on first is distributing and archiving source code. Conversely, this is mainly about build artifacts. Thus, no conflict! I am confident the two approaches will bore the "tunnel" from both ends, and so there will be a grand meeting in the middle eventually.

The one thing I would do is generalize: instead of thinking of IPFS in particular, we think of the "narinfo" (the C++ type ValidPathInfo) as having a list of "auxiliary" content addresses useful for fetching via other systems. They are "auxiliary" in the sense that they don't affect how the store path is computed. In fact, we can retcon today's NAR hash as just another auxiliary content address!
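
A rough sketch of that generalization in today's line-oriented narinfo, assuming a hypothetical repeatable `AuxContentAddress` field (the field name and value syntax are illustrative, not a settled design); under this framing, today's `NarHash` is itself just one such content address:

```
NarHash: sha256:<...>
AuxContentAddress: ipfs:size-262144:<CID>
```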

@Ericson2314
Member

Ericson2314 commented Apr 20, 2022

You might take a look at NixOS/nix#3727, whose merge conflicts I have fixed locally. (Tests, however, are broken. Still debugging, so I didn't push yet.)

That goes a few steps further, trying to put the narinfos in IPFS as IPLD rather than as files, but this should be complementary:

  • We should make a new JSON narinfo format for "regular" binary caches too, as the current line-oriented file makes backwards-compatible evolution too hard; a sketch follows below. (We can simply upload both types of narinfo for compatibility with old versions of Nix if we like.)

If we do that we can also share lots of code between both approaches:

  • All the "how do I talk to IPFS" legwork can of course be shared.
  • The JSON serialization can be shared between a "native" IPLD narinfo and the legacy file version.
  • Code to deal with getting the file data to/from IPFS can be shared.
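
As mentioned above, here is a hypothetical JSON narinfo along these lines, with the auxiliary content addresses from the earlier comment folded in (all field names are illustrative, not a proposed standard):

```json
{
  "storePath": "/nix/store/<hash>-example-1.0",
  "url": "nar/<filehash>.nar.xz",
  "compression": "xz",
  "narHash": "sha256:<...>",
  "references": [],
  "auxContentAddresses": [
    { "system": "ipfs", "chunker": "size-262144", "cid": "<CID>" }
  ]
}
```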

@kamadorueda
Member

I nominate myself as a shepherd

@Ericson2314
Member

Ericson2314 commented Apr 21, 2022

Looks like we have the required number! :)

@lheckemann added the status: in discussion label and removed the status: open for nominations (Open for shepherding team nominations) label May 4, 2022
@Ericson2314
Member

#nix-rfc-122:matrix.org

@edolstra
Member

Any updates on the status of this RFC?

@lucasew
Author

lucasew commented Jun 15, 2022

We (or I) need to build a proof of concept. Maybe we will pivot this RFC to an LRU-based cache proxy approach at first and iterate towards a p2p approach if necessary, but I don't have time to test it now; I am very busy because of the end of the semester.

The plan is to apply that prototype in an organization to reduce internet usage for things people often need, so the prototype should be working by the end of the year, or I definitely will not get my degree by the end of the year xD.

@lheckemann
Member

Sounds good! On behalf of the Steering Committee, I'd like to suggest moving the RFC to draft status until then --- any objections?

@lucasew marked this pull request as draft June 29, 2022 13:29