
Bypass store object reuse when FODs change source description #7999

Open
tejing1 opened this issue Mar 8, 2023 · 21 comments
Labels
feature Feature request or proposal

Comments

@tejing1

tejing1 commented Mar 8, 2023

Is your feature request related to a problem? Please describe.

I think just about everyone who writes nix derivations has tripped over this problem at least once:

  • You change a source url in a fetcher (perhaps indirectly, by changing another field such as version), but forget to change the hash.
  • The build silently succeeds using the previous data, and then the problem shows up down the line, either because you find you're not actually using the version you think you are, or because the derivation fails to build on another machine later.

This problem should be conquerable, without any real compromises. It's just going to take some work.

Edit: To clarify, these are the properties this solution has, unless I'm sorely mistaken:

  • Changing fetchers while retaining the hash doesn't cause a rebuild of the reverse dependencies.
  • Changed fetchers that are inherently flawed will still run and fail, rather than silently "succeeding"... until they don't. Fail or succeed correctly!
  • If a build succeeds, and only uses FODs which set this field sanely, then you should be able to be confident it would also succeed elsewhere. Reproducibility! (well, requireFile doesn't count)
  • If you really want to avoid the redownload (say the file is just plain huge), you should be able to change fetchers without testing the new one if you use the right nix-prefetch-* incantations to add the new nix db entry to the existing store object. (We'll need to add options to those tools for this, but it should be easy enough to do so) Flexibility!

Describe the solution you'd like

Nix should accept an extra field in fixed output derivations which contains a string, opaque to nix, whose hash will be stored in the nix database, associated to the output store path, when the FOD is successfully built. Nix will not blindly reuse the store object by name if the hash of the extra field specified in the FOD being evaluated does not match any of the stored field hashes associated to that path in the nix database. It will run the FOD instead, adding the new field hash to the database if it succeeds. Nix can throw away the result of the FOD, and reuse the existing store path, but only after running it to verify that it does produce an output with the same hash as what is already present.

If this field is not specified, then no entry is produced in the nix database, nor are the existing ones checked, just like the current behavior. This is important for certain fetchers such as requireFile. Similarly, it's important to modify the various prefetching commands to produce the relevant nix database associations, so that the fetchers they are meant to be compatible with will reuse the prefetched objects without complaint.

FODs that wish to make use of this feature must determine a string that is appropriately identifying for the purposes of this determination. For most FODs, it seems quite straightforward to do this. pkgs.fetchurl, for example, should use the url itself for this field (probably prefixed with fetchurl: or something, just to avoid any accidental collision between different fetchers). For some FODs, it might make more sense to serialize a datastructure of some kind into the string. This determination does not have to be 100% rock solid, as it only comes into play when nix expression writers make a mistake. A 99% solution is plenty here.
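For a fetcher with several arguments, serializing a small data structure keyed by fetcher name is one way to get a stable identifying string. This sketch is illustrative only; the function names and format are made up, not a proposed wire format:

```python
import hashlib
import json

def source_description(fetcher: str, args: dict) -> str:
    # Sort keys so the same arguments always serialize identically;
    # the fetcher-name prefix avoids collisions between fetchers,
    # as suggested above for the "fetchurl:" prefix.
    return fetcher + ":" + json.dumps(args, sort_keys=True)

def field_hash(description: str) -> str:
    # What would be recorded in the nix database for the output path.
    return hashlib.sha256(description.encode()).hexdigest()
```

Changing any argument (say, `rev`) yields a different field hash, triggering a re-run of the FOD, while argument order has no effect.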

This design should ensure bi-directional backward compatibility, since new expressions on an old version of nix should simply set an environment variable which is not used, and a new nix evaluating old expressions will act identically due to the field not being present (assuming we name it something that avoids collision well enough). New prefetchers should also work with old expressions, since the extra database entry won't hurt anything. However old prefetchers may not work correctly for new expressions, causing the FOD to be re-run, wasting resources, but still succeeding if it should have done so.

The interaction with substituters needs some more thought. Somehow, the associated field hashes need to be communicated from the cache to the substituting nix instance, before substitution, so that it can determine whether it's acceptable to use the store object from the cache or not. I don't have a specific plan for how to communicate this, but it seems like a clearly solvable problem.

Most nix expression writers would never need to know about this change at all, as they simply use nixpkgs fetchers, and those could implement this internally. The problem would simply be fixed.

Edit2: Due to discussion below, it seems clear this is best implemented on top of ca-derivations. It's essentially a hybrid of floating ca and fixed-output derivations. You pretend you don't know the output store path in advance (even though you can actually predict it) like with floating ca, but also allow network access and enforce a specific output hash like a fixed output derivation, and you use a separate trust map in the nix database, different from the normal ca trust map, which hashes custom information provided by the fetcher, rather than everything about the inputs.

Describe alternatives you've considered

  • Including additional information in the name field of FODs instead. This causes the store paths of derivations that depend on the FOD to change when the download source is changed, which is undesirable.
  • Using the input hash of the FOD as the identifying information. This is overly specific, however, causing frequent spurious redownloads of, for example, fetchurls when a new version of curl comes out (or even just a new version of one of its dependencies).
  • Various less comprehensive heuristic solutions might be able to manage to warn the user when a possibly unintended reuse is taking place, but I believe a more comprehensive solution that "just works" is far more appropriate.

Priorities

Add 👍 to issues you find important.

@tejing1 tejing1 added the feature Feature request or proposal label Mar 8, 2023
@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/solving-the-forgot-to-change-the-hash-problem/26114/1

@Ericson2314
Member

If we do NixOS/rfcs#133, we can just use a single git hash, and then I think we side-step this problem entirely for the vast majority of cases.

@tejing1
Author

tejing1 commented Mar 8, 2023

That would certainly sidestep the problem in a significant number of cases, if I understand the implications correctly, but it still seems to me that there are a fair number of situations that would require this proposal. Dependency FODs for *2nix tools, for example. Also, many of the older open source projects are released by tarball, without relation to git, and fetched in nixpkgs with fetchurl.

@Ericson2314
Member

I will need to read what you wrote here more carefully, but so far essentially any deviation from simple output hash based caching is a lot more trouble than it sounds, because it really breaks the core Nix model.

Today, if we have something at the desired store path, we're done. Period. Simple. Any departure from that is a nightmare.

The best bet for tarballs and other such things is probably some sort of impure CA derivation, with a separate "assertion" step.

@tejing1
Author

tejing1 commented Mar 8, 2023

This really shouldn't break the model. It's essentially selectively applying nix build's --rebuild switch to the FOD based on some information in the nix db. The only functional difference is that this will sometimes fail where it didn't before, but arguably should have.

Edit: It does occur to me, however, that nix's current job scheduler may not be capable of handling this without some work, as in some cases it needs to rebuild a (possibly deep, and possibly only build-time) dependency of a derivation, but then just use or substitute an existing store object for the derivation itself.

@NobbZ
Contributor

NobbZ commented Mar 8, 2023

So instead of hash I have to remember to change another field that I need to be aware of?

What is the difference to touching/removing the hash?

@tejing1
Author

tejing1 commented Mar 8, 2023

So instead of hash I have to remember to change another field that I need to be aware of?

What is the difference to touching/removing the hash?

No, normally you'd use a nixpkgs fetcher, and with this change, those would need to be changed to automatically generate this field from their existing arguments. So with pkgs.fetchurl, this field would be automatically filled in based on the url argument.

Then if you change the url argument, but not the hash, nix would rerun the FOD, failing if it didn't produce something with the given hash.

@NobbZ
Contributor

NobbZ commented Mar 8, 2023

You said current expressions' behaviour wouldn't change, so I expected I'd always have to pass the value…

The problem with letting the fetchers generate this value:

You cannot disable the behaviour if for some reason you want to. You cannot replace a fetchFromGitHub with an equivalent fetchTarball anymore.

We are now at a point where we could just merge the PR that set the name field.

@tejing1
Author

tejing1 commented Mar 8, 2023

You said current expressions' behaviour wouldn't change, so I expected I'd always have to pass the value…

I meant the entire expression, nixpkgs version and all.

The problem with letting the fetchers generate this value:

You cannot disable the behaviour if for some reason you want to. You cannot replace a fetchFromGitHub with an equivalent fetchTarball anymore.

You can, it will just download it again, to check that your new FOD isn't broken. One extra download isn't normally really a big deal. Also, the fetchers can accept an extra optional argument to suppress or override this field if the caller wants to.

In fact, we could add options to the prefetch commands that tell them to use the existing store object without regard for this field, then generate an entry in the nix db, bypassing the redownload, if the user really wants to.

We are now at a point where we could just merge the PR that set the name field.

No, because the store path doesn't change, so you don't rebuild the reverse dependencies. Just the FOD itself. If the FOD succeeds, the rest can be reused.

@NobbZ
Contributor

NobbZ commented Mar 8, 2023

Okay, I see the difference, and to be honest, I would appreciate any solution to the problem. At the same time, though, any solution that can't transparently swap fetchers for the same content is doomed to be declined.

Perhaps __impure can be a solution? Though that will just move the hash up the chain.

@7c6f434c
Member

7c6f434c commented Mar 8, 2023

Previous discussions of other solutions, because GitHub search is useless for NixOS org:

NixOS/nixpkgs#46432
#969
NixOS/nixpkgs#153386

I think the tarballs.nixos.org one is the most relevant here (and if tarballs.nixos.org being unreasonably helpful is considered a feature, that sounds bad for the current proposal). But given that the current proposal forces just a refetch instead of a rebuild, and the cited use case for tarballs.nixos.org would work with a binary cache instead, maybe…

@tejing1
Author

tejing1 commented Mar 8, 2023

From what I've read of other discussions on the issue, I think this gets the best of both worlds, at the expense of more implementation work:

  • Changing fetchers while retaining the hash doesn't cause a rebuild of the reverse dependencies.
  • Changed fetchers that are inherently flawed will still run and fail, rather than silently "succeeding"... until they don't. Fail or succeed correctly!
  • If a build succeeds, and only uses FODs which set this field sanely, then you should be able to be confident it would also succeed elsewhere. Reproducibility! (well, requireFile doesn't count)
  • If you really want to avoid the redownload (say the file is just plain huge), you should be able to change fetchers without testing the new one if you use the right nix-prefetch-* incantations to add the new nix db entry to the existing store object. (We'll need to add options to those tools for this, but it should be easy enough to do so) Flexibility!

@tomberek
Contributor

tomberek commented Mar 9, 2023

Generally, I'm hesitant to have the DB responsible for controlling fetcher behavior.

Another approach to throw into the discussion: kolloch/nur-packages@77c0ad6#diff-3a23ad2794099664492d23369097dee6

i.e., make something critical about the fetching part of the name, and thus part of the storePath (e.g. one can use the url, version, or serialized fetcher args).
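That alternative can be sketched as follows; the naming scheme is invented for illustration. An abbreviated hash of the fetcher arguments becomes part of the derivation name, and therefore of the store path and of every referrer's store path:

```python
import hashlib

def name_with_source_tag(base_name: str, url: str) -> str:
    # An abbreviated hash of the url is folded into the name, so the
    # store path itself changes whenever the source description does,
    # and all reverse dependencies rebuild along with it.
    tag = hashlib.sha256(url.encode()).hexdigest()[:8]
    return f"{base_name}-{tag}"
```

This makes stale-hash mistakes impossible, but at the cost the next comment points out: every url change rebuilds the entire closure of referrers, even when the fetched data is identical.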

@tejing1
Author

tejing1 commented Mar 10, 2023

Saying the DB "controls fetcher behavior" in this proposal is a bit of a strange way of putting it. It's controlling whether nix runs the FOD or not, not the behavior of the FOD itself.

The problem with using the name is that the entire closure of referrers needs to be rebuilt unnecessarily if you change the source url, despite it being the exact same data. In the case of that implementation, it even needs to redownload every time there's a change to curl or stdenv.

@Ericson2314
Member

@tejing1 All the logic today says "if the store path of the dependency exists, the dependency is met". This is true both for store paths referencing other store paths, and for derivations referencing other store paths or derivation outputs. It would be painful to add a special case to that.

@Ericson2314
Member

The problem with using the name is that the entire closure of referrers needs to be rebuilt unnecessarily if you change the source url, despite it being the exact same data. In the case of that implementation, it even needs to redownload every time there's a change to curl or stdenv.

OK, if that is your design requirement, then I think your best bet is moving away from fixed-output derivations to special whitelisted derivations:

  • Special built-in builders for build-time, not eval-time, fetching.
  • Floating content-addressed output (downstream derivations do not know what the hash is in advance)
  • Impure so they can access the network, but without showing up as impure derivations (no "tainting", or in-memory temporary caching only)
  • Extra DB information is not a random extra column on the store path, but a special trust map mapping keys to store paths
    • This is basically saying pluggable trust maps for specially authenticated CA derivations rather than the default derivation-hash-keyed one, and then reusing the existing fetcher cache index.

@tejing1
Author

tejing1 commented Mar 10, 2023

@Ericson2314 Your response seems to be based on the assumption that I want to be able to change the actual fetch result without changing the store path of the things that depend on it. That's definitely not the case. It breaks reproducibility.

I just want to continue to be able to do what you can already do right now with normal nixpkgs fetchers, which is replace a FOD with a completely different implementation, so long as it produces the exact same data. Making the FOD's name overly specific prevents that.

@Ericson2314
Member

@tejing1 No, it is not based on that assumption. A floating CA derivation does not have a fixed output store path; the output store path is instead computed from the actual contents once those contents are known. This influences downstream store paths too.

The output hash would not depend on anything but the content, just as you like.

Using a different fetching derivation would require fetching that way the first time, even though no new store object ends up being produced (since there is already one at that path). I think this is also what you like.

@tejing1
Author

tejing1 commented Mar 10, 2023

Hmm, ok, I think maybe I get it this time, at least partly. You're saying the existing CA derivation system can be used with relatively little change to accomplish what I'm going for here.

I'm not sure this is meaningfully different from my proposal, though. A CA derivation normally has the property that you don't know the final store path in advance, but here you actually do, since it's still fundamentally fixed-output. The fetcher should fail if the output doesn't have the specified hash. You could pretend you don't know the output path, though, if it makes the implementation easier.

Extra DB information is not a random extra column on the store path, but a special trust map mapping keys to store paths

This was always the intention, though I may not have been clear enough about it. You need a separate table for this, not just an extra column, because a given store path can have an arbitrary number of associated field hashes. A field hash could have multiple associated store paths as well, though that would indicate that some FODs in use weren't as repeatable as they should be.
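Such a table could look roughly like the following; the table and column names are invented for illustration, shown via Python's sqlite3 against an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Many-to-many: one store path can have many verified field hashes,
    -- and (with a non-reproducible FOD) one field hash could end up
    -- associated with more than one store path.
    CREATE TABLE SourceDescriptions (
        storePath  TEXT NOT NULL,
        fieldHash  TEXT NOT NULL,
        PRIMARY KEY (storePath, fieldHash)
    );
""")

def add_entry(store_path: str, field_hash: str) -> None:
    # Recorded when a FOD with this field succeeds (or by a prefetcher).
    conn.execute(
        "INSERT OR IGNORE INTO SourceDescriptions VALUES (?, ?)",
        (store_path, field_hash))
    conn.commit()

def is_verified(store_path: str, field_hash: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM SourceDescriptions WHERE storePath = ? AND fieldHash = ?",
        (store_path, field_hash)).fetchone()
    return row is not None
```

The composite primary key gives the many-to-many shape described above while keeping repeated successful builds idempotent.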

It is essentially a "pluggable trust map", as you put it; it's just that the extent of "trust" doesn't go as far as with CA derivations. The fixed output hash is still enforced, so the map isn't really part of the security model. Only one store object can possibly be produced, regardless of the state of the map. The integrity of the map is only relevant to the questions of:

  • How sure are you that the build would have actually succeeded on a blank store?
  • How often do you re-download things you actually already have a copy of?

@Ericson2314
Member

Hmm, ok, I think maybe I get it this time, at least partly. You're saying the existing CA derivation system can be used with relatively little change to accomplish what I'm going for here.

Yup!

I'm not sure this is meaningfully different from my proposal, though. A CA derivation normally has the property that you don't know the final store path in advance, but here you actually do, since it's still fundamentally fixed-output. The fetcher should fail if the output doesn't have the specified hash. You could pretend you don't know the output path, though, if it makes the implementation easier.

Yeah, that's right. It gets the rebuilds you want, and it puts the pure and impure fetchers on the same footing. Really there are 3 options:

  • no checksum (impure fetching)
  • checksum but double check URL/etc. (floating CA, pretend we don't have checksum, assert as part of fetching)
  • checksum and cache eagerly (fixed CA, i.e. fixed-output derivation)

This was always the intention, though I may not have been clear enough about it. You need a separate table for this, not just an extra column, because a given store path can have an arbitrary number of associated field hashes. A field hash could have multiple associated store paths as well, though that would indicate that some FODs in use weren't as repeatable as they should be.

Excellent!

@lolbinarycat

hey, i've been contemplating the fixed-output derivation problem for a bit now, and after trying and failing to catch this in nixpkgs, i was about to suggest exactly this change.

one additional observation is that you don't actually have to perform the refetch before the build, as there are only two possible outcomes when changing the arguments to a fetcher without changing the hash:

  1. the source description changed, but the actual hash of the fetched source did not (e.g. you switched rev from a git tag to the commit that tag points to)
  2. the hash argument did not change, but the actual hash of the derivation's output did (this is the mistake we are trying to catch)

in case number 1, it will always be correct to just build everything with the output available in the store (current behavior)

in case number 2, all that matters is that something somewhere in nix gives an error about the mismatched hashes.

importantly, there is never a case where we need to redownload the source, notice the difference, and then use the new source tarball instead of the old one. if a difference in output is detected, an error should always be raised.

this may potentially simplify the implementation significantly.
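Under that observation, the deferred check reduces to a pure assertion: the cached store object stays authoritative, and the refetched data is discarded either way. A hypothetical sketch (helper name made up):

```python
def verify_source_change(expected_hash: str, fetch) -> None:
    """Run the changed fetcher purely as an assertion.

    fetch() performs the download and returns the hash of what it got;
    its output is never substituted for the cached store object.
    """
    actual = fetch()
    if actual != expected_hash:
        # Case 2 above: the description change altered the content.
        raise RuntimeError(
            f"hash mismatch after source change: "
            f"expected {expected_hash}, got {actual}")
    # Case 1 above: same content; nothing to do. The dependent builds
    # could even have proceeded in parallel with this check.
```

Because the check never feeds data back into the store, it can run concurrently with, or even after, the builds that consume the cached output.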
