Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recursion in builders definitions deadlocks builds with waiting for lock on '/nix/store/...' #10740

Open
tomeon opened this issue May 19, 2024 · 5 comments
Labels
bug idea approved The given proposal has been discussed and approved by the Nix team. An implementation is welcome. protocol Things involving the daemon protocol & compatibility issues scheduling

Comments

@tomeon
Copy link

tomeon commented May 19, 2024

Describe the bug

I've been trying to set up remote builders following the NixOS wiki's "Distributed build" article and other resources. Because none of the machines in my home lab are particularly beefy, and I want to recruit as much compute as I can, I've got cycles in the builders graph; that is, machines B and C appear in the builders definition for machine A, machines A and C appear in the builders definition for machine B, and machines A and B appear in the builders definition for machine C (and so on).

This appears to lead to deadlocks in builds, one symptom of which is the appearance of warning messages following the pattern waiting for lock on '/nix/store/...'.

If I initiate a build on machine A, I can observe that it starts a nix-daemon --stdio process on machine B, and that machine B in turn starts a nix-daemon --stdio process on machine A.

Steps To Reproduce

On machine A, add machine B to the builders definition in /etc/nix/nix.conf. Similarly, on machine B, add machine A to the builders definition in /etc/nix/nix.conf. Then, execute a nontrivial build (e.g. rebuild a NixOS machine configuration).

Expected behavior

I would expect machines defined in builders to either:

  1. Only run builds locally and not farm out work to their own builders machines, or
  2. Eliminate the build-initiating machine from the machines in builders, if it appears there.

nix-env --version output

$ nix-env --version
nix-env (Nix) 2.18.1

This is the same on the machine that initiates builds as well as on the machines in builders.

Additional context

The proximate cause of this issue may be the same as in #2029.

Priorities

Add 👍 to issues you find important.

@tomeon tomeon added the bug label May 19, 2024
@fricklerhandwerk fricklerhandwerk added protocol Things involving the daemon protocol & compatibility issues scheduling labels May 22, 2024
@fricklerhandwerk
Copy link
Contributor

fricklerhandwerk commented May 22, 2024

Triaged in Nix team meeting:

  • @roberth: currently ofBorg doesn't have remote builders because they would suffer from the same issue; this makes NixOS passthru tests harder for macOS (you'd have to remove them or they would fail, both of which would be unfortunate)
  • @Ericson2314: why trying to make it cyclic to begin with?
  • @fricklerhandwerk we could at least write the fact into documentation... it doesn't seem like a serious enough problem to invest energy into fixing it.
  • @roberth: an easy to implement fix would be passing builders = "" through the daemon protocol, and there would be no transitive remote building, maybe even optionally (if you want to send builds to online services, which would pass it on)
    • the ideal solution would of course be implementing cycle detection
  • @Ericson2314: related an ~exhaustive list of issues with remote building in case one starts attacking the broader problem space
  • contributions welcome for this!

@fricklerhandwerk fricklerhandwerk added the idea approved The given proposal has been discussed and approved by the Nix team. An implementation is welcome. label May 22, 2024
@bryanhonof
Copy link
Member

#1914 is maybe relevant?

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2024-05-22-nix-team-meeting-minutes-147/45835/1

@tomeon
Copy link
Author

tomeon commented May 23, 2024

Wow, thanks for the prompt attention :)

@Ericson2314 -

[W]hy trying to make it cyclic to begin with?

I would like to be able to initiate builds from any one of the machines A, B, C, etc., and recruit the other machines as builders. I am not trying to make it cyclic; that's just an effect of the remote builder implementation that I didn't anticipate.

@fricklerhandwerk -- it's almost certainly a niche issue, though as you say it would still be nice to document it. As the above-linked "Developing a system that replaces nix remote build" Discourse thread documents thoroughly, the current remote builder implementation gives you not just the banana (having remote machines build stuff) but the gorilla that's holding it, and the entire jungle besides (IIUC, something like the complete Nix configuration on the remote builders, including their builders settings, certain timeout settings, etc.). This is surprising, at least to me.

[C]ontributions welcome for this!

Haven't touched C++ since college; maybe time to see how badly those muscles have atrophied 😅

To confirm, contributions are welcome for any/all of the following?:

  1. Documenting this wrinkle of remote building,
  2. (Optionally?) passing builders = "" through the daemon protocol, and
  3. Implementing remote builder cycle detection.

@roberth
Copy link
Member

roberth commented May 23, 2024

it's almost certainly a niche issue

I don't think it is; see my comment about the ofborg nodes not having remote builders. That issue is soft-blocked because some others were aware of this problem.

are welcome

1 and 2 are definitely welcomed.

I wouldn't recommend 3, because it is a lot more work to design develop and review, whereas I believe 2 is much simpler and covers the current use cases.

Perhaps 3 is best reframed as part of a larger project to make remote building more self-configuring and/or dynamic. For instance, the scheduler doesn't discover the remote metadata such as system features. It seems that those require similar solutions, or changes in the protocol etc that benefit all scheduling related info.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug idea approved The given proposal has been discussed and approved by the Nix team. An implementation is welcome. protocol Things involving the daemon protocol & compatibility issues scheduling
Projects
None yet
Development

No branches or pull requests

5 participants