builds started failing on Hydra's new hash-named x86 machines #64126

Closed
vcunat opened this issue Jul 2, 2019 · 11 comments
Labels: 0.kind: regression · 1.severity: blocker

Comments

vcunat commented Jul 2, 2019

A couple of days ago, i686 NixOS tests started failing consistently, e.g. https://hydra.nixos.org/build/95616203. I can't reproduce the problem locally, and apparently there's something different about those hash-named build machines (which might have been added, or at least changed, around that time too).

i686 tests aren't too important nowadays, I suppose, but we could at least do something simple, e.g. remove the i686 platform tag from these machines.

veprbl added the 0.kind: regression label on Jul 2, 2019
vcunat commented Jul 7, 2019

Eh, another problem, likely related and much worse – those machines quite often end an x86_64-linux build with:

checking for references to /build/ in /nix/store/gwawakcjhr48xgf04dhc16fkhw4xdnng-automake-1.15...
invalid ownership on file '/nix/store/gwawakcjhr48xgf04dhc16fkhw4xdnng-automake-1.15/bin/aclocal-1.15'

Again, I could never reproduce these.
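
For context on the message: a Nix build output is expected to consist entirely of files owned by the user that ran the build, and anything else gets flagged. The snippet below is only a rough Python sketch of that idea (the real check happens inside Nix itself, not in a script like this); the path and uid arguments are placeholders.

import os
import sys

def find_unexpected_owners(store_path, expected_uid):
    # Walk the output path and yield anything not owned by expected_uid.
    # lstat so that symlinks themselves are checked, not their targets.
    for root, dirs, files in os.walk(store_path):
        for name in dirs + files:
            path = os.path.join(root, name)
            st = os.lstat(path)
            if st.st_uid != expected_uid:
                yield path, st.st_uid

if __name__ == "__main__":
    # Placeholder arguments: on a Hydra builder the expected owner would be
    # the nixbld user that ran the build.  Both defaults here are examples only.
    path = sys.argv[1] if len(sys.argv) > 1 else "."
    uid = int(sys.argv[2]) if len(sys.argv) > 2 else os.getuid()
    for p, owner in find_unexpected_owners(path, uid):
        print("invalid ownership on file '%s' (owned by uid %d)" % (p, owner))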

/cc @NixOS/rfc-steering-committee. I don't really know whom to ping, but some of them certainly should know about those new Hydra machines.

vcunat changed the title from "i686 nixos tests started failing on Hydra" to "builds started failing on Hydra's new hash-named x86 machines" on Jul 7, 2019
vcunat commented Jul 7, 2019

Apparently this currently blocks larger rebuilds, even after multiple restart attempts; see e.g. this build. /cc @FRidh, who deals with staging-next a lot, so he knows there's a thread for this.

FRidh commented Jul 7, 2019

@grahamc have you seen this?

grahamc commented Jul 7, 2019

No, I haven't seen this.

grahamc commented Jul 7, 2019

These hash-named x86 machines have Intel Scalable Gold 5120 CPUs and are transient, so it is a bit lucky that this exact build's machine still exists.

However, unlucky because I can't log in to it:

$ ssh [email protected]
[SOS Session Ready. Use ~? for help.]
[Note: You may need to press RETURN or Ctrl+L to get a prompt.]

nixos login: grahamc
^C
[grahamc@Petunia:~]$ ssh [email protected]
^C

Evidently something very strange happened to it. I've since destroyed that server.

I picked up another one of the machines (b5b77143) which is alive, and boy did something stick out to me!

Look at this selection from top:

top - 10:38:34 up 8 days, 23:38,  1 user,  load average: 4.55, 4.39, 3.99
Tasks: 721 total,   2 running, 717 sleeping,   0 stopped,   2 zombie
%Cpu(s):  2.4 us,  3.0 sy,  0.0 ni, 94.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 385660.9 total,  64843.6 free, 112842.9 used, 207974.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 266543.8 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  809 root      20   0 8775624   2.7g    728 S 134.4   0.7   4421:02 unionfs
35199 nixbld10  20   0   15456  10492   2488 S  25.8   0.0   0:01.23 perl
35734 nixbld14  20   0   74632  49628  11960 R   8.3   0.0   0:00.25 cc1plus

grahamc commented Jul 7, 2019

How much would you bet unionfs is the problem? :)

These machines are spot instances and transient, so they never fully "install". In the x86 case I accidentally left the / mount as a unionfs of the netboot image and a mount point on disk. I've killed these problematic x86 machines until I can fix this by moving the netboot'd / onto the actual disk to avoid unionfs problems.
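
For anyone debugging a similar setup: one quick way to see whether a builder's / is still a layered mount rather than a plain on-disk filesystem is to read /proc/self/mounts. A small illustrative sketch in Python, where the set of "layered" fstype names is an assumption rather than an exhaustive list:

LAYERED_FS_TYPES = {"unionfs-fuse", "fuse.unionfs", "overlay", "aufs"}

def root_fs_type():
    # /proc/self/mounts lists mounts in mount order; keep the last entry
    # for "/" since that is the one actually visible on top.
    fstype = None
    with open("/proc/self/mounts") as mounts:
        for line in mounts:
            _dev, mount_point, fs = line.split()[:3]
            if mount_point == "/":
                fstype = fs
    return fstype

if __name__ == "__main__":
    fs = root_fs_type()
    if fs is None:
        print("could not determine the filesystem type of /")
    else:
        print("/ is backed by %s%s" % (fs, " (layered!)" if fs in LAYERED_FS_TYPES else ""))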

grahamc commented Jul 7, 2019

Also, I'm sorry for breaking it, and not noticing sooner. Thank you @vcunat for tracking it down, and thank you @FRidh for the ping!

vcunat commented Jul 8, 2019

@grahamc: thanks for the quick reaction. I should've tried to mention you directly; now I'll know who knows best about these builders, too.

The aarch64-linux ones also suffer from this, apparently: this build (step 6).

grahamc commented Jul 8, 2019

Yes, indeed. I have terminated those now as well. Same problem with unionfs. I'm traveling this week, which makes it a bit trickier to fix and re-launch these instances, but I'll give it a go!

Thank you for the heads up.

grahamc commented Jul 8, 2019

I have updated the filesystem layout, and / is now a ZFS filesystem of its own, with no layering. I'll be re-launching the aarch64 and x86 builders with this new mechanism.

FRidh commented Jul 10, 2019

I think the issue has been resolved, so I'm closing it.

FRidh closed this as completed on Jul 10, 2019