ghc won't bootstrap on aarch64-linux anymore #97407

Closed
vcunat opened this issue Sep 7, 2020 · 57 comments
Labels
0.kind: bug, 0.kind: regression, 6.topic: haskell

Comments

@vcunat
Member

vcunat commented Sep 7, 2020

GHC bootstrap on aarch64 is broken. I can't see any relevant change (introduced somewhere within #97146), and I can reproduce the segfault on the shared box. Note that the newly branched-off 20.09 is also affected.

I'm a bit sorry about letting such a big regression (in terms of package count) into master, but we also wanted to quickly fix the nixos-unstable channel and unblock the 20.09 process.
Maintainers: @MarcWeber, @kosmikus, @peti

@vcunat
Member Author

vcunat commented Sep 8, 2020

Bisected to patchelf update: f38ed04

@vcunat vcunat mentioned this issue Sep 8, 2020
10 tasks
@domenkozar
Member

Maybe we can try reverting b930b2d

@vcunat
Member Author

vcunat commented Sep 8, 2020

According to a quick test, that would only delay the segfault to later during bootstrap (hscolour, happy).

@domenkozar
Member

That's quite sad, as one of the major reasons for a release was better aarch64 support.

cc @delroth for ideas

@delroth
Contributor

delroth commented Sep 8, 2020

I'll look into it. For now can this be worked around by having ghc built with an older patchelf, or is this also breaking other binaries built with ghc?

Would be helpful if you could attach an ELF that works pre-patchelf, the patchelf invocation, and the resulting segfaulting binary (I imagine building ghc takes forever, and I only have a "weak" ARMv8 builder machine.)

@vcunat
Member Author

vcunat commented Sep 8, 2020

Overriding patchelf just for the single build is not enough, unfortunately.

@domenkozar
Member

domenkozar commented Sep 8, 2020

@delroth building the binary package should be really fast as it only downloads the binary and patchelfs it so it can run with Nix. See https://nix-cache.s3.amazonaws.com/log/wm7l4xaq4jk8w5r9kscb4h6cibjqccjz-ghc-8.6.5-binary.drv

@delroth
Contributor

delroth commented Sep 9, 2020

Interestingly, the segfault doesn't reproduce on my machine with 64K pagesize. That's... very unexpected, I can't really think of how this would happen (the opposite is usually the problem, since alignment requirements are more restrictive). I'll spin up a VM on EC2 for repro I guess.

@vcunat
Member Author

vcunat commented Sep 9, 2020

[vcunat@aarch64:~]$ getconf PAGE_SIZE
4096

EDIT: though I'm not sure whether that command really shows the number you need.
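
For reference, `getconf PAGE_SIZE` does report the page size of the running kernel (the same value as `sysconf(_SC_PAGESIZE)`), which is the 4K-versus-64K distinction being asked about; it says nothing about the alignment recorded inside an ELF file. A minimal sketch of the check, where the `page_size` helper name is made up for illustration:

```shell
# Hypothetical helper: print the running kernel's page size in bytes.
# PAGESIZE and PAGE_SIZE are two names for the same getconf variable.
page_size() {
  getconf PAGESIZE
}

case "$(page_size)" in
  4096)  echo "4K pages" ;;
  65536) echo "64K pages" ;;
  *)     echo "other page size: $(page_size) bytes" ;;
esac
```

On the shared box above this would take the 4K branch, since the reported value is 4096.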

@aaronjanse
Member

Heads up: this segfault is reproducible in binfmt running on x86_64, cross-compiling to aarch64.

@peti
Member

peti commented Sep 18, 2020

I'm trying to compile a proper ghc compiler with f38ed04 reverted just to see what happens. I'll probably have a result in a couple of hours. My Raspberry Pi 4b is on it ...

@vcunat
Member Author

vcunat commented Sep 18, 2020

I think it might need a patchelf override for all ghc-built packages (certainly for some others during ghc bootstrap), at least until the patchelf bug gets found and fixed. The problem is that I saw no way of doing such a wider override.

@domenkozar
Member

What if we only need to set the correct PAGE_SIZE in Nix builds?

@delroth
Contributor

delroth commented Sep 19, 2020

The thing I don't get here is that the new patchelf behavior has no reason to produce binaries that can't run on 4K pages ARMv8 systems. Generating ELF files with 64K section alignment is what binutils does, what LLVM does, and how ARMv8 binaries are shipped in pretty much every single distro in the world. I'm still trying to find time to debug this.
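
The arithmetic behind this expectation: any offset or address that is a multiple of 64K (0x10000) is also a multiple of 4K (0x1000), so a 64K-aligned layout satisfies a 4K-page kernel, while the reverse does not hold. A small sketch in shell arithmetic; `is_aligned` and the sample addresses are illustrative only and are not taken from patchelf, binutils, or the GHC binary:

```shell
# Illustrative helper: succeed if value $1 is a multiple of alignment $2.
is_aligned() {
  [ $(( $1 % $2 )) -eq 0 ]
}

# A 64K-aligned address also meets the 4K requirement...
is_aligned 0x410000 0x1000  && echo "0x410000 is 4K-aligned"
is_aligned 0x410000 0x10000 && echo "0x410000 is 64K-aligned"
# ...but a merely 4K-aligned address fails the 64K check.
is_aligned 0x401000 0x10000 || echo "0x401000 is not 64K-aligned"
```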

@delroth
Contributor

delroth commented Sep 19, 2020

I'm managing to reproduce the crash by taking the ghc binary pre-patchelf and running patchelf --set-rpath twice on it. The crash doesn't even happen in ghc, it's ld-linux that can't process the rpath for some reason.

[nix-shell:~]$ ldd /nix/store/413gmmlyhiik8ckxbhm8wvk0fqc3nclh-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc
Segmentation fault (core dumped)

But that's also reproducible on patchelf all the way to 0.9...

[nix-shell:~/patchelf]$ git checkout 0.9
Previous HEAD position was e1e39f3 Update release.nix
HEAD is now at 44b7f95 Update README

[nix-shell:~/patchelf]$ autoreconf -is && ./configure && make
...
make[1]: Leaving directory '/home/ubuntu/patchelf'

[nix-shell:~/patchelf]$ cp /nix/store/413gmmlyhiik8ckxbhm8wvk0fqc3nclh-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc.orig /tmp/ghc
[nix-shell:~/patchelf]$ src/patchelf --set-rpath /foo/bar/lol /tmp/ghc
[nix-shell:~/patchelf]$ src/patchelf --set-rpath /foo/bar/lol:/foo/bar/lol:/foo/bar/lol:/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol /tmp/ghc
[nix-shell:~/patchelf]$ ldd /tmp/ghc
Segmentation fault (core dumped)

The symptoms are exactly the same though, so I suspect there's a latent bug that happened to surface with 0.12 for some reason.

@delroth
Contributor

delroth commented Sep 19, 2020

I've now tried running exactly the patchelf command that happens during the ghc build, with patchelf 0.9, 0.10, 0.11 and 0.12.

In all 4 cases, the resulting binary segfaults with the exact same symptoms. I'm puzzled as to how this ever worked frankly. patchelf is broken, but this is not new brokenness.

@delroth
Contributor

delroth commented Sep 19, 2020

I finally managed to get a test case which passes on 0.11 and fails on 0.12:

cp ../ghc-8.6.5/ghc/stage2/build/tmp/ghc-stage2 /tmp/ghc
src/patchelf --replace-needed libncursesw.so.5 libncurses.so --replace-needed libtinfo.so libtinfo.so.5 --interpreter /nix/store/mj4hk2z68aqcxpl8nr0an5gspbz69gvv-glibc-2.31/lib/ld-linux-aarch64.so.1 /tmp/ghc
strip /tmp/ghc
src/patchelf --set-rpath '/nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib:/nix/store/qsxgr8vk6y8m95r7jf3qxrkz648g8p91-gmp-6.2.0/lib:$ORIGIN/../haskeline-0.7.4.3:$ORIGIN/../stm-2.5.0.0:$ORIGIN/../ghc-8.6.5:$ORIGIN/../terminfo-0.4.1.2:$ORIGIN/../process-1.6.5.0:$ORIGIN/../hpc-0.6.0.3:$ORIGIN/../ghci-8.6.5:$ORIGIN/../transformers-0.5.6.2:$ORIGIN/../template-haskell-2.14.0.0:$ORIGIN/../pretty-1.1.3.6:$ORIGIN/../ghc-heap-8.6.5:$ORIGIN/../ghc-boot-8.6.5:$ORIGIN/../ghc-boot-th-8.6.5:$ORIGIN/../directory-1.3.3.0:$ORIGIN/../unix-2.7.2.2:$ORIGIN/../time-1.8.0.2:$ORIGIN/../filepath-1.4.2.1:$ORIGIN/../binary-0.8.6.0:$ORIGIN/../containers-0.6.0.1:$ORIGIN/../bytestring-0.10.8.2:$ORIGIN/../deepseq-1.4.4.0:$ORIGIN/../array-0.5.3.0:$ORIGIN/../base-4.12.0.0:$ORIGIN/../integer-gmp-1.0.2.0:$ORIGIN/../ghc-prim-0.5.3:$ORIGIN/../rts' /tmp/ghc
ldd /tmp/ghc

Using this I bisected the patchelf change to 0470d6921b5a3fe8e92e356c8e11d120dbbb06c0 which is indeed the 4K->64K alignment change on ARMv8. I still suspect this is a completely unrelated patchelf issue that was only narrowly avoided by luck before, given that the GHC stage2 binary is itself aligned to 64K originally.

Stripping was necessary to reproduce the failure in my test case, so maybe disabling stripping is all that's needed to luck into making this work again. I'm trying a build with dontStrip = true to see if that does indeed do something.

delroth added a commit to delroth/nixpkgs that referenced this issue Sep 19, 2020
There seems to be a latent bug with strip + patchelf on the GHC stage2
binary which got triggered again by patchelf 0.12. Until this is
properly fixed in patchelf, re-disable stripping.

Fixes NixOS#97407.
@delroth
Contributor

delroth commented Sep 19, 2020

dontStrip = true in the ghc865Binary derivation seems to be a good enough workaround to produce a valid pandoc binary down the line, so I would suggest merging that workaround for now. This is what was in place before it was removed in b930b2d, and the comment in the old code suggests someone hit that exact same problem with patchelf+strip. I sent out #98265.

@vcunat you said that this workaround (disabling stripping) didn't work for you earlier in the bug. I'm kind of confused because it clearly does on my box. Could you reconfirm?

@domenkozar
Member

domenkozar commented Sep 19, 2020

I still get segfaults with that patch:

builder for '/nix/store/h1rf84jdgm54cwgawahbfd7irhk3sw43-happy-1.19.12.drv' failed with exit code 139; last 10 log lines:
  building
  /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc-pkg: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc-pkg)
  /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc-pkg: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/../terminfo-0.4.1.2/libHSterminfo-0.4.1.2-ghc8.6.5.so)
  Preprocessing executable 'happy' for happy-1.19.12..
  Building executable 'happy' for happy-1.19.12..
  /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/../haskeline-0.7.4.3/libHShaskeline-0.7.4.3-ghc8.6.5.so)
  /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/../ghc-8.6.5/libHSghc-8.6.5-ghc8.6.5.so)
  /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc: /nix/store/mbd1w6i6b98wbh5pf0ikghs1fz5p24j7-ncurses-6.2-abi5-compat/lib/libtinfo.so.5: no version information available (required by /nix/store/9p5l3zcw3kgrhpx293z3yflpwlnfpd45-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/../terminfo-0.4.1.2/libHSterminfo-0.4.1.2-ghc8.6.5.so)
  [ 1 of 19] Compiling AbsSyn           ( src/AbsSyn.lhs, dist/build/happy/happy-tmp/AbsSyn.o )
  /nix/store/k832pghqg9z887j8py47ddhwzrn4yj1f-stdenv-linux/setup: line 1302:   249 Segmentation fault      (core dumped) ./Setup build
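
For readers decoding the log: an exit code above 128 conventionally means the process was killed by signal number (code - 128), so 139 corresponds to signal 11, which is SIGSEGV on Linux and matches the `Segmentation fault` line. A minimal sketch of that arithmetic, with `describe_status` as a made-up helper name:

```shell
# Made-up helper: explain a shell-style exit status.
describe_status() {
  if [ "$1" -gt 128 ]; then
    echo "killed by signal $(( $1 - 128 ))"
  else
    echo "exited with status $1"
  fi
}

describe_status 139   # the status reported for happy-1.19.12.drv
describe_status 1     # an ordinary failure, for comparison
```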

@delroth
Contributor

delroth commented Sep 19, 2020

The same derivation builds fine here:

~/nixpkgs$ nix-store -r /nix/store/h1rf84jdgm54cwgawahbfd7irhk3sw43-happy-1.19.12.drv
warning: you did not specify '--add-root'; the result might be removed by the garbage collector
/nix/store/88b13147iaaicc586a8421frv07c50d8-happy-1.19.12-data
/nix/store/9l9i919v6929i8drv07cc8nmn3f3hr17-happy-1.19.12

(On an EC2 Graviton2 instance, 4K page size.)

@domenkozar
Member

diff --git a/pkgs/development/compilers/ghc/8.6.5-binary.nix b/pkgs/development/compilers/ghc/8.6.5-binary.nix
index 41af279e83f..2bed9f017d3 100644
--- a/pkgs/development/compilers/ghc/8.6.5-binary.nix
+++ b/pkgs/development/compilers/ghc/8.6.5-binary.nix
@@ -55,6 +55,8 @@ stdenv.mkDerivation rec {
   nativeBuildInputs = [ perl ];
   propagatedBuildInputs = stdenv.lib.optionals useLLVM [ llvmPackages.llvm ];
 
+  dontStrip = true;
+
   # Cannot patchelf beforehand due to relative RPATHs that anticipate
   # the final install location/
   ${libEnvVar} = libPath;

@vcunat
Member Author

vcunat commented Oct 10, 2020

OK, I suspect the machine maintainers might be very busy, so in the meantime let's try cachix (but note this installation hint). For now I uploaded the finished step:

/nix/store/ky2gp7mg828vcygi9d82q2prqcxql1pw-ghc-8.6.5-binary

as all followup ones fail very fast and incomplete builds are probably not as easy to share (though I can try if you want them).

@lostnet
Contributor

lostnet commented Oct 10, 2020

Hi @vcunat, the only difference in ky2g.. seems to be the order in which things were added to the package.cache, which I hope doesn't matter.

Looking at HsColour, I think it builds twice with ghc865Binary, and the result is exactly the same except that the second time it can call itself to print out the warnings in color, etc. From the derivation in your output, I think it was building the second one, which might mean the problem is that the first copy was stripped.

Can you check that you have:
rmwamr0wx3hm2v9wi3si6bmxkjzc9gga-hscolour-1.24.4
but not:
vi594cv9f7gi3jl9pf4z3nqq7fzax8pw-hscolour-1.24.4

And if so add rmw..hscolour to cachix?
Thanks

@vcunat
Member Author

vcunat commented Oct 10, 2020

The rmw path is what fails to build (af7*.drv above). I'm pushing the partial build directory as /nix/store/w35dra6854p4iyxnkqlnzn0c5xkcna58-nix-build-hscolour-1.24.4.drv-1 and the log as /nix/store/84l9qcdg2zrx42fbmk0wwbd70lcj99pc-af7an5c65km09ssrk9axhdg6mmrag1rk-hscolour-1.24.4.log.

@lostnet
Contributor

lostnet commented Oct 11, 2020

Hi @vcunat, I understand a bit more of the context now from those logs, yet I'm not really sure whether the stage it faults at makes damage to a library not used in the earlier compiles a likely cause. I pushed a new branch with smaller rpaths and some printing of the ELF section layout to hopefully arrive at a determination. I also added ghc8.10.2Binary in a way that is running on the rock64 (currently in the stage1 build of ghc901).

If you could try ae3b2eb fafc65e (forgot to bring the new branch up to date) out for both ghc and haskell.compiler.ghc901 on the community server I would appreciate it!

@vcunat
Member Author

vcunat commented Oct 12, 2020

ghc looks the same to me at a quick glance: hscolour and happy segfault in ./Setup build

@lostnet
Contributor

lostnet commented Oct 12, 2020

Hi @vcunat, if you can add the 3 new logs (ghcbinary, happy, hscolour), I think there should be enough information for me to investigate what is unique/common at the points where it segfaults from tracing good builds.

On ghc901/rock64 it bootstraps fine; going on to xmonad fails since setlocale is whitelisting base <8.15. Whitelisting only up to the latest official release might be a common blocking point in libraries for 9.0.X? Yet, if this build doesn't fail on the community server, then I think using the 8.10.2Binary to bootstrap 8.10.2 (and earlier?) might work, and/or it might add to the comparisons that can be made to figure out the cause of the segfaults.

Thanks!

@vcunat
Member Author

vcunat commented Oct 12, 2020

/nix/store/d57r799wrbahbhimm5va3sxmyidhhy2s-73bb6b5kl0v4aqmaircfaba7635z2g4w-hscolour-1.24.4.log
/nix/store/mj1r2qjajxvdzm63yx04wbgmy28yf3gf-vhsm55j6inrpy6qcyif8yzzb0vayg2cw-ghc-8.6.5-binary.log
/nix/store/vk5yjmzspj09aiwg82gjnpdcsni4aalv-x84ly0ympwb59db12wipzinnmz139iv4-happy-1.19.12.log

BTW, ghc901 has passed those stages and it still keeps building the compiler.

@vcunat
Member Author

vcunat commented Oct 12, 2020

haskell.compiler.ghc901 build succeeded.

@lostnet
Contributor

lostnet commented Oct 13, 2020

I've started a build of ghc8102 bumped to depend on ghc8102Binary. Since there is a patch, it is technically imperfect, but it will probably work. Would it make sense to raise the default ghc to 8.10.2 now anyway? Then it could make sense to build ghc 8.10.2 with itself on aarch64 and 8.6.5Binary on the others.

@lostnet
Contributor

lostnet commented Oct 14, 2020

Hi @vcunat, 8.10.2Binary was able to bootstrap itself fine for me on aarch64 and x86_64, so I've put together an option for working around the problem by:

- Switching 8.10.2 to boot from its own binaries on all architectures.
- Making it the default ghc on aarch64.

If it looks like a reasonable option to you, please try out 2764755 for ghc and some ghc.* packages. I'm currently building with it up to xmonad-with-packages on the rock64.

Thanks,

@domenkozar
Member

We can't switch the default GHC by a major version since it's following the latest LTS stackage release.

@vcunat
Member Author

vcunat commented Oct 15, 2020

In the meantime, the machine finished haskellPackages.xmonad on 2764755.

@lostnet
Contributor

lostnet commented Oct 17, 2020

I think the explanation for the different behavior on the rock64 (Cortex-A53) and the ~64-core community machine is out-of-order execution triggering the SEGV when using a compiler <8.10.1. (This seems consistent with the segfaults happening with high probability in the first build with parallelizable work and, if not there, in the second such round.)

We could do various things to try to get 8.6/8.8 into the cache and I think they would work fine on small/old machines (though these should be becoming rare, i.e. the next in the rock64 line was the rock64Pro which has two A72 cores that could not be used.)

@lostnet
Contributor

lostnet commented Oct 18, 2020

Hmm, now that I said the rock64 is probably not going to hit these, I finally got segfaults on the rock64 by building ghc865.happy or ghc865.hscolour with /nix/store/349hpr41jk4s2g1naw8mpbdsdhkd47z8-ghc-8.6.5/bin/ghc from the cache (not ghc865Binary). Some logs look exactly like the ones on community, but in the core dump I could get, the timing seems to be different:

<no location info>: warning: [-Wmissing-home-modules]
    These modules are needed for compilation but not listed in your .cabal file's other-modules:
...
        Language.Haskell.HsColour.Options
        Language.Haskell.HsColour.Output
        Language.Haskell.HsColour.TTY
Segmentation fault

Reading symbols from Setup...
[New LWP 2743]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/mj4hk2z68aqcxpl8nr0an5gspbz69gvv-glibc-2.31/lib/libthread_db.so.1".
Core was generated by `./Setup build'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000ffffb88512e8 in kill ()
   from /nix/store/mj4hk2z68aqcxpl8nr0an5gspbz69gvv-glibc-2.31/lib/libc.so.6

(gdb) backtrace
#0  0x0000ffffb88512e8 in kill ()
   from /nix/store/mj4hk2z68aqcxpl8nr0an5gspbz69gvv-glibc-2.31/lib/libc.so.6
#1  0x0000000000405130 in exitBySignal (sig=sig@entry=11) at rts/RtsStartup.c:597
#2  0x000000000125dbe4 in shutdownHaskellAndSignal (sig=11, fastExit=<optimized out>)
    at rts/RtsStartup.c:562
#3  0x00000000011c7fe0 in ?? ()

(gdb) info threads
  Id   Target Id                        Frame
* 1    Thread 0xffffb8b63010 (LWP 2743) 0x0000ffffb88512e8 in kill ()
   from /nix/store/mj4hk2z68aqcxpl8nr0an5gspbz69gvv-glibc-2.31/lib/libc.so.6

@lostnet
Contributor

lostnet commented Oct 19, 2020

Hi @vcunat if you can try out 4d79bc6 that would be great. It should boot ghc884 using 8.10.2Binary. On the rock64 it is building the stage2 now so I'm not sure what else might be wrong, but whether 8.8 gets a SEGV when building itself in stage2 is probably the most important factor.
Thanks!

@vcunat
Member Author

vcunat commented Oct 20, 2020

haskellPackages.xmonad built without issues.

@peti
Member

peti commented Jan 18, 2021

Can we close this issue?

@vcunat vcunat closed this as completed Jan 18, 2021